What Twitter and everyone else needs right now - distributed programming tools

 

We've gone through our various databases, caches, web servers, daemons, and despite some increased traffic activity across the board, all systems are running nominally. The truth is we're not sure what's happening.

The quote and image above are from the much talked-about Twitter blog post. Alex Payne has a longer post detailing their architecture issues here (Techmeme discussion).

The problem Twitter's dev team is doing right now in diagnosing their problems is something I've been thinking about for some time. As a programmer, your top diagnostic tools are probably your debugger, your profiler and a logging system. Now, take a look at the simplified architecture diagram I created below of a typical high traffic web application (note my complete lack of artistic skill).

image

Notice anything wrong? The tools that we just spoke about help us only on each individual node. They do a terrible job of traversing program flow across machines. Given that any self-respecting service is running on multiple machines, most dev teams find out the hard way that tracking down distributed bugs is very hard. Take Twitter for example - by their own admission, their problems lie not in the boxes themselves but in the arrows between them. This problem is compounded due to the fact that your app code is a very small part of this system. It may be possible to build extensive logging into your code but having to trace values across memcached/Cacheman and MySql/Sql Server and your load balancer is non-trivial, if not impossible.

I think there's a tremendous opportunity in solving this problem. Unfortunately, I don't know what exact shape or form it will take (I do have a few guesses). Here's my wishlist for what such a solution should have

I haven't found any current system which does all the above to my liking. I'm writing some code in my spare time to get some of the above but it is a daunting task. The closest I've seen anyone come is the XTrace project from Berkeley's RAD labs. The following excerpt from the paper explains how it works

A user or operator invokes X-Trace when initiating an application task (e.g., a webrequest), by inserting X-Trace metadata with a task identifier in the resulting request. This metadata is then propagated down to lower layers through protocol interfaces (which may need to be modified to carry X-Trace metadata), and also along all recursive requests that result from the original task. This is what makes X-Trace comprehensive; it tags all network operations resulting from a particular task with the same task identifier.

The problem with systems like X-Trace is that they require extensive modifications to database servers, your web server, your cache server, etc. Such a task is not easy, especially given the wide array of choices you have in each of those. However, X-Trace is promising and I know that they've carried out some large scale tests with good results.

Does anyone have pointers to tools and solutions which can help with the above?


Comments:
For starters, there is Microsoft Research: https://research.microsoft.com/sr/

Their paper on instrumentation in distributed systems: http://research.microsoft.com/research/pubs/view.aspx?pubid=1811

If you don't have an ACM account yet, get one and search on portal.acm.org

I vaguely recall there were quite a few papers on this back when the Amoeba OS was all the research rage, too.

But that's all I know of - papers and research implementations. I don't think there are any real-world solutions yet. So I guess your best bet is working with the MS Research guys to get early access to their tools.
 
Erlang has a pretty extensive library for distributed work but I don't know what debugging it has...however, if I were building a distributed system like Twitter, Erlang is what I would use, no question. It's the only system that's built from the ground up to do this sort of thing.
 
The E developers have Causeway a distributed causality debugger.
 
No, what the world needs is _decentralized_ programming tools.

Distributed, centralized systems just aren't that interesting.
 
Hi Sriram,

Nice post! I developed a distributed programming tool that is open source at: http://www.codeplex.com/gsb

May or may not be modified to meet your requirements.

Cheers!
 
Check out Splunk. It's a database and search engine for IT data.

Before I started working a Splunk, I had my developers at Zoto start using it to eat our ZXTM, Apache, and Lightty logs including tracebacks from our code which is written in Python/Twisted.
 
hey,
i agree that its pretty sticky situation for twitter. But there's definitely no dearth of ideas or redesigns or talent out there.

i've made a post on how erlang comes with several nifty utilities, as well as my experiences in distributing tasks in erlang.


Keep Clicking,
Bhasker V Kode
 
Post a Comment



<< Home

Archives

November 2004   January 2006   June 2006   July 2006   August 2006   September 2006   October 2006   November 2006   December 2006   January 2007   February 2007   March 2007   April 2007   May 2007   June 2007   July 2007   August 2007   September 2007   October 2007   December 2007   January 2008   February 2008   March 2008   April 2008   May 2008   June 2008