What Twitter and everyone else needs right now - distributed programming tools

We've gone through our various databases, caches, web servers, daemons, and despite some increased traffic activity across the board, all systems are running nominally. The truth is we're not sure what's happening.
The quote and image above are from the much talked-about Twitter blog post. Alex Payne has a longer post detailing their architecture issues here (Techmeme discussion).
The difficulty Twitter's dev team is having right now in diagnosing their problems is something I've been thinking about for some time. As a programmer, your top diagnostic tools are probably your debugger, your profiler and a logging system. Now, take a look at the simplified architecture diagram I created below of a typical high-traffic web application (note my complete lack of artistic skill).

Notice anything wrong? The tools we just talked about help us only on each individual node. They do a terrible job of following program flow across machines. Given that any self-respecting service runs on multiple machines, most dev teams find out the hard way that tracking down distributed bugs is very hard. Take Twitter for example - by their own admission, their problems lie not in the boxes themselves but in the arrows between them. The problem is compounded by the fact that your app code is a very small part of this system. It may be possible to build extensive logging into your own code, but tracing values across memcached/Cacheman, MySQL/SQL Server and your load balancer is non-trivial, if not impossible.
I think there's a tremendous opportunity in solving this problem. Unfortunately, I don't know what exact shape or form the solution will take (I do have a few guesses). Here's my wishlist for what such a solution should have:
- Distributed debugging: I want the ability to seamlessly debug from one machine to another, moving between web frontend code, SQL stored procedures, any messaging infrastructure and back. For places where I can't debug, I want to see all data going in and coming out.
- Establishing causality: For post-mortem distributed debugging, you are always trying to establish causality - what event led to what other event. Such a system should log enough data to establish causal relationships between the various events in the system. For a particular request or stack trace, I should be able to go back and look at the code flow across machines for that request.
- Distributed profiling: Anyone who has maintained a public-facing site has seen unexplained slowdowns which are hard to debug. In the very early days of Popfly, we used to see an unexplained slowdown every day at lunchtime (we used to joke about the servers going out to lunch). A tool that constantly monitors server performance and, in case of a slowdown, shows you where time is spent across the system would be critical. Tools like Nagios do a great deal of monitoring for you but I haven't seen any tool do the latter.
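To make the causality and profiling items a little more concrete, here is a minimal Python sketch of the kind of post-mortem analysis I have in mind. Everything in it is an assumption for illustration: the Event record, its field names and the sample data are hypothetical, and it presumes every node already logs a shared request id plus the id of the event that caused each event. Durations are measured locally on each node, so they survive clock skew; ordering siblings by start time assumes roughly synchronized clocks, and the self-time figure assumes child events do not overlap.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    request_id: str           # shared by every event caused by one user request
    event_id: str
    parent_id: Optional[str]  # event that caused this one; None for the root
    node: str                 # machine or component that logged the event
    start_ms: int
    end_ms: int


def causal_chain(events, request_id):
    """Return (depth, event) pairs for one request, parents before children."""
    children = defaultdict(list)
    roots = []
    for e in events:
        if e.request_id != request_id:
            continue
        if e.parent_id is None:
            roots.append(e)
        else:
            children[e.parent_id].append(e)

    ordered = []

    def walk(event, depth):
        ordered.append((depth, event))
        for child in sorted(children[event.event_id], key=lambda c: c.start_ms):
            walk(child, depth + 1)

    for root in sorted(roots, key=lambda r: r.start_ms):
        walk(root, 0)
    return ordered


def self_time_by_node(events, request_id):
    """Time per node, excluding time spent waiting on direct child events."""
    mine = [e for e in events if e.request_id == request_id]
    child_time = defaultdict(int)
    for e in mine:
        if e.parent_id is not None:
            child_time[e.parent_id] += e.end_ms - e.start_ms
    totals = defaultdict(int)
    for e in mine:
        totals[e.node] += (e.end_ms - e.start_ms) - child_time[e.event_id]
    return dict(totals)


# Hypothetical log records merged from three machines.
EVENTS = [
    Event("r1", "e1", None, "web", 0, 120),
    Event("r1", "e2", "e1", "cache", 5, 10),
    Event("r1", "e3", "e1", "db", 12, 110),
    Event("r2", "e4", None, "web", 3, 20),
]

for depth, event in causal_chain(EVENTS, "r1"):
    print("  " * depth + f"{event.node}: {event.end_ms - event.start_ms} ms")
print(self_time_by_node(EVENTS, "r1"))
```

The analysis itself is the easy part. The hard part is getting every box in the diagram to emit records like these in the first place.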
I haven't found any current system which does all of the above to my liking. I'm writing some code in my spare time to get some of it, but it is a daunting task. The closest I've seen anyone come is the X-Trace project from Berkeley's RAD Lab. The following excerpt from the paper explains how it works:
A user or operator invokes X-Trace when initiating an application task (e.g., a web request), by inserting X-Trace metadata with a task identifier in the resulting request. This metadata is then propagated down to lower layers through protocol interfaces (which may need to be modified to carry X-Trace metadata), and also along all recursive requests that result from the original task. This is what makes X-Trace comprehensive; it tags all network operations resulting from a particular task with the same task identifier.
The problem with systems like X-Trace is that they require extensive modifications to database servers, your web server, your cache server, etc. Such a task is not easy, especially given the wide array of choices you have in each of those. However, X-Trace is promising and I know that they've carried out some large scale tests with good results.
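The propagation idea in the excerpt can be sketched in a few lines of Python. This is only an illustration of tagging every operation with one task identifier, not X-Trace's actual metadata format or API: the header name, the in-memory REPORTS list standing in for a report collector, and the three layer functions are all hypothetical.

```python
import uuid

TRACE_HEADER = "X-Task-Id"  # hypothetical header name, not X-Trace's wire format

REPORTS = []  # stands in for a central collector of (task_id, layer, operation)


def report(task_id, layer, operation):
    REPORTS.append((task_id, layer, operation))


def start_task(headers=None):
    """Tag a request with a task id unless an upstream layer already did."""
    headers = dict(headers or {})
    headers.setdefault(TRACE_HEADER, uuid.uuid4().hex)
    return headers


def read_cache(headers, key):
    # The cache layer only reports and forwards the id it was handed.
    report(headers[TRACE_HEADER], "cache", f"get {key}")
    return None  # simulate a cache miss


def query_database(headers, sql):
    report(headers[TRACE_HEADER], "db", sql)
    return []


def handle_web_request(headers, user):
    headers = start_task(headers)
    task_id = headers[TRACE_HEADER]
    report(task_id, "web", f"GET /timeline/{user}")
    if read_cache(headers, f"timeline:{user}") is None:
        query_database(headers, "SELECT * FROM tweets WHERE user = ?")
    return task_id


task = handle_web_request({}, "alice")
for task_id, layer, operation in REPORTS:
    print(task_id, layer, operation)
```

In this toy version every layer is my own code, so passing the headers along is trivial. In a real deployment each of those functions is a separate product with its own protocol, which is exactly where the modification cost comes from.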
Does anyone have pointers to tools and solutions which can help with the above?
Their paper on instrumentation in distributed systems: http://research.microsoft.com/research/pubs/view.aspx?pubid=1811
If you don't have an ACM account yet, get one and search on portal.acm.org
I vaguely recall there were quite a few papers on this back when the Amoeba OS was all the research rage, too.
But that's all I know of - papers and research implementations. I don't think there are any real-world solutions yet. So I guess your best bet is working with the MS Research guys to get early access to their tools.
Distributed, centralized systems just aren't that interesting.
Nice post! I developed a distributed programming tool that is open source at: http://www.codeplex.com/gsb
May or may not be modified to meet your requirements.
Cheers!
Before I started working at Splunk, I had my developers at Zoto start using it to eat our ZXTM, Apache, and Lighttpd logs, including tracebacks from our code, which is written in Python/Twisted.
I agree that it's a pretty sticky situation for Twitter. But there's definitely no dearth of ideas, redesigns or talent out there.
I've made a post on how Erlang comes with several nifty utilities, as well as my experiences in distributing tasks in Erlang.
Keep Clicking,
Bhasker V Kode

