Open source and scratching itches in the cloud
[Note: I happen to work at Microsoft, but please read and understand the disclaimer at the bottom. This is just the geek in me talking. I'll get very annoyed if someone links to this with the heading "Microsoft employee says...". I don't even work in the Live org, so don't even start thinking that this represents any future direction/future products from Microsoft. I work in Developer Division, so if I ever say that we're going to bring back Algol-60, then you're onto something ;-)]
I read a couple of posts today which made me think about open source in Web 2.0 - the first is Tim O'Reilly on Yahoo supporting Hadoop and the other is Dare on 'Free Data'. Tim has this interesting factoid about Nutch in his post:
"...Some years ago, I was on the board of Doug's open source search engine effort, Nutch. Where the project foundered was in not having a large enough data set to really prove out the algorithms. Having more than a couple of hundred million pages in the index was too expensive for a non-profit open source project to manage. One of the important truths of Web 2.0 is that it ain't the personal computer era any more, Eben Moglen's arguments to the contrary notwithstanding. A lot of really important software can't even be exercised properly without very large networks of machines, very large data sets, and heavy performance demands..."
And from Dare:
"...This leads to a tendency for the rich to get richer because since they have the most data they provide the most value for end users (e.g. Amazon). Another problem is that social software leads to lock-in. My buddy list on Windows Live Messenger and my list of friends in Facebook are useless to me outside the context of these applications. Although I can get all of my history and data out of these services, I lose the value I get from the fact that all my friends use these services as well...."
Based on the above (which I agree with), I'm going to posit the following.
- Most of the value in Web 2.0 comes from data (generated by lots of users), e.g. del.icio.us, Digg, Facebook
- It is beyond the financial means of any one person to run large scale server software, e.g. Gmail, Google Search, Live, Y! Search
- Companies will fight hard to keep their users (and their data) in their system as that's what will keep them alive. You need the ad money from your users to pay your bills. At the least, moving to a competing service will not be painless. (pretty much any service which requires a log in)
Hacking the code
Now, let's take these three and juxtapose them with open source and in particular, where open source came from. For me, the powerful idea about open source has always been geeks scratching their itches. Find a bug in Emacs? Check out the code from whichever CVS server it is on, fix the bug and then do the configure-make-make install dance. You could submit your change back as well (though YMMV on whether it gets accepted or not).
The same is not possible with any large cloud-based server. If you find something you dislike in Gmail or Windows Live Hotmail which can't be fixed in the browser through some GreaseMonkey magic, you're stuck. The same goes for Facebook, Digg, etc. [1] If you have an 'itch' to scratch (and you are one of those people who wouldn't use Microsoft software if there was only one PC left in the world and it ran Windows <grin/>), you probably have to contribute at one of these levels:
1. Change things in Linux/*BSD which powers some of these sites
2. Change things in Apache/Lighttpd/Mongrel/memcached/MySQL
3. Change things in PHP/Ruby/Python
4. Change things using the API provided by the site.
There's a huge gap between #3 and #4 there - that's where Facebook's or Google's backend code lives (or the Windows Live code, if you swap in the stuff that the Live guys use). This is code that you don't get to see or modify and that you have pretty much no influence over.
Code might be data but I'd rather just have data
The above problem is solvable if Google and Facebook and Microsoft want to solve it - they could go the Slashdot/Wikipedia way and actually put their backend code up on CodePlex or SourceForge. But you'll probably soon find that the code doesn't have as much value without the data to back it up.
Example - look at MySpace. It would be possible for a bunch of coders to take MySpace's backend and create a modified version of MySpace in a few days. However, it is useless without the data of the millions of users who make up MySpace. Obviously, this is not data that should fall into Joe RandomGeek's hands without the consent of the users themselves. You can see how this soon turns into a very hard problem to solve.
The only site I know which gives away both code and data is Wikipedia. It is theoretically possible to take MediaWiki and the Wikipedia dumps and have the equivalent of the CVS checkout I first talked about. You are free to hack on MediaWiki and since you have the data, you are free to do whatever you want with it. I'm curious as to why there has not been a lot of interesting work done on top of Wikipedia's data - the only notable work I know of is Aaron Swartz's when he ran for office inside Wikipedia.
Even if you have both code and data, you run into the third and arguably the toughest problem:
It's all about the hardware, stupid
Your typical long-haired geek probably can't get his hands on the following from the comfort of his bedroom [2]:
- Geo-distributed datacenters which have scary electrical, networking, legal, social and financial design issues to worry about.
- Large scale blob storage (S3, Google File System)
- Large scale structured storage (Google's Bigtable)
- Tools to run code across such infrastructure (MapReduce, Dryad).
Constructing one by yourself is out of the question (unless you are an expert on topics like air conditioning and setting up multiple diesel generators which are better than these). The other option is to shell out hard cash to get some racks in someone else's datacenter. And *that* costs a lot more than what your average college-going hacker (don't you love stereotypes?) can afford. Or even what I can afford :-)
Possible answers
I'm not really saying 'gloom and doom' here. I think we'll see consolidation and commodification of several of the above pieces.
"Hmm... so you want a few thousand machines. x64 boxes with oodles of RAM, you say? In Asia as well as Europe? OK - one order coming right up. Would you like some chips with that, Sir?"
As a geek, it is mouth watering to think of the possibilities when that eventually happens. In reality, instead of specifying machines, you'll probably just call some service and the magic (how many machines, which data center to serve the request from) will happen transparently on the backend.
This needs to happen at multiple levels. At the datacenter/geo-distribution level, Amazon's S3 and EC2 provide great features at a great price point. However, you still need to run your servers...today. A few years down the road, you could theoretically lease a bunch of machines from Google, Yahoo, Microsoft or Amazon and not need to worry about machines going down, coming up, failovers across continents - it'll 'just work'.
In terms of code, Doug Cutting's projects and projects like memcached mean that you already have a lot of backend code to play with. From the Microsoft side of things, you have everything from Windows Server to IIS to SQL Server which definitely do scale very well - Microsoft runs them *everywhere*.
The really tricky problem is data. No company wants to part with its data as it really is the 'crown jewels'. If an alternative social network sprang up which had all of Facebook's data and users, Facebook could go out of business very soon. This is where the 'Free Data' movement comes in (see Dare's post for more). I still don't see a good model of getting to everyone's data without trampling all over privacy and security. This may be a problem that never gets solved.
Whatever happens, there'll soon be a 'cloud software' equivalent to './configure, make, make install' and it is going to be fun to watch and see what it turns out to be. Until then, if you've got an itch, I'm afraid you can't scratch.
Notes:
1. LiveJournal might be an exception here
2. I'm not sure how far Doug Cutting's open source equivalents for these have come, so they might be an answer
Nice post. I
There are a few data sets that we can probably use:
- RSS Feeds (news and blogs)
- FOAF Profiles (not enough though)
- DBpedia.org is a community effort to extract structured information from Wikipedia
There may be others.
"The same is not possible with any large cloud-based server. If you find something you dislike in Gmail or Windows Live Hotmail which can't be fixed in the browser through some GreaseMonkey magic, you're stuck. The same goes for Facebook, Digg, etc 1. If you are have an 'itch' to scratch (and you are one of those people who wouldn't use Microsoft software if there was only PC left in the world and it ran Windows),"
With Ning (http://www.ning.com), you can get access to the full source code if you want (or if you're not a programmer you can skip this option). Then you can modify anything you want.
Then, regarding, "The really tricky problem is data. No company wants to part with their data"
Every single function and piece of data in the Ning social network is also available as a set of fully public APIs, authenticated for access by the owner of course (see http://developer.ning.com/2007/07/07/one_small_rest_call_for_man2c_one_giant_api_for_mankind/), and they're (and always will be) fully supported by us because we've built the network using them -- there's no secret sauce. You can read more at the developer blog and other places in Ning.
Given that you've clearly thought this topic through I'd be interested to know what you think -- please let me know!
thanks,
diego
I'm still not sold on the extent of features you could build without having *all* the data in the system (rather than just the stuff you own). Example - a long time ago, Orkut built a 'Related Communities' feature by doing some collaborative filtering on top of the Orkut data. Let's say you want to build the same thing on Facebook - you can't do it unless you know what groups everyone is a part of. And if some site has a privacy setting which lets someone hide that, then you're stuck.
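To make that concrete, here is a rough sketch of the idea in Python. I don't know how Orkut actually did it; this is plain co-membership counting, which is about the simplest form of collaborative filtering, and the function name `related_communities` and all the membership data are made up for illustration.

```python
from collections import Counter


def related_communities(memberships, community, top_n=3):
    """Rank other communities by how many members they share with `community`.

    `memberships` maps a user name to the set of communities that user is in.
    This is simple co-membership counting, an assumed stand-in for whatever
    Orkut really ran.
    """
    shared = Counter()
    for groups in memberships.values():
        if community in groups:
            for other in groups:
                if other != community:
                    shared[other] += 1
    # Sort by count (descending), then by name, so ties come out in a stable order.
    ranked = sorted(shared.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:top_n]]


# Hypothetical membership data: user -> communities
memberships = {
    "alice": {"emacs", "lisp", "linux"},
    "bob": {"emacs", "lisp"},
    "carol": {"emacs", "linux", "cricket"},
    "dave": {"cricket", "bollywood"},
}

# With everyone's memberships, the ranking has something to work with.
full = related_communities(memberships, "emacs")

# With only one user's slice of the data (here, dave's), there is nothing to rank.
own_only = related_communities({"dave": memberships["dave"]}, "emacs")
```

With the full table the function can rank communities against 'emacs'; with a single user's slice it returns an empty list, which is the whole problem in miniature.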
Having said that, Ning is definitely doing something interesting.
Having said that, the facebook *platform* is significantly different, albeit scary... take a look at it, might be something you like...
Now at utility computing startup 3tera, I've considered creating "project accounts" for some friends in the OS community. A place where ad hoc teams could store data repositories and run test suites. You've got me curious whether this would actually work and I'll have to go ponder the possibility . . .

