The best compliment that Popfly has ever received

Lots of people have said lots of things about Popfly. One of my colleagues forwarded an article (you need to be a subscriber to view it) by Chris Pirillo in Computer Power User magazine where he says a bunch of nice things about Popfly (thanks Chris!). However, he wraps it up by saying something about Popfly which definitely has to be a first.

"..Even Paris Hilton could use Popfly.."

Forget the dog and the duck and all our current branding. I think we should use this as our tagline from now on. :-)


 

Open source and scratching itches in the cloud

[Note: I happen to work at Microsoft but do read and understand the disclaimer at the bottom. This is just the geek in me talking. I'll get very annoyed if someone links to this with the heading "Microsoft employee says...". I don't even work in the Live org so don't even start thinking that this represents any future direction/future products from Microsoft. I work in Developer Division so if I ever say that we're going to bring back Algol-60, then you're onto something ;-)]

I read a couple of posts today which made me think about open source in web 2.0 - the first is Tim O'Reilly on Yahoo supporting Hadoop and the other is Dare on 'Free Data'. Tim has this interesting factoid about Nutch in his post

"...Some years ago, I was on the board of Doug's open source search engine effort, Nutch. Where the project foundered was in not having a large enough data set to really prove out the algorithms. Having more than a couple of hundred million pages in the index was too expensive for a non-profit open source project to manage. One of the important truths of Web 2.0 is that it ain't the personal computer era any more, Eben Moglen's arguments to the contrary notwithstanding. A lot of really important software can't even be exercised properly without very large networks of machines, very large data sets, and heavy performance demands..."

And from Dare

"...This leads to a tendency for the rich to get richer because since they have the most data they provide the most value for end users (e.g. Amazon). Another problem is that social software leads to lock-in.  My buddy list on Windows Live Messenger and my list of friends in Facebook are useless to me outside the context of these applications. Although I can get all of my history and data out of these services, I lose the value I get from the fact that all my friends use these services as well...."

Based on the above (which I agree with), I'm going to posit the following.

  1. Most of the value in Web 2.0 comes from data (generated by lots of users). E.g. del.icio.us, Digg, Facebook
  2. It is beyond the financial means of any one person to run large-scale server software. E.g. Gmail, Google Search, Live, Y! Search
  3. Companies will fight hard to keep their users (and their data) in their system as that's what will keep them alive. You need the ad money from your users to pay your bills. At the least, moving to a competing service will not be painless. (pretty much any service which requires a log in)

Hacking the code

Now, let's take these three and juxtapose them with open source and in particular, where open source came from. For me, the powerful idea about open source has always been geeks scratching their itches. Find a bug in Emacs? Check out the code from whichever CVS server it is on, fix the bug and then do the configure-make-make install dance. You could submit your change back as well (though YMMV on whether it gets accepted or not).

The same is not possible with any large cloud-based server. If you find something you dislike in Gmail or Windows Live Hotmail which can't be fixed in the browser through some GreaseMonkey magic, you're stuck. The same goes for Facebook, Digg, etc. (see note 1). If you have an 'itch' to scratch (and you are one of those people who wouldn't use Microsoft software if there was only one PC left in the world and it ran Windows <grin/>), you probably have to contribute at one of these levels

1. Change things in Linux/*BSD which powers some of these sites

2. Change things in Apache/Lighttpd/Mongrel/memcached/MySQL

3. Change things in PHP/Ruby/Python

4. Change things using the API provided by the site. 

There's a huge gap between #3 and #4 there - that's where the Facebook backend code lives or Google's backend code (or Windows Live code if you swap in the stuff that the Live guys use). This is code that you don't get to see or modify and you pretty much have no influence over.

Code might be data but I'd rather just have data 

The above problem is solvable if Google and Facebook and Microsoft want to solve it - they could go the Slashdot/Wikipedia way and actually put their backend code up on CodePlex or SF. But you'll probably soon find that the code doesn't have as much value without the data to back it up.

Example - look at MySpace. It would be possible for a bunch of coders to take MySpace's backend and create a modified version of MySpace in a few days. However, it is useless without the data of the millions of users who make up MySpace. Obviously, this is not data that should fall into Joe RandomGeek's hands without the consent of the users themselves. You can see how this soon turns into a very hard problem to solve.

The only site I know which gives away both code and data is Wikipedia. It is theoretically possible to take MediaWiki and the Wikipedia dumps and have the equivalent of the CVS checkout I first talked about. You are free to hack on MediaWiki and since you have the data, you are free to do whatever you want with it. I'm curious as to why there has not been a lot of interesting work done on top of Wikipedia's data - the only notable work I know of is Aaron Swartz's when he ran for office inside Wikipedia.

Even if you have both code and data, you run into the third and arguably the toughest problem

It's all about the hardware, stupid

Your typical long-haired geek probably can't get his hands on the following from the comfort of his bedroom 2

Constructing one by yourself is out of the question (unless you are an expert on topics like air conditioning and setting up multiple diesel generators which are better than these). The other option is to shell out hard cash to get some racks on someone else's datacenter. And *that* costs a lot more than what your average college-going hacker (don't you love stereotypes?) can afford. Or even what I can afford :-)

Possible answers

I'm not really saying 'gloom and doom' here. I think we'll see consolidation and commodification of several of the above pieces.

"Hmm..so you want a few thousand machines. x64 boxes with oodles of RAM, you say? In Asia as well as Europe? Ok - one order coming right up. Would you like some chips with that, Sir?"

As a geek, it is mouth-watering to think of the possibilities when that eventually happens. In reality, instead of specifying machines, you'll probably just call some service and the magic (how many machines, which data center to serve the request from) will happen transparently on the backend.

This needs to happen at multiple levels. At the datacenter/geo-distribution level, Amazon's S3 and EC2 provide great features at a great price point. However, you still need to run your servers...today. A few years down the road, you could theoretically lease a bunch of machines from Google, Yahoo, Microsoft or Amazon and not need to worry about machines going down, coming up, failovers across continents - it'll 'just work'.

In terms of code, Doug Cutting's projects and projects like memcached mean that you already have a lot of backend code to play with. From the Microsoft side of things, you have everything from Windows Server to IIS to SQL Server which definitely do scale very well - Microsoft runs them *everywhere*.

The really tricky problem is data. No company wants to part with its data as it really is the 'crown jewels'. If an alternative social network sprang up which had all of Facebook's data and users, Facebook could go out of business very soon. This is where the 'Free Data' movement comes in (see Dare's post for more). I still don't see a good model of getting to everyone's data without trampling all over privacy and security. This may be a problem that never gets solved.

Whatever happens, there'll soon be a 'cloud software' equivalent to './configure, make, make install' and it is going to be fun to watch and see what it turns out to be. Until then, if you've got an itch, I'm afraid you can't scratch.

 

Notes:

1. LiveJournal might be an exception here

2. I'm not sure how far Doug Cutting's open source equivalents for these have come along - so that might be an answer

