P0

I recently completed a year at Microsoft. I thought I’d take the chance to write about some of the fun anecdotes from my one year here. Blood, sweat, screams, tears and joy – I’ve lived through them all in my short stint here.

Nothing at Microsoft is feared as much as a P0 bug. P0, which stands for Priority 0, is the highest importance a bug can be given. You’ll usually see a P0 bug if there is a build break, i.e. for some reason, the build didn’t complete or the smoke tests on the new build didn’t pass.

For a Microsoft developer, life stops if you get a P0 bug assigned to you. To quote a colleague, “a P0 means that you drop whatever you’re doing, forget about the outside world, and stop breathing even. The only thing that matters in your life now is that P0 bug”.

As a Program Manager, dealing with coding bugs is usually not a part of my day-to-day work, so for quite a while I never had to deal with P0 bugs. Until one late night in August.

August 2005 was a stressful period. We were in the final stages of shipping Visual Studio 2005. This meant that we spent all our time fixing late bugs and cleaning up the product to get it out the door for its late October release.

At such a late point in the product’s cycle, there was always the fear that a showstopper bug might come at you from nowhere. Of course, there was no logical reason for this fear – the number of bugs being found was dropping drastically. But still, there was Murphy’s Law and the gnawing fear that something might dramatically fail at the last moment.

This particular night, I was up late at the office looking at some bug. Finally done with that bug, I packed up my stuff and prepared to head home when a developer on my team asked me if I was free. He had a problem – he was investigating a couple of P0 issues and all of a sudden, a third P0 issue had been assigned to him.

Assigning the bug to myself, I unpacked my bags again and pulled up to my computer. I wasn’t worried – a large number of P0 issues are false alarms caused by some weirdness in the machine running the test or in the automated test code. I’ve rarely seen P0 bugs due to actual issues in the code.

This bug sounded relatively benign. The toolbox was failing to load its controls when you opened a device project. My guess was that it was due to an unclean machine where someone hadn’t uninstalled the previous raw build cleanly.

Let me explain some terminology here. When we did Whidbey, we had two builds of Visual Studio. One was the layout build which we (and you) use with the nice graphical installer and everything. The other build was the ‘raw’ build which just registered the binaries, some debug tools and registry keys. With raw builds, you could build with debug information and with optimizations turned off, so you could debug your way through easily.

In this particular case, our internal testing tools had just installed the raw build on this machine and were running through their usual battery of tests. Everything was hunky-dory until they tried to create a device project, at which point everything came down in flames since there was nothing in the toolbox.

I poked around in the debugger for a bit. After 30 minutes or so, I knew why the toolbox wasn’t getting populated: the .NET Compact Framework wasn’t installed on the machine.

This was weird – NetCF gets installed along with the rest of Visual Studio and there was no reason for it not to be installed.

I broke into a cold sweat. Not finding NetCF was a pretty serious problem – it would render useless most of the device stuff. Had somebody checked in code which broke the setup? This was a ship stopper bug if I had ever seen one.

I poked through the logs of the raw build’s setup (essentially spew from a huge batch file) and the plot thickened. The log file indicated that the NetCF setup had been run just like the setup for every other part of Visual Studio. But for some weird reason, the files weren’t there. It was as if the setup had never happened. I looked through the event logs and they were a mystery – they told me that NetCF’s installation was kicked off. But for some reason, they never recorded the successful completion or any error.

I tried rerunning setup and voila! – NetCF installed perfectly and all tests passed successfully. I was confused – what had happened that first time? I tried rerunning setup a few times but couldn’t spot the problem. Beaten, I told the tester that I couldn’t ‘repro’ the bug and went home – but not before leaving a note in the bug asking people to assign any similar bugs to me. If this happened again, I wanted to know about it.

I didn’t have to wait long. 24 hours later, there was a mail in my inbox from another tester. A new build had exhibited the same issue. The toolbox wasn’t loading controls.

I plonked myself down and logged in remotely to the test machine. The scene of the crime was identical to the last time. No NetCF. Logs indicating the successful installation of NetCF.

I started poking around in the event logs of the machine. What was happening in the machine around the time of the incident? Was there a hardware failure? Was there a network outage?

The event logs didn’t reveal much. However, one entry caught my eye – just before the entry documenting the NetCF setup, there was an entry about a Systems Management Server operation on the machine. SMS is how we deploy updates to Windows and other internal tools inside Microsoft, and such operations are very routine around here. What caught my eye was that I had seen a similar entry on the previous machine.

On both machines, SMS was doing something a few minutes before NetCF’s setup was kicked off.

This was the clincher – I now had a good hunch as to what was happening. I checked in a piece of code to add logging of NetCF’s setup (basically the verbose logging option of msiexec). My code now also checked whether the setup had failed and, if so, bailed out.
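For the curious, here is a rough sketch of that kind of change. It is in Python rather than the batch file the raw setup actually used, so treat it as an illustration and not the code that was checked in. The file names and the `install_or_bail` helper are made up; the `/i`, `/qn` and `/L*v` switches and the exit codes are standard Windows Installer ones.

```python
import subprocess

# Documented Windows Installer exit codes.
ERROR_SUCCESS = 0
ERROR_INSTALL_ALREADY_RUNNING = 1618  # another installation is in progress
ERROR_SUCCESS_REBOOT_REQUIRED = 3010


def build_install_command(msi_path, log_path):
    """Quiet install (/qn) with verbose logging (/L*v) to log_path."""
    return ["msiexec", "/i", msi_path, "/qn", "/L*v", log_path]


def install_or_bail(msi_path, log_path, run=subprocess.call):
    """Run one MSI install and raise instead of silently carrying on.

    `run` is a hypothetical hook so the logic can be tried without msiexec;
    it takes the command list and returns the process exit code.
    """
    code = run(build_install_command(msi_path, log_path))
    if code in (ERROR_SUCCESS, ERROR_SUCCESS_REBOOT_REQUIRED):
        return code
    if code == ERROR_INSTALL_ALREADY_RUNNING:
        raise RuntimeError(
            "another installation is already in progress (1618); see " + log_path
        )
    raise RuntimeError("setup failed with exit code %d; see %s" % (code, log_path))
```

The important part is the last few lines: a failed step stops the whole run and points at the log, instead of letting the tests start against a half-installed machine.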

Hardly a day later, my guess was vindicated. NetCF installation failed on another machine – but this time, my logs had documented the sequence of events.

Those of you who know the innards of msiexec will know that only one setup can happen at any point in time. Msiexec.exe tries to grab a global mutex – if another instance is already holding the mutex, it fails. The reasoning is simple – you don’t want two setup programs making changes to the system at the same time, as they could trample all over each other.

Visual Studio 2005 setup is made up of a few monolithic MSIs and smaller MSIs such as NetCF which are installed as part of the installation process. There is no one huge Visual Studio 2005 .msi file.

Some readers might have spotted the ending now. What was happening was a freak occurrence. In the infinitesimally small time gap between the end of the previous msi’s setup and the beginning of NetCF’s setup, the machine was unlucky enough to encounter a setup request from Systems Management Server. This was causing the NetCF installation to fail and things went rapidly downhill from there on. The difference this time was that my log files had faithfully recorded all this.
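The race is easier to see in a toy model. The sketch below is only a simulation: a `threading.Lock` stands in for the installer's global mutex, and the package names, the `intruder_before` parameter and the "SMS" log entry are invented for illustration. It assumes, as described above, that an install which cannot get the lock fails straight away instead of waiting.

```python
import threading


def try_install(lock, name, log):
    """Attempt one install; fail immediately if the lock is already held."""
    if not lock.acquire(blocking=False):
        log.append(name + ": FAILED, another install is running")
        return False
    try:
        log.append(name + ": installed")
        return True
    finally:
        lock.release()


def chained_setup(packages, intruder_before=None):
    """Install packages one after another, releasing the lock between them.

    If `intruder_before` names a package, an outside installer grabs the
    lock in the gap just before that package starts.
    """
    lock = threading.Lock()  # stand-in for the global installer mutex
    log = []
    for name in packages:
        if name == intruder_before:
            lock.acquire()  # the outsider wins the gap between two MSIs
            log.append("SMS: holding the installer lock")
            try_install(lock, name, log)
            lock.release()
        else:
            try_install(lock, name, log)
    return log
```

Calling `chained_setup(["core", "netcf", "docs"], intruder_before="netcf")` gives a log in which everything installs except NetCF, which is roughly what the test machines were seeing.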

Whew!

I shot off a mail to the folks running the build machines to make sure the SMS updates happened before setup was kicked off, and we thankfully never had to deal with this again. We never did find out why this bug had never occurred earlier with other parts of setup. My guess is that something in SMS was changed.

Moral of the story – I really don’t have a moral actually. Except maybe, that shipping a product is an incredibly fun and stressful experience all at once.

 

The Microsoft Device Emulator goes shared source!

At BarCamp yesterday, someone asked about the ability to see the source code for the emulator. I was *so* tempted to talk about this but restrained myself.

Well - no more restraining myself.

The Shared Source Device Emulator 1.0 is now available for download under this license

Check out Barry's blog post on the emulator at http://blogs.msdn.com/barrybo/archive/2006/07/17/668492.aspx

Note: You need to download libpng and zlib separately and extract them into the right folders before you can build. Read the 'How to Build' file in the docs folder.


 

Bar Camp Musings



Two talks on J2EE, one talk on Opera and Atul Chitnis doing his talk on a Fedora Core machine. Sounds like normal Barcamp material, right? Yes - except that this was all on Microsoft premises here in Hyderabad.

Barcamp Hyderabad 2 was quite fun. We kicked off the day with a talk from Pramati's CEO on the new world of mobility. I was late to this talk (I was upstairs in my office working on my slides :-) ).

When I eventually found my way downstairs, I was greeted by the sight of a Fedora Core laptop (this one) booting up on the big display. Definitely not a sight you see around here too often :-).

Atul Chitnis did a talk on... well... why mobile devices are so huge. And why we need to stop thinking in terms of the PC and even to stop using the word 'computing'. Backed up by Larry Lessig-esque slides [PDF] (Atul told me that he even got the same font that Lessig uses - from Lessig himself!), this talk was probably the best of the day.

His essential argument revolved around how mobile devices adapt to how you want to use them - rather than how they want to be used. I didn't buy his arguments about screen displays - I firmly believe that large screen monitors increase productivity if you're trying to do work and are awesome to watch movies on.

Apart from that, I liked this talk for one very simple reason - this is one of the few times I've seen someone talk of how mobility is *cool* without saying 'Mobility is cool since it has so many users and will have so many users'.

Mobile devices are not cool because millions (billions?) of people use them and will be using them. Mobile devices are cool because of the way in which they can change people's lives. There are quite a few people I know who don't understand this distinction very well.

After his talk, I went up to Atul and introduced myself. We have a long history (I've heckled him and generally been a pain in all the wrong places over the years :-) ) but this was the first time we had met in person.

I was very pleasantly surprised - unlike some of my past encounters with the open source world, I thoroughly enjoyed the conversations I had with Atul. Over the course of the day, we touched a *lot* of topics - from Windows Vista to Microsoft to the GPL v3 to why he doesn't like Windows Genuine Advantage to the sad state of computer science in colleges today. In fact, we (Aarthi, Atul and I) pretty much missed the rest of the Barcamp talks and spent the day 'hanging out'. He may not like our software but he sure does like our coffee :-)

My own talk went reasonably well. Things get interesting when you try to squeeze an 80-minute talk into 15 minutes but I think the crowd was happy with the general outcome. More than the talk itself, I loved the conversations that ensued and the people I got to meet as a result.

One of the folks I ran into was Asshar Farhan from Spokn. We got to see a little demo of Spokn and it is an interesting Skype competitor in the mobile VoIP space. I had an interesting three-way conversation with Asshar and one of the people running the Office Communicator Mobile team.

Spokn is also interesting to me personally as they use my team's product! It is written using Visual Studio 2005. Interestingly enough, it is written in C and not using the Compact Framework.

My only criticism, if any, of the entire day was that it really wasn't an 'unconference'. There really wasn't anyone stepping up to share what they knew/wanted to talk about. We did have a couple of interesting group gatherings but overall, the feel was no different from any other conference (though this had a looser style).

I'm really not sure why this is the case. Is it a cultural thing? Or was it a function of the speaker/theme/venue? I sure would like to see more open-ended discussions next time around.

Best comment of the day - Atul Chitnis, when we escorted him to the front gate. "You can tell your superiors that you escorted Atul Chitnis to the gate and *made sure* that he left the Microsoft campus".

Hey Atul, we're not so evil. Judge us by our coffee :-)

 

Feed mess

Something I've been working on at work deals with feeds - I have to read, parse and derive meaning out of RSS and Atom feeds in the wild.

And it's not been fun.

The Universal Feed Parser is nice and everything but I'm still being forced to debug through weird Unicode issues. Or issues like some Feedburner feeds carrying the original permalink in feedburner:origLink instead of the regular link element.
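To give a flavour of the Feedburner case, here is a small sketch using only the standard library's ElementTree (not the Universal Feed Parser, and not the code I actually use at work). The sample feed and the `permalinks` helper are made up for illustration; the namespace URI is the one Feedburner feeds normally declare for the `feedburner:` prefix.

```python
import xml.etree.ElementTree as ET

# Namespace usually bound to the "feedburner" prefix in Feedburner feeds.
FEEDBURNER_NS = "http://rssnamespace.org/feedburner/ext/1.0"

# Hypothetical feed: the first item is rewritten by Feedburner, the second is not.
SAMPLE_RSS = """<rss version="2.0"
     xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
  <channel>
    <title>Example</title>
    <item>
      <title>First post</title>
      <link>http://feeds.example.com/~r/example/~3/1/first.html</link>
      <feedburner:origLink>http://blog.example.com/first.html</feedburner:origLink>
    </item>
    <item>
      <title>Second post</title>
      <link>http://blog.example.com/second.html</link>
    </item>
  </channel>
</rss>"""


def permalinks(rss_text):
    """Return one link per item, preferring feedburner:origLink when present."""
    root = ET.fromstring(rss_text)
    links = []
    for item in root.iter("item"):
        orig = item.findtext("{%s}origLink" % FEEDBURNER_NS)
        links.append(orig or item.findtext("link"))
    return links
```

Every feed quirk ends up needing a little special case like the `orig or ...` line, which is exactly why this has not been fun.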

Well, at least I'm not the only one facing these issues. Sam Ruby maintains a list here and the Google Reader team talks of the problems they've run into (and I've run into a lot of them as well).

 

BarCamp Hyderabad 2

I'll be speaking at BarCamp Hyderabad 2 so if you have some free time tomorrow and if you're in and around Hyderabad, do drop in at the Microsoft campus.

I'll be talking on mobile development for 'Web 2.0' and showing off Visual Studio in the process. This is mostly a rehash of my MEDC talk but I'm in panic mode right now as my laptop is throwing tantrums and just plain refusing to run any of my demos.

For those of you who can't make it tomorrow and can't wait for me to post the slides, here are the main takeaways

On a personal note, I'm looking forward to finally meeting Atul Chitnis in person. He's someone whom I've had multiple umm... friendly email conversations with :-)


 

Why 'send' doesn't mean 'receive'

Did you ever have this happen to you? You think you understand a piece of technology really well because it is so fundamental. But then someone comes and exposes how much you misunderstood something.

This happened to me today in a question regarding BSD sockets.

The debate was this - when a 'send()' call returns success, what does it *mean*? Does it mean

I'm willing to bet that a lot of you are confused now (or I hope so at least, for the sake of my ego). What did 'send()' actually succeed at?

The correct answer is the last one - that the message was successfully copied into some buffer in kernel space. In fact, one of the ways in which send() can fail (on a non-blocking socket) is if the buffer doesn't have enough space to squeeze in your data.

You can see this specifically mentioned at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/winsock/winsock/send_2.asp
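You can watch this happen with a few lines of Python. This is only a sketch using a local `socket.socketpair()`, not the internal library in question, and the function names are invented. The first function shows send() succeeding before the other end has called recv() at all; the second keeps sending on a non-blocking socket until the kernel buffer is full and send() finally fails.

```python
import socket


def send_before_receive(payload=b"hello"):
    """send() returns once the bytes are queued; the peer has not read yet."""
    a, b = socket.socketpair()
    try:
        sent = a.send(payload)    # success: data copied into a kernel buffer
        # Nothing has called recv() on b at this point.
        received = b.recv(1024)   # the peer picks the data up later
        return sent, received
    finally:
        a.close()
        b.close()


def fill_send_buffer(chunk=b"x" * 4096):
    """Send without ever reading until the non-blocking send() fails."""
    a, b = socket.socketpair()
    a.setblocking(False)
    queued = 0
    try:
        while True:
            queued += a.send(chunk)
    except BlockingIOError:
        # No room left in the buffer: the failure mode mentioned above.
        return queued
    finally:
        a.close()
        b.close()
```

How many bytes `fill_send_buffer()` manages to queue depends on the operating system's buffer sizes, so don't read anything into the exact number.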

Why was I confused? Probably because my mental image of 'send()' has always been 'the network transmission call that blocks' and I associated blocking with the slow act of sending data over the network.

In case anyone is wondering, this debate didn't arise due to BSD sockets. We have an internal library that is supposed to 'mimic' BSD sockets semantics and I was sure I had found a bug when I found the 'send()' call succeeding before the other side had got the data.

You learn something new every day. What's next? A bug in my hello world programs?

