When it comes to debugging hard-to-diagnose software and operating-system problems, there is no set recipe. Rather, debugging is all about “having the right tools and knowing how to use them,” advised Microsoft technical fellow Mark Russinovich at the close of the Microsoft TechEd conference, held this week in Atlanta.
Among the highlights of each year’s TechEd conference are the technology demonstrations. Smart Microsoft and partner engineers walk attendees through how to use some new technology in a step-by-step process, making it seem easy — or even fun — to deploy.
And one of the most popular demonstrations over the past few years has been Russinovich’s “Cases of the Unexplained,” in which he shows how he and others tracked down hard-to-pinpoint errors in Windows deployments.
This year, of course, was no exception. Before a packed auditorium, Russinovich debugged a number of tricky problems using only a handful of free tools, many created by Russinovich himself, including Process Explorer and Process Monitor. He borrowed many examples in his presentation from his blog, where he collects user stories of tough problems.
In the cases Russinovich demonstrated, the root causes of the misbehaving systems were not readily obvious. This was especially true of software, which, he noted, offers little explanation when it crashes. “Programs do a bad job of telling what went wrong,” he said. Yet he showed that it is possible to carefully trace a problem's symptoms back to their cause.
One example Russinovich dubbed “the case of the slow website.” This example was submitted to Russinovich by a system administrator from an unnamed company. The organization’s users were complaining of slow performance of some internal Web pages. The admin traced all the slow Web pages to a single server, then ran Process Explorer, which shows all the processes on a server and how much memory and CPU each thread of a process is consuming.
The admin identified one thread that was hogging more than a quarter of the server’s resources. Through a Web search, he found that the thread belonged to a Windows management driver that, in turn, communicated with the server chassis’ management controller provided by the server manufacturer. The two components were having trouble communicating, so the traffic between them spiked.
The cause turned out to be that the blade server was not seated properly in the rack. The admin reseated the server chassis and the server was soon serving its Web pages quickly again.
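The per-thread view that pointed the admin to the culprit can be approximated with a little arithmetic. The following is a minimal sketch, not how Process Explorer itself works: it assumes you already have two samples of cumulative CPU seconds per thread (the function name, parameters, and sample data are all hypothetical) and reports the threads whose share of total CPU capacity exceeds a threshold, here the "more than a quarter" from the story.

```python
from typing import Dict, List, Tuple


def busiest_threads(
    before: Dict[int, float],
    after: Dict[int, float],
    interval_s: float,
    cpu_count: int,
    threshold_pct: float = 25.0,
) -> List[Tuple[int, float]]:
    """Return (thread_id, percent) pairs above threshold_pct, busiest first.

    `before` and `after` map a thread ID to its cumulative CPU seconds at
    the start and end of the sampling interval (hypothetical input format).
    """
    if interval_s <= 0 or cpu_count <= 0:
        raise ValueError("interval_s and cpu_count must be positive")

    # Total CPU seconds the whole machine could have spent in the interval.
    capacity = interval_s * cpu_count

    hogs = []
    for tid, end in after.items():
        # A thread missing from `before` is treated as created mid-interval.
        start = before.get(tid, 0.0)
        pct = 100.0 * (end - start) / capacity
        if pct > threshold_pct:
            hogs.append((tid, pct))

    return sorted(hogs, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up samples taken 10 seconds apart on a two-core server.
    sample_before = {101: 10.0, 102: 5.0}
    sample_after = {101: 10.5, 102: 11.0}
    print(busiest_threads(sample_before, sample_after, 10.0, 2))
```

With the made-up samples, thread 102 used 6 of the 20 available CPU seconds, so it is the only one flagged.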
Another problem came not from misbehaving equipment or software, but rather from user behavior. “This case came into the Microsoft Exchange support team,” Russinovich said.
Users complained that Microsoft Exchange would periodically take up to 30 seconds to respond. Microsoft asked the customer to log the server's performance using Performance Monitor, which showed periodic spikes in CPU utilization. Using ProcDump, a Microsoft engineer created a script that would capture all the process information whenever processor usage went above a certain threshold.
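The article does not say how the engineer's script was written, so the following is only a sketch of the general idea. It assembles a ProcDump command line using the tool's documented CPU trigger switches: `-c` for the CPU threshold, `-s` for the number of consecutive seconds the threshold must hold, `-n` for the number of dumps, and `-ma` for a full memory dump. The helper name, the threshold values, and the process name `example.exe` are assumptions made for illustration.

```python
from typing import List


def procdump_cpu_trigger(
    target: str,
    cpu_threshold_pct: int,
    consecutive_s: int = 10,
    dump_count: int = 1,
    full_dump: bool = False,
) -> List[str]:
    """Build a ProcDump argument list that dumps `target` on a CPU spike.

    `target` is a process name or PID. This helper is hypothetical; only
    the switches it emits come from ProcDump itself.
    """
    if not 1 <= cpu_threshold_pct <= 100:
        raise ValueError("cpu_threshold_pct must be between 1 and 100")
    if consecutive_s < 1 or dump_count < 1:
        raise ValueError("consecutive_s and dump_count must be at least 1")

    args = ["procdump", "-accepteula"]
    if full_dump:
        args.append("-ma")  # include all process memory in the dump
    args += [
        "-c", str(cpu_threshold_pct),  # CPU usage that triggers a dump
        "-s", str(consecutive_s),      # seconds the threshold must be held
        "-n", str(dump_count),         # dumps to write before exiting
        target,
    ]
    return args


if __name__ == "__main__":
    # Hypothetical values: dump example.exe three times when it stays
    # above 80 percent CPU for five consecutive seconds.
    print(" ".join(procdump_cpu_trigger("example.exe", 80, 5, 3)))
```

On a Windows machine with ProcDump on the path, the returned list could be passed to `subprocess.run`. Building the arguments separately keeps the sketch easy to inspect without actually launching the tool.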