Sunday, July 13, 2008

malloc, free, repeat, repeat ad infinitum

In the current development climate, you'll be using one of two types of development environments:

  • An environment where the developer is responsible for allocating and freeing resources
  • An environment where the developer acquires resources, and (generally) the environment takes care of freeing them

Of course, it's not quite so clear-cut. Maintaining a static (or Shared for you VB.Net developers) list of allocated resources will stop any sort of reclamation from happening, but the distinction is sufficient for this article.

Recently, I had to deal with a delightful issue involving a memory leak on every developer's favourite OS, Windows NT (2000, 2003, XP, Vista, 2008 etc).

For those not familiar with the way that the Windows kernel "works", here's a minimal introduction:

The Windows kernel has two important memory pools, both critical to the correct operation of the kernel. One is the paged pool, able to be swapped to disk as required. The other is the non-paged pool, permanently in RAM. If either of these pools is exhausted, stability will be impacted.


This is why the UNIX world embraces redirection into user space for most high-level tasks. Virus scanning, remote access, graphical interfaces, intrusion detection, complex firewall rules and logging (to name a few that come to mind) are all handled mainly in user space, where crashes and incorrect behaviour are handled with a process kill and respawn.

Windows handles things in a, how shall we say, different way. Drivers for the vast majority of remotely low-level tasks do complex work inside kernel space. There are multiple cases of drivers from Network Associates, Symantec, Computer Associates and Roxio causing massive stability problems. My employer has requirements mandating the use of several packages from vendors that install an assortment of drivers that do non-trivial things in kernel space. In every case, these drivers have resulted in major stability problems.

This gives rise to two very important questions:

  • How hard can it be to do a decent amount of testing to find these problems, given they require me to reboot my servers every week to "work around" them?
  • Why aren't these drivers a boring, stone-simple stub that passes the required data to a user space process that won't bring down the server in question due to a simple development error?

This week, I was faced with an agonising decision. A server running Windows 2003 Server (Standard Edition, 32-bit) that normally shows a Paged Pool usage of 30MB - 40MB was showing usage of 320MB.

After promptly panicking, I went looking for the cause. This has happened before, and I'd maximised the size of both pools. The idea wasn't to prevent the problem, but rather to at least get the chance to both see the problem and diagnose it. We've ensured that Pool Tagging was enabled on all of the Windows machines we take care of and that Poolmon was readily accessible.
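
For those who haven't met pool tagging, the idea is simple enough to sketch in user space: every allocation carries a four-character tag, and the allocator keeps running totals per tag, so a tool can show who is holding the memory. What follows is only an illustration of that bookkeeping, assuming plain malloc underneath. The names `tagged_alloc`, `tagged_free`, `tagged_bytes` and `tagged_report` are hypothetical and invented for this example; none of this is the Windows kernel API or how Poolmon is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_TAGS 16

/* Per-tag counters, loosely modelled on the columns a pool monitor shows. */
struct tag_stats {
    char   tag[5];   /* four-character tag plus terminator */
    size_t allocs;
    size_t frees;
    size_t bytes;    /* bytes currently outstanding */
};

static struct tag_stats stats[MAX_TAGS];
static size_t tag_count;

/* Header placed in front of every block so the free path can find the owner.
 * The max_align_t member keeps the payload suitably aligned (C11). */
union header {
    struct { struct tag_stats *owner; size_t size; } info;
    max_align_t align;
};

/* Look up a tag; register it if create is non-zero and there is room. */
static struct tag_stats *find_tag(const char *tag, int create)
{
    for (size_t i = 0; i < tag_count; i++)
        if (strncmp(stats[i].tag, tag, 4) == 0)
            return &stats[i];
    if (!create || tag_count == MAX_TAGS)
        return NULL;
    struct tag_stats *s = &stats[tag_count++];
    strncpy(s->tag, tag, 4);
    s->tag[4] = '\0';
    return s;
}

/* Allocate size bytes and charge them to the given tag. */
void *tagged_alloc(size_t size, const char *tag)
{
    struct tag_stats *s = find_tag(tag, 1);
    if (s == NULL)
        return NULL;
    union header *h = malloc(sizeof *h + size);
    if (h == NULL)
        return NULL;
    h->info.owner = s;
    h->info.size = size;
    s->allocs++;
    s->bytes += size;
    return h + 1;
}

/* Release a block and credit the bytes back to its tag. */
void tagged_free(void *p)
{
    if (p == NULL)
        return;
    union header *h = (union header *)p - 1;
    assert(h->info.owner->bytes >= h->info.size);  /* sanity check */
    h->info.owner->frees++;
    h->info.owner->bytes -= h->info.size;
    free(h);
}

/* Bytes currently outstanding for a tag (0 if the tag is unknown). */
size_t tagged_bytes(const char *tag)
{
    struct tag_stats *s = find_tag(tag, 0);
    return s ? s->bytes : 0;
}

/* Print one line per tag; a leaking owner shows allocs racing ahead of frees. */
void tagged_report(void)
{
    for (size_t i = 0; i < tag_count; i++)
        printf("%-4s allocs=%zu frees=%zu bytes=%zu\n", stats[i].tag,
               stats[i].allocs, stats[i].frees, stats[i].bytes);
}
```

Allocate a few blocks under two made-up tags, free only some of them, and `tagged_report()` points straight at the tag whose byte count never comes back down. That is all I needed the real tooling to tell me.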

As soon as I saw the problem I fired up Poolmon and saw that the tag for the biggest user of the pool (280 MB of usage) was SevI. A quick check showed this was Symantec's SymEvent driver. And this is where my frustration began:

  • The SymEvent driver isn't advertised as being included with or used by pcAnywhere. It's mainly associated with Norton AntiVirus and Symantec AntiVirus
  • Executing LiveUpdate against pcAnywhere doesn't update the driver
  • The latest version of pcAnywhere no longer uses the SymEvent driver
  • There isn't a program available that will clean up the paged pool usage to allow me to forgo reboot of a critical production server

As much as I disagree with Microsoft's business practices, they actually make very good tools for verifying drivers (PREfast is actually quite fantastic). You could also use Purify and Valgrind (yes, it runs on Windows). Symantec are also a compiler developer.

Because of some programmer's inability to understand that you should free what you malloc, I was forced to reboot a critical server before the "acceptable" hours of 3am - 6am. I applied the patch, and rebooted to a situation where I am expected to believe the problem is solved.
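
I have no idea what the actual bug in the driver looked like, so take this as a generic illustration rather than a diagnosis. The classic way to forget a free is an early return on an error path. One common C idiom is to route every exit through a single cleanup label, sketched below in ordinary user-space code. The function `process_event` and its parameters are hypothetical, made up for this example.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copies two strings into scratch buffers, then joins
 * them as "name:detail" in the caller's buffer.
 * Returns 0 on success and -1 on any failure. */
int process_event(const char *name, const char *detail,
                  char *out, size_t out_len)
{
    int rc = -1;
    char *name_copy = NULL;
    char *detail_copy = NULL;

    if (name == NULL || detail == NULL || out == NULL)
        goto cleanup;

    name_copy = malloc(strlen(name) + 1);
    if (name_copy == NULL)
        goto cleanup;
    strcpy(name_copy, name);

    detail_copy = malloc(strlen(detail) + 1);
    if (detail_copy == NULL)
        goto cleanup;               /* name_copy is still freed below */
    strcpy(detail_copy, detail);

    /* A bare "return -1;" here would leak both buffers on every failure. */
    if (strlen(name_copy) + strlen(detail_copy) + 2 > out_len)
        goto cleanup;

    strcpy(out, name_copy);
    strcat(out, ":");
    strcat(out, detail_copy);
    rc = 0;

cleanup:
    /* free(NULL) is a no-op, so this is safe on every path. */
    free(detail_copy);
    free(name_copy);
    return rc;
}
```

Every malloc has exactly one matching free, and no error path can skip it. It's not clever, which is rather the point.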

There are several worrying morals to this awful story:

  • Windows is too complicated. As my boss says (he's not a developer or sysadmin), you should always seek to make things simpler
  • Many eyes make all bugs shallow. Open source software is held up high by geeks for a reason.
  • Don't use pcAnywhere. There is no possible justification for using it.
  • The companies expecting us to trust them with our host-based security systems expect our uptime in return without consequence (NAI, Symantec and CA are all guilty)
  • We (those in the industry) are cynical because we're given valid reasons to be that way

So to those responsible for my pain, those at the biggest security companies in the world, thanks for bringing down our industry in the eyes of the outsiders. You owe those of us in the trenches an apology, and it's way overdue.
