- An environment where the developer is responsible for allocating and freeing resources
- An environment where the developer acquires resources, and (generally) the environment takes care of freeing them
Recently, I had to deal with a delightful issue involving a memory leak on every developer's favourite OS, Windows NT (2000, 2003, XP, Vista, 2008, etc.).
For those not familiar with the way that the Windows kernel "works", here's a minimal introduction:
The Windows kernel has two important memory pools, both critical to its correct operation. One is the paged pool, which can be swapped to disk as required. The other is the non-paged pool, which stays permanently in RAM. If either pool is exhausted, allocations start to fail, drivers misbehave, and the machine can hang or crash outright.
This is why the UNIX world pushes most high-level tasks out into user space. Virus scanning, remote access, graphical interfaces, intrusion detection, complex firewall rules and logging (to name a few that come to mind) are handled mainly in user space, where a crash or incorrect behaviour is dealt with by killing and respawning a process.
Windows handles things in a, how shall we say, different way. Drivers for the vast majority of remotely low-level tasks do complex work inside kernel space. Network Associates, Symantec, Computer Associates and Roxio have all shipped drivers that caused massive stability problems. My employer has requirements mandating packages from vendors that install an assortment of drivers doing non-trivial things in kernel space. In every case, these drivers have resulted in major stability problems.
This gives rise to two very important questions:
- How hard can it be to do a decent amount of testing to find these problems, given that they force me to reboot my servers every week to "work around" them?
- Why aren't these drivers a boring, stone-simple stub that passes the required data to a user space process that won't bring down the server in question due to a simple development error?
After promptly panicking, I went looking for the cause. This had happened before, and I'd already maximised the size of both pools. The idea wasn't to prevent the problem, but to at least get the chance to see it and diagnose it. We'd also ensured that Pool Tagging was enabled on all of the Windows machines we take care of and that Poolmon was readily accessible.
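For reference, that setup looks roughly like the commands below. This is a sketch, not a recipe: gflags ships with the Debugging Tools for Windows, Poolmon with the Support Tools or Driver Kit, and the exact flags vary between releases, so check the documentation for your versions.

```
rem Enable pool tagging (needed on Windows 2000; XP and later keep it on permanently)
gflags /r +ptg

rem Watch live pool usage, sorted by byte count; note the worst offending tags
poolmon -b
```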
As soon as the problem appeared, I fired up Poolmon, which showed the biggest consumer of the pool (280 MB of usage) under the tag SevI. A quick check showed this was Symantec's SymEvent driver. And this is where my frustration began:
- The SymEvent driver isn't advertised as being included with or used by pcAnywhere. It's mainly associated with Norton AntiVirus and Symantec AntiVirus
- Executing LiveUpdate against pcAnywhere doesn't update the driver
- The latest version of pcAnywhere no longer uses the SymEvent driver
- There isn't a program available that will release the leaked paged pool allocations and let me forgo a reboot of a critical production server
Because some programmer couldn't grasp that you should free what you malloc, I was forced to reboot a critical server well before the "acceptable" hours of 3am - 6am. I applied the patch and rebooted, and am now expected to believe the problem is solved.
There are several worrying morals to this awful story:
- Windows is too complicated. As my boss says (and he's neither a developer nor a sysadmin), you should always seek to make things simpler
- Many eyes make all bugs shallow. Open source software is held up high by geeks for a reason.
- Don't use pcAnywhere. There is no possible justification for using it.
- The companies that expect us to trust them with our host-based security also expect our uptime in return, without consequence (NAI, Symantec and CA are all guilty)
- We (those in the industry) are cynical because we're given valid reasons to be that way