Showing posts with label IT. Show all posts

Sunday, July 13, 2008

malloc, free, repeat, repeat ad infinitum

In the current development climate, you'll be using one of two types of development environments:

  • An environment where the developer is responsible for allocating and freeing resources
  • An environment where the developer acquires resources, and (generally) the environment takes care of freeing them
Of course, it's not quite so clear-cut. Maintaining a static (or Shared, for you VB.Net developers) list of allocated resources will stop those resources from ever being reclaimed, even in a managed environment, but the distinction is sufficient for this article.

Recently, I had to deal with a delightful issue involving a memory leak on every developer's favourite OS, Windows NT (2000, 2003, XP, Vista, 2008 etc).

For those not familiar with the way that the Windows kernel "works", here's a minimal introduction:

The Windows kernel has two important memory pools, both critical to the correct operation of the kernel. One is the paged pool, able to be swapped to disk as required. The other is the non-paged pool, permanently in RAM. If either of these pools is exhausted, stability will be impacted.


This is why the UNIX world embraces redirection into user space for most high-level tasks. Virus scanning, remote access, graphical interfaces, intrusion detection, complex firewall rules and logging (to name a few that come to mind) are all handled mainly in user space, where crashes and incorrect behaviour are handled with a process kill and respawn.
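As a rough illustration of why that isolation matters, the sketch below runs a deliberately fragile worker in a separate process. The worker, its "crash on bad input" rule and the handle function are all hypothetical. The only point is that a failure in the worker costs one request, not the process that called it.

```python
import subprocess
import sys

# Hypothetical worker: exits abnormally when the input contains "bad",
# otherwise echoes the input in upper case.
WORKER_SOURCE = """
import sys
data = sys.stdin.read()
if "bad" in data:
    raise SystemExit(3)  # simulate a crash
print(data.upper(), end="")
"""


def handle(request, attempts=2):
    """Run the worker in its own process, respawning it if it dies."""
    for _ in range(attempts):
        result = subprocess.run(
            [sys.executable, "-c", WORKER_SOURCE],
            input=request,
            capture_output=True,
            text=True,
        )
        if result.returncode == 0:
            return result.stdout
    # Every attempt crashed: report failure, but the caller is still alive.
    return None


if __name__ == "__main__":
    print(handle("scan me"))
    print(handle("bad input"))
```

The same bug inside a kernel driver has no supervisor to catch it, and the failure belongs to the whole machine.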

Windows handles things in a, how shall we say, different way. Drivers for the vast majority of remotely low-level tasks do complex work inside kernel space. There are multiple cases of drivers from Network Associates, Symantec, Computer Associates and Roxio causing massive stability problems. My employer has requirements mandating the use of various vendor packages, each of which installs an assortment of drivers that do non-trivial things in kernel space. In every case, these drivers have resulted in major stability problems.

This gives rise to two very important questions:

  • How hard can it be to do a decent amount of testing to find these problems, given they require me to reboot my servers every week to "work around" them?
  • Why aren't these drivers a boring, stone-simple stub that passes the required data to a user space process that won't bring down the server in question due to a simple development error?
This week, I was faced with an agonising decision. A server running Windows 2003 Server (Standard Edition, 32 bit) that normally shows a Paged Pool usage of 30MB - 40MB was showing usage of 320MB.

After promptly panicking, I went looking for the cause. This had happened before, so I'd already maximised the size of both pools. The idea wasn't to prevent the problem, but rather to at least get the chance to both see the problem and diagnose it. We'd also ensured that pool tagging was enabled on all of the Windows machines we take care of and that Poolmon was readily accessible.
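For anyone who hasn't used pool tagging, the idea is that every kernel pool allocation carries a short tag naming its owner, so usage can be totalled per tag. The toy model below is not the Windows API. TaggedPool, the "Good" and "Leak" tags and the sizes are invented for illustration. It only shows why a per-tag tally makes a leaking owner stand out.

```python
from collections import defaultdict


class TaggedPool:
    """Toy pool that tracks live allocations by owner tag."""

    def __init__(self):
        self._live = {}  # handle -> (tag, size)
        self._next_handle = 1
        self.allocs = defaultdict(int)
        self.frees = defaultdict(int)

    def allocate(self, tag, size):
        """Record an allocation of `size` bytes owned by `tag`."""
        handle = self._next_handle
        self._next_handle += 1
        self._live[handle] = (tag, size)
        self.allocs[tag] += 1
        return handle

    def free(self, handle):
        """Release an allocation and credit the free to its tag."""
        tag, _size = self._live.pop(handle)
        self.frees[tag] += 1

    def usage_by_tag(self):
        """Return (tag, live bytes) pairs, biggest consumer first."""
        totals = defaultdict(int)
        for tag, size in self._live.values():
            totals[tag] += size
        return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def simulate_leak():
    """One well-behaved owner and one that never frees."""
    pool = TaggedPool()
    for _ in range(100):
        handle = pool.allocate("Good", 4096)
        pool.free(handle)
    for _ in range(100):
        pool.allocate("Leak", 4096)  # never freed
    return pool


if __name__ == "__main__":
    for tag, live_bytes in simulate_leak().usage_by_tag():
        print(tag, live_bytes)
```

An owner whose allocation count keeps pulling ahead of its free count, with live bytes climbing to match, is the one to go and look up.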

As soon as I saw the problem I fired up Poolmon and saw that the tag for the biggest user of the pool (280 MB of usage) was SevI. A quick check showed this was Symantec's SymEvent driver. And this is where my frustration began:

  • The SymEvent driver isn't advertised as being included with or used by pcAnywhere. It's mainly associated with Norton AntiVirus and Symantec AntiVirus
  • Executing LiveUpdate against pcAnywhere doesn't update the driver
  • The latest version of pcAnywhere no longer uses the SymEvent driver
  • There isn't a program available that will clean up the paged pool usage to allow me to forgo reboot of a critical production server
As much as I disagree with Microsoft's business practices, they make very good tools for verifying drivers (PREfast is actually quite fantastic). You could also use Purify and Valgrind (yes, it runs on Windows). Symantec is also a compiler developer, so there is little excuse for not using tools like these.

Because of some programmer's inability to understand that you should free what you malloc, I was forced to reboot a critical server before the "acceptable" hours of 3am - 6am. I applied the patch, and rebooted to a situation where I am expected to believe the problem is solved.

There are several worrying morals to this awful story:

  • Windows is too complicated. As my boss says (he's not a developer or sysadmin), you should always seek to make things simpler
  • Many eyes make all bugs shallow. Open source software is held up high by geeks for a reason.
  • Don't use pcAnywhere. There is no possible justification for using it.
  • The companies expecting us to trust them with our host-based security systems expect us to give up our uptime in return, without consequence to them (NAI, Symantec and CA are all guilty)
  • We (those in the industry) are cynical because we're given valid reasons to be that way
So to those responsible for my pain, those at the biggest security companies in the world, thanks for bringing down our industry in the eyes of outsiders. You owe those of us in the trenches an apology, and it's way overdue.

Monday, February 25, 2008

What you know doesn't matter

I have a number of colleagues who base their job security on one metric and one metric alone.

What they know

Unfortunately, when questioned further, what they mean becomes clearer and more worrying. It's not that their training, insight or approach is unique and not able to be reproduced. Rather, they are referring to (as a most excellent friend of mine put it) "esoteric knowledge" about their environment. Some choose policy as their instrument of security, some choose belligerence and some choose a combination of these and other foundations to ensure their place in an organisation is secure and indispensable.

To all those who think this way: please take a good look at what you are doing. I can assure you that your security is short term and the impact on yourself and others is at best tolerated. As the original aggregator recently brought to my attention, those who are good at their job inevitably get away from it, at least in IT. Those who do otherwise are characterised quite simply.

Seemingly counterintuitive perhaps, but entirely normal. A great software developer will not be writing code forever and will instead elevate their peers and eventually those who work for them. A great system administrator will not be writing shell scripts forever and will be asked to build a team to continue their most excellent work. Every DBA seeks to take care of bigger and better resourced data warehouses and can surely not expect to do so alone.

In all industries, we all seek to move into better pay rates, better work conditions and generally better lifestyle.

To actually improve your pay rate against the cost of living, you need to increase your value to whichever organisation you work for (even if, and especially if, you work for yourself) - no amount of hard work will offset the gains made through efficiency and knowledge transfer.

To improve your work conditions, maximising your options for growth and advancement whilst minimising the time you spend on mundane and repetitive tasks must surely be paramount.

Improvements in lifestyle are a product of both these things. Your family and friends surely want to see you happy and growing, as opposed to merely satisfied and surviving.

We are all human, and advancement is what separates us from the animals. Failing to pass on knowledge, to empower those around you and to relinquish control serves no one, least of all yourself.