Sunday, May 31, 2009

Enterprise-level hardware

The enterprise, and hardware designed to service it, is concerned with a few simple things:

  • Visibility - Every piece of the server/switch/router/appliance should be easy to monitor and should flag issues before they become problems.
  • Reliability - Failure is not an option. Everything is redundant.
  • Availability - The hardware should never give reason for reboot - ever.
How horribly far away from this we are. Taking the same three points:

  • Visibility - All of the major "enterprise" hardware vendors ship hardware that falls short. ESX and ESXi on modern boxes are just one example of things going horribly wrong in this regard. Internode's most recent mail server failure is a particularly dark mark against whoever their SAN vendor was - the same can be said for the poor people at WebCentral. I've seen plenty of servers reboot after vendor monitoring and health metrics indicated everything was fine, only for RAM, RAID controller, battery and backplane failures to become painfully apparent as the system booted afterward.
  • Reliability - This is an interesting one. Take Dell (please, take them) - the 2850 featured redundant memory and dual RAID controllers. For some strange reason, the 2950 does not! The replacement for the 2950-III, the R710, reintroduces these features - thanks Dell!
  • Availability - How can this be so incredibly hard? Foundry, Cisco and Juniper can all give me gear that allows me to upgrade firmware online whilst coordinating BGP, OSPF, IS-IS and MPLS. Take a modern server - firmware for the BIOS, DRAC/iLO controller, RAID controller, IPMI/BMC and the list goes on. Why can't all of these be exposed and upgraded while the system is online? How hard can it be?
It gets worse. Recently we received a package from Dell containing a firmware update CD for Seagate Barracuda drives. Although we don't use any of the affected drives in production, the included literature was very worrying:
  • Sorry - everything will be offline when you upgrade this firmware.
  • If you've connected the drives in certain ways to our high-end RAID controllers, you'll need to power down and change that before you apply the update.
WTF? Dell may be a relative latecomer to the enterprise world, but surely they're not that far behind, even if their blade servers are a pathetic joke compared to everyone else's (it's called a redundant midplane - all the cool kids have them).

Tomorrow morning I'm going to ask my boss, the benevolent dictator, for two SANs set up in a redundant multipathed configuration for these reasons. I'll also ask for a bunch of vSphere 4 licenses so I can use VMotion to get around these horrible shortcomings.

Sometimes I really hate computers.
