Monday, February 25, 2008

We are all connected - A tale of system failure

Over the last 48 hours, a coworker and I have had to deal with a number of system problems across the many heterogeneous setups we take care of.

This is not unusual in our line of work - we have in-house, third-party and software-as-a-service style systems to take care of. We have seen these symptoms, and their causes, before, but this incident served as an excellent reminder.

One particular service we take care of does real-time information lookups against a third party's systems. The interconnection between our gear and theirs is not all that complicated, but it does require some understanding and care.

Anyway, the third party in this case had some network problems that knocked out parts of their systems. No lookups could be done, so the service on our side stopped working.

We have a number of companies we interact with via the same interconnection method, and a number of them share infrastructure on our end. As is so often the case, problems in one system have follow-on effects on the systems connected to it. A busy database server can cause problems in an application's web tier, for example.

In this case, the interconnection method means that, without due care and understanding, a lack of responsiveness from one lookup source can impact the deliverability of the entire service.
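To make that failure mode concrete, here is a minimal Python sketch of one common way to contain it: give each lookup source its own small worker pool and a hard timeout, so a source that stops answering can only tie up its own workers. This is not how our system is built. The names IsolatedSource, lookup_fn, max_workers and timeout_s are made up for illustration, and the two "sources" are stand-in functions rather than real network calls.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class IsolatedSource:
    """One lookup source with its own worker pool (hypothetical helper).

    Because the pool is not shared, a hung source can exhaust only its
    own workers and cannot starve lookups against other sources.
    """

    def __init__(self, name, lookup_fn, max_workers=2, timeout_s=0.5):
        self.name = name
        self._lookup_fn = lookup_fn
        self._timeout_s = timeout_s
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def lookup(self, key):
        """Return the lookup result, or None if the source is too slow."""
        future = self._pool.submit(self._lookup_fn, key)
        try:
            return future.result(timeout=self._timeout_s)
        except FutureTimeout:
            # The caller stops waiting; the worker may still be busy.
            # cancel() only helps if the call had not started yet.
            future.cancel()
            return None


# Stand-ins for two third-party sources (assumptions, not real services).
def slow_source(key):
    time.sleep(0.5)  # simulates a source that has stopped responding
    return "late answer"


def healthy_source(key):
    return key.upper()


slow = IsolatedSource("slow", slow_source, timeout_s=0.05)
healthy = IsolatedSource("healthy", healthy_source, timeout_s=1.0)

print(slow.lookup("abc"))     # gives up after roughly 0.05 seconds
print(healthy.lookup("abc"))  # unaffected by the slow source
```

The design choice is the separate pool per source. With one shared pool, enough stalled lookups against a single source would eventually occupy every worker, which is the kind of follow-on effect described above.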

All of this is due to one aspect of the system's overall design - a single choice made many years ago (before I or any of my workmates were involved - the system was acquired from a competitor).

This single choice was the result of vendor loyalty and narrow thinking. Vendor loyalty, in that technologies recommended or created by a particular vendor were used without regard for what other systems and technologies would be involved. Narrow thinking, in that a single technology, designed for a specific purpose, was considered the best solution for the wide array of environments that were part of the project's brief from day one.

With so many systems connected to other systems, there are still worrying recurrences of old, well-worn mistakes. Three small facts should be kept in mind when designing any system that interacts with another:
  • Networks fail
  • Bandwidth is finite
  • Latency changes, constantly
Simple, really.
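As a rough illustration of designing around those three facts, the Python sketch below wraps a remote call so that failure is expected (network-style errors are caught and retried), load is bounded (a fixed number of attempts with growing pauses, so a struggling link is not flooded) and latency is never trusted (every attempt gets an explicit timeout). The helper name call_with_retries and its parameters are hypothetical, and fn stands for whatever function performs the real lookup and accepts a timeout in seconds.

```python
import random
import time


def call_with_retries(fn, attempts=3, timeout_s=2.0, base_delay_s=0.1,
                      sleep=time.sleep):
    """Call fn(timeout_s), retrying on network-style errors (hypothetical helper)."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            # Latency changes: never wait on the remote side without a limit.
            return fn(timeout_s)
        except OSError as exc:
            # Networks fail: OSError covers ConnectionError and TimeoutError.
            last_error = exc
            if attempt < attempts - 1:
                # Bandwidth is finite: back off exponentially, with random
                # jitter so that many clients do not retry in lockstep.
                delay = base_delay_s * (2 ** attempt)
                sleep(delay + random.uniform(0, delay))
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

A caller would pass in a function that performs the actual lookup, for example one that opens a connection with socket.create_connection(address, timeout=timeout_s), and should still be prepared for the final exception once the attempts run out.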
