Sunday, December 21, 2008

Logs are an admin's best friend....

Or their absolute worst enemy, depending on what's producing them.

One part of my employer's network has to satisfy one of the industry's most intense and invasive standards: the Payment Card Industry Data Security Standard (PCI DSS). Many organisations are grappling with compliance at the moment - it's not at all easy.

One of the major parts of compliance (and an aspect our QSA identifies as a regular failing of organisations aiming to achieve it) is logging - specifically, the retention and protection of logs.

Proven, standardised solutions work best, so of course we deploy syslog servers to capture the huge volume of log traffic from the myriad devices that earn us our tick each year.
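
For flavour, pointing one of our firewalls at a collector only takes a handful of lines. A rough ScreenOS example, from memory - the collector address and facilities are purely illustrative, and the exact syntax varies a little between releases:

    set syslog config "10.0.0.50"
    set syslog config "10.0.0.50" facilities local0 local0
    set syslog enable

The first line defines the collector, the second picks the facilities to log to, and the last switches logging to it on.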

Last Tuesday morning at 00:15 I was about to push some scheduled firewall changes to one of our production firewall clusters. As I'm not a complete and utter nutter (commenters may disagree), I was at home and doing it remotely.

All attempts to bring up either the SSH or web interface to the firewall cluster failed. Logging in to the office VPN and trying from there failed too. I could get into our production network without issue, but even from there, getting to the firewall simply wasn't happening. Down to one of Melbourne's tier 3 data centres I went.

There was no indication of any network slowdown, and all our monitoring showed no issues at all. I managed to get to the console on the then-active firewall and couldn't see anything standing out as a problem, but the CPU usage was five times normal - never good, but at least I had an explanation for the lack of reachability.

Given these lovely devices are clustered together and there was no indication to the contrary, I suspected some weird software fault and forced the other node of the cluster to take over. Access to the main administrative interface was restored...for about 75 seconds. The high CPU usage returned and SSH and web access dropped away. Our SNMP monitoring was still working, and showed no increase in session count, attacks or memory usage - just high CPU. Failing back didn't help either.

After an awful lot of buggering around, I managed to trace the problem to one of our syslog servers. A quick reboot of that server fixed everything.

Logs were the cause of the problem, and there was no indication anywhere that the syslog server was at fault - we use Juniper gear, and the firewall's syslog connections did not appear in the output of "get session". This is the second time we've discovered a "silent failure" like this on our Juniper devices.
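
For anyone wanting to poke at the same things on their own boxes, the relevant ScreenOS commands are roughly these (again from memory, so verify the syntax against your own release):

    get performance cpu detail      (CPU usage history)
    get session dst-port 514        (the session table, filtered to syslog traffic)
    get log event                   (the local event log)

The middle one is the catch: in our case the firewall's own syslog connections simply weren't there, which is exactly why nothing pointed at the syslog server.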

I've raised this with Juniper. I suspect that I'll never get a solid resolution - just like we didn't last time.
