Monday, December 7, 2015

Using machine monitoring for post-mortem kernel debugging.

Introduction

My kid is a fan of The IT Crowd, so it didn't surprise me when I received the following bug report:
Hey Dad.  My computer was acting funny.  So, I just turned it off for awhile.  Just letting you know.
My kid's desktop runs collectd and the collectd exporter to collect and export machine measurements.
On another machine, I run Prometheus to collect, store, and analyze those measurements.  I wondered if I could use Prometheus to debug what happened.

Up or Down

I wanted to learn when the machine was turned off.  Prometheus uses HTTP to collect metrics from the target machine.  If Prometheus can complete this scrape, it stores a 1 in a time series named up -- a failed scrape gets a value of 0.  I learned two things:

  • The computer was turned off at 02:00.
  • The computer failed a scrape two hours earlier! This means Prometheus was unable to use HTTP to collect metrics from the collectd exporter target.  Very odd.


[graph: up metric]
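
The graph above came from an expression along these lines in the expression browser.  The job label is an assumption; use whatever job name your prometheus.yml gives the collectd exporter scrape:

    # 1 when the scrape succeeded, 0 when it failed
    up{job="collectd"}

    # show only the failed scrapes
    up{job="collectd"} == 0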


Dashboard

I customized the existing Prometheus console templates to work with collectd's plugins.  This makes it easier to view the common machine metrics in one spot.  I discovered the following from that machine's dashboard:

  • At one point, all the CPU and RAM were in use -- this was when the Prometheus HTTP scrape failed.
  • The machine went idle for two hours.
  • There was a spike before the machine was powered off.  I am guessing my kid used the soft power off button and didn't unplug the computer from the power outlet. 


[graph: cpu]

[graph: ram]
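
For reference, the CPU and memory graphs above correspond roughly to the expressions below.  The metric and label names are assumptions about what the collectd exporter exposes on my machines; the expression browser's autocomplete will show the exact names on yours:

    # per-mode CPU time as a per-second rate, averaged over 5 minutes
    rate(collectd_cpu_total[5m])

    # memory in use, in bytes
    collectd_memory{memory="used"}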

Guessing

At this point, I used the expression browser to hunt for another collectd metric.  I discovered that the whole swap partition was in use:

[graph: swap]
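
The swap graph is just the raw collectd swap metric; again, the name below is an assumption about the exporter's naming, so check the expression browser for what your setup actually exposes:

    # all swap measurements for the desktop; "used" climbs to the size
    # of the whole swap partition
    collectd_swap
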
It looks like my kid used all the RAM and swap, and afterwards the system didn't recover.  I am not sure if that was a kernel issue or just regular userland ENOMEM bugs.  I asked him what he was doing -- he said "changing settings to make a game go faster".



Sunday, November 29, 2015

Did the power outage trigger my disk to fail?

Introduction

In my role as family system administrator, I try to keep the NAS running -- I don't want to lose all those billions of cat photos that the kids take.  On my NAS, I run a few servers to collect and export machine measurements.
On another machine, I run Prometheus to collect, store, and analyze my measurements.  For more details about Prometheus, you might want to review the "Prometheus: A practical guide to alerting at scale" presentation and slides.

Bad Disk Investigation

When I awoke on Friday the 27th, I noticed the electric power must have gone out during the night because the clock on the microwave was flashing.  The NAS didn't reboot because it is on a UPS.  That is also when I noticed that a hard disk had failed and needed to be replaced.  I replaced the drive two days later.  But I wondered -- did the power event cause the drive to fail?  The answer turned out to be no, but I thought it would be useful to capture the data I used to reach that conclusion.

Timeline

  • Fri, 20 Nov 2015 10:51:48 GMT: monitoring noticed drive failed
  • Fri, 27 Nov 2015 05:21:15 GMT: very short power event
  • Fri, 27 Nov 2015 12:00:00 GMT: noticed hard disk failed
  • Sat, 28 Nov 2015 20:24:52 GMT: replaced failed hard disk

When did the drive fail?

I use smartctl to measure the temperature of my hard disks.  I noticed that the temperature series for one of my disks had stopped reporting.  That is what let me determine when the drive failed.

[graph: hard drive temperature]
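
A handy way to spot this in the expression browser is to ask for temperature series that existed a while ago but are gone now.  The metric name below is a stand-in for whatever your smartctl collection exports; the offset/unless pattern is the interesting part:

    # disks that reported a temperature a day ago but report nothing now
    smart_temperature_celsius offset 1d unless smart_temperature_celsius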

When did the power event happen?

I poll the status of the UPS every 10 seconds.  The power event had to be shorter than 10 seconds, because the monitoring never even noticed this event -- the UPS status was always "on mains".  I found a spike in the UPS remaining capacity measurement -- that is the only clue I had about when the power event happened.   Comparing these two graphs showed that the disk failed about a week before the power event.

[graph: power event]
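
The graph above is just the raw remaining-capacity gauge.  The metric name below is a placeholder for whatever your UPS collection exports; wrapping it in max_over_time makes a brief blip stand out when the graph is zoomed out to a whole night:

    # highest remaining-capacity reading seen in each 30-minute window
    max_over_time(ups_remaining_capacity_percent[30m])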

Did any alerts fire?

Prometheus stores pending/firing alerts in the time series database.  I wanted to know if there were any firing alerts that I missed -- did I miss an alert for a week about my bad disk?  Nope -- the only firing alert was when I rebooted the NAS.  I will need to add another alert for failing disks.

[graph: alerts]
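
Prometheus keeps these in an ALERTS time series with an alertstate label, so the graph above comes from a query along these lines:

    # every alert that was pending or firing
    ALERTS

    # only the alerts that actually fired
    ALERTS{alertstate="firing"}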



What did I learn?

  • The FreeBSDism of using 'sysctl -n kern.disks' to find a list of disks in a machine appears to quietly drop failed or missing hard disks -- this is a feature for removable USB drives. I cannot use this measurement to track the disks expected to be alive on a machine.
  • Polling the UPS might not give the required data for very short power events.  I might want to enable SNMP traps to capture them.
  • I don't want to leave my RAID array with one dead disk for a whole week.  I am going to set up an alert for this -- a rough sketch follows below.
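
Here is the kind of rule I have in mind, sketched in Prometheus's ALERT/IF/FOR rule syntax.  The metric name is the same hypothetical stand-in as above, and the FOR duration is a guess rather than something I have deployed:

    # Fire when a disk that was reporting SMART data an hour ago has
    # stopped reporting it.  The metric name is a placeholder.
    ALERT DiskStoppedReporting
      IF smart_temperature_celsius offset 1h unless smart_temperature_celsius
      FOR 30m
      LABELS { severity = "page" }
      ANNOTATIONS {
        summary = "a disk on {{ $labels.instance }} stopped reporting SMART data"
      }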