Journal of a Family System Administrator

Introduction

My kid is a fan of The IT Crowd, so it didn't surprise me when I received the following bug report:

Hey Dad. My computer was acting funny. So, I just turned it off for awhile. Just letting you know.

My kid's desktop runs the following servers to collect and export machine measurements:

collectd
Prometheus Collectd Exporter

On another machine, I run Prometheus to collect, store, and analyze my measurements. I wondered if I can use Prometheus debug what happened.

Up or Down

I would like to learn when the machine got turned off. Prometheus uses HTTP to collect metrics from the target machine. If Prometheus can finish this scrape, it stores a 1 in a time series named UP -- a failed scrape will get a value of 0. I learned two things:

The computer was turned off at 02:00.
The computer failed a scrape two hours earlier! This means Prometheus was unable to use http to collect metrics from the Collectd Exporter target. Very odd.

Dashboard

I customized the existing prometheus console templates to work with collectd's plugins. This makes it easier to view the common machine metrics in one spot. I discovered the following from that machine's dashboard:

At one point, all the CPU and RAM were in use -- this was when the Prometheus HTTP scrape failed.
The machine went idle for two hours.
There was a spike before the machine was powered off. I am guessing my kid used the soft power off button and didn't unplug the computer from the power outlet.

Guessing

At this point, I used the expression browser to hunt for another collectd metric. I discovered that the whole swap partition was in use:

It looks like my kid used all the RAM and swap. Afterwards the system didn't recover. Not sure if that was a kernel issue or just regular userland ENOMEM kinda bugs. I asked him what he was doing -- he said "changing settings to make a game go faster".

Introduction

In my role as family system administrator, I try to keep the NAS running -- don't want to lose all those billions of cat photos that the kids take. On my NAS, I run the following servers to collect and export machine measurements:

collectd
Prometheus Collectd Exporter

On another machine, I run Prometheus to collect, store, and analyze my measurements. For more details about Prometheus, you might want to review the "Prometheus: A practical guide to alerting at scale" presentation and slides.

Bad Disk Investigation

When I awoke on Friday the 27th, I noticed the electric power must have went out during the night because the clock on the microwave was flashing. The NAS didn't reboot because I use a UPS. That is when I noticed a hard disk failed and now needs to be replaced. I replaced the drive two days later. But I wondered -- did the power event cause the drive to fail? The answer turned out to be no -- but I thought it would be useful to capture the data I used to make this conclusion.

Timeline

Fri, 20 Nov 2015 10:51:48 GMT:monitoring noticed drive failed
Fri, 27 Nov 2015 05:21:15 GMT:very short power event
Fri, 27 Nov 2015 12:00:00 GMT:noticed hard disk failed
Sat, 28 Nov 2015 20:24:52 GMT: replaced failed hard disk

When did the drive fail?

I use smartctl to measure the temperature of my hard disks. I noticed that the temperature of one of my disks was missing. This measurement helped me determine when the drive failed.

When did the power event happen?

I poll the status of the UPS every 10 seconds. The power event had to be shorter than 10 seconds, because the monitoring never even noticed this event -- the UPS status was always "on mains". I found a spike in the UPS remaining capacity measurement -- that is the only clue I had about when the power event happened. Comparing these two graphs showed that the disk failed about a week before the power event.

Did any alerts fire?

Prometheus stores pending/firing alerts in the time series database. I wanted to know if there were any firing alerts that I missed -- did I miss an alert for a week about my bad disk? Nope -- the only firing alert was when I rebooted the NAS. I will need to add another alert for failing disks.

What did I learn?

The FreeBSDism of using 'sysctl -n kern.disks' to find a list of disks in a machine appears to quietly drop failed or missing hard disks -- this is a feature for removable USB drives. I cannot use this measurement to track the disks expected to be alive on a machine.
Polling the UPS might not give the required data for very short power events. I might want to enable SNMP traps to capture very short power events.
I don't want to leave my raid array with one dead disk for a whole week. I am going to setup some alerts for this!

Journal of a Family System Administrator

Monday, December 7, 2015

Using machine monitoring for post-mortem kernel debugging.

Introduction

Up or Down

Dashboard

Guessing

Sunday, November 29, 2015

Did the power outage trigger my disk to fail?

Introduction

Bad Disk Investigation

Timeline

When did the drive fail?

When did the power event happen?

Did any alerts fire?

What did I learn?