What goes up must come down
30 Nov 2012
There are many electronic services which fall, roughly, under the category of “monitoring”. A mixture of commercial and open-source, the list includes systems like Nagios, and my personal recommendation for monitoring websites, Pingdom.
I’ve written before about the importance of “notification by exception”. I could spend part of every day running down a checklist, making sure everything’s OK. Apart from the inefficiency, this relies on me. In a more automated world, I could have an email delivered to me every day, saying that all was well, but if I didn’t notice on the day it failed to come… you get the picture.
For websites I care about (or which I’m paid to maintain), I can arrange to get an email if the site’s down. That’s good: it’s notification by exception. But my receiving that notification relies on quite a “tall” pyramid of interconnected services. What if Pingdom were down, or had a bug? What if my email provider, or my ability to receive email, or the loudspeaker in my phone that goes “bing” were broken? In those situations, I would miss a failure.
Admittedly, nothing I monitor in this way is life-and-death important. I got to thinking, though, about a system which could provide similar “alert” capabilities, but with better “false negative” behaviour: a system which I could rely upon to let me know if it went wrong, if there was a power cut, or if there was a genuine fault in whatever it was monitoring. I’m thinking in a wider context than just website monitoring here; perhaps I might want an alert that my back door’s been open for 30 minutes, or my bank balance has run low, or something.
I’m happy to be corrected on this, but I think that purely electronic systems fail that test. There isn’t a purely electronic system that can be guaranteed to alert me of all types of internal or external failure.
A type of system which comes closer, however, would be one that combined electronic properties with mechanical properties. Imagine a little marble that runs down a track - but is held back electronically. If the electronics fail, the marble runs. Power cuts, connection failures - that marble runs whatever type of failure happens higher up. Connect the electronic side to the internet, and the electronics can choose to allow the marble to fall when my website goes down.
There’s an obvious disadvantage: I need to be within line of sight of the marble to notice. This can be mitigated, if the system is small enough, by putting the system in a place I often look. Even better, the system could choose to send me an email, and if I don’t respond to the email within 10 minutes, allow the marble to run. That way, I don’t need to be within sight of it, but if there is some bug in the secondary alert mechanism, I’ll still see the marble sooner or later.
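The "email first, marble second" idea can be sketched as a tiny state machine. This is only an illustration in Python, not firmware: the `Escalation` class, its method names and the `ACK_GRACE_SECONDS` constant are hypothetical names made up for the sketch, and sending the email itself is left out.

```python
from dataclasses import dataclass
from typing import Optional

# The "10 minutes" from the text; an assumption that it should be a fixed constant.
ACK_GRACE_SECONDS = 10 * 60


@dataclass
class Escalation:
    """Tracks one fault from first detection to acknowledgement (hypothetical helper)."""

    fault_since: Optional[float] = None  # time the current fault was first seen
    acknowledged: bool = False

    def observe(self, fault_present: bool, now: float) -> None:
        """Record the latest check result."""
        if not fault_present:
            # Fault cleared: forget it, so a later fault starts a fresh countdown.
            self.fault_since = None
            self.acknowledged = False
        elif self.fault_since is None:
            # First sighting of a fault: this is the point where the email would go out.
            self.fault_since = now

    def acknowledge(self) -> None:
        """Called when I respond to the email."""
        if self.fault_since is not None:
            self.acknowledged = True

    def should_release(self, now: float) -> bool:
        """True once a fault has gone unacknowledged for the whole grace period."""
        if self.fault_since is None or self.acknowledged:
            return False
        return now - self.fault_since >= ACK_GRACE_SECONDS
```

Note that this logic only decides when to let go on purpose. The marble still has to run by itself if the electronics running this code die, which is the whole point of the mechanical half.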
I can think of two things that could make this system fail to alert. Mechanical failure is an obvious one, but this is an incredibly simple mechanical system: the marble would have to be sticky, or something. It’s also possible, however, to conceive of a pernicious bug that causes the electronics to hold the marble back even when a fault is present.
I’m not very good at mechanical stuff, and I don’t know about the formal software techniques that might be used for more critical situations (Wikipedia suggests I should read up on formal verification and/or dependently-typed programming). I can, however, imagine writing a very simple loop on something like an Arduino: set the output voltage low, releasing the marble, unless you receive an OK signal on the serial port every so often. After that, I can rig the thing up to any other systems - Pingdom, my bank balance, etc. - and not worry about their individual reliability, since anything that fails higher up in the chain causes the marble to be released.
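To make the loop concrete, here is a rough sketch of its logic, simulated in Python rather than written as real Arduino firmware. Everything named here is an assumption for illustration: the `DeadMansSwitch` class, the literal message `"OK"`, the timeout value, and the choice to start with the marble released until the first OK arrives.

```python
import time
from typing import Callable, Optional


class DeadMansSwitch:
    """Holds the marble only while OK messages keep arriving (hypothetical sketch)."""

    def __init__(self, timeout_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout = timeout_seconds
        self._clock = clock  # injectable so the logic can be tested without waiting
        self._last_ok: Optional[float] = None  # no OK seen yet

    def feed(self, message: str) -> None:
        """Handle one line read from the serial port."""
        if message.strip() == "OK":
            self._last_ok = self._clock()
        # Anything else is ignored, so garbage on the line never resets the timer.

    def holding(self) -> bool:
        """True means keep the voltage high and hold the marble back."""
        if self._last_ok is None:
            return False  # assumption: start released until proven healthy
        return self._clock() - self._last_ok < self._timeout
```

The default is the important design choice: the code never decides to release the marble, it only decides to keep holding it, and it needs a fresh reason to do so every time. Silence, garbage and a crashed upstream script all look the same to it.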