Engineering at Prezi

Changelog: A Tool Designed to Help You Recover Faster

A big part of dealing with web-based systems is avoiding and handling outages. You do everything you can to make sure they don’t happen, but then they just go ahead and occur anyway.

Image via AndroidPolice

When (not if) things break, it’s important to know where to start looking. Proper monitoring helps isolate the symptoms, but you need to be able to answer a simple question within seconds:

What’s changed in your system in the last ten minutes?

If you find an answer, more often than not, you will seriously cut your mean time to recovery. Changelog is a tool we’ve created and used for almost a year that answers that very question, and now we’re open-sourcing it.

Why you need this in your life

Reverting the change that caused an outage is usually enough to recover. The first step towards isolating that change is building a list of candidates. Once the first link in the chain of events is confirmed, your postmortem investigation will also have a great place to start looking.

A problem of scale

As long as the team working on an application is small enough to sit in the same room, members can just shout things like, “Hey, who changed what?” and get an accurate answer in seconds. But once the engineering organization (and the project) grows to the point where you find yourselves in different rooms, you need a centralized system that tracks all the “dangerous” changes.

In the single-room scenario, everyone reports what they consider dangerous at that moment. But when you build a system that tracks changes, it’s impossible to reliably predict what will cause problems, so it’s safer to track everything. This also makes you less likely to hear the sentence “I didn’t think that’d cause a problem.” Not hearing this sentence will help you live longer (probably).

Some examples of what to track:

  • deployments, releases
  • feature toggle changes (or feature flag, switch, pick your name)
  • database migrations
  • cloud instances starting, stopping
  • server reboots
  • changes to DNS records
  • changes to server configurations (ideally via Chef, Puppet, Ansible, or something similar)

That’s a whole lot of events you don’t want to forget to send, so automation is key. To make that easier, Changelog provides a horrifically simple API:

curl http://changelog.awesomecompany.com/api/events \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"criticality": 1, "unix_timestamp": 1395334488, "category": "misc", "description": "cli test"}'
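
Since automation is the point, the same call is easy to wrap in a small helper that deploy scripts or cron jobs can import. What follows is a minimal sketch, not part of Changelog itself. It assumes only the four fields and the placeholder URL from the curl example above. The function names (build_event, build_request, send_event) are made up for illustration, and the shape of the server's response is not assumed beyond an HTTP status code.

```python
import json
import time
import urllib.request

# Assumption: the placeholder endpoint from the curl example above.
CHANGELOG_URL = "http://changelog.awesomecompany.com/api/events"


def build_event(category, description, criticality=1, timestamp=None):
    """Return an event dict with the four fields used in the curl example."""
    return {
        "criticality": criticality,
        # Default to "now" so callers only pass a timestamp when backfilling.
        "unix_timestamp": int(time.time()) if timestamp is None else int(timestamp),
        "category": category,
        "description": description,
    }


def build_request(event, url=CHANGELOG_URL):
    """Prepare (but do not send) a JSON POST request for the event."""
    return urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_event(event, url=CHANGELOG_URL, timeout=5):
    """POST the event and return the HTTP status code."""
    with urllib.request.urlopen(build_request(event, url), timeout=timeout) as response:
        return response.status
```

A deploy script could then call send_event(build_event("misc", "cli test")) as its last step. Keeping the request construction separate from the network call makes the payload easy to check without a running server.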

It wouldn’t be right if I failed to add a link to the talk that inspired this tool in the first place (by Roy Rapoport, of Netflix fame). So I did.

So what does it look like?

Screenshot

Of course, yours won’t feature the black voids of doom (we’re like Batman in that we protect the innocent. Also some of us live in subterranean caves with huge computer screens and leather suits).

As you can see, Changelog provides several ways to filter events; they come in very handy when trying to pinpoint the change (or group of changes) causing the problem.
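
To make the "last ten minutes" question concrete, here is a rough sketch of that kind of narrowing, done over a plain list of event dicts. It is not Changelog's own filtering code or its read API. It assumes only that events carry the same four fields as the POST example earlier in this post. The function name recent_events and the sample category names are hypothetical.

```python
import time


def recent_events(events, window_seconds=600, category=None, criticality=None, now=None):
    """Return events from the last `window_seconds`, newest first.

    `category` and `criticality` are optional exact-match filters.
    """
    now = time.time() if now is None else now
    cutoff = now - window_seconds
    selected = []
    for event in events:
        # Keep only events inside the time window.
        if not (cutoff <= event["unix_timestamp"] <= now):
            continue
        if category is not None and event["category"] != category:
            continue
        if criticality is not None and event["criticality"] != criticality:
            continue
        selected.append(event)
    # Newest first: the most recent change is usually the first suspect.
    return sorted(selected, key=lambda e: e["unix_timestamp"], reverse=True)


if __name__ == "__main__":
    # Hypothetical sample data, timestamps relative to "now".
    current = int(time.time())
    sample = [
        {"criticality": 1, "unix_timestamp": current - 120, "category": "deploy", "description": "web release"},
        {"criticality": 2, "unix_timestamp": current - 30, "category": "dns", "description": "record change"},
        {"criticality": 1, "unix_timestamp": current - 3600, "category": "deploy", "description": "old release"},
    ]
    for event in recent_events(sample, now=current):
        print(event["unix_timestamp"], event["category"], event["description"])
```

The default window of 600 seconds mirrors the ten-minute question; widening it is often the next step when nothing suspicious shows up.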

Head over to the GitHub repo for more details. Here’s a direct link to the getting started section of the readme.

Happy tracking!