When a service you own is down, you should immediately reach for your runbook (or troubleshooting guide, or whatever your organization calls it) and start executing it.

The tip for this post is to make sure you actually have runbooks for all the services in your organization. If not, creating them before an actual outage should be a top priority. Ideally, writing them should be built into all new work on your schedule. Runbooks should contain a methodical checklist of things to look for and common actions to take, and they should be reviewed and tested regularly.

But if you do find yourself in the deep end with no guide, here are a few things I’ve learned along the way when dealing with an outage:

  • Ask for help. If you feel out of your depth, call someone for help quickly. Nobody should think less of you for doing this.
  • Communicate. You probably have upstream and downstream services that are impacted – make sure they’re looped in. Get into a rhythm of communicating to stakeholders regularly (don’t be sporadic, don’t be late). If it’s a large outage, make sure someone is assigned to send communication or coordinate if it’s not you (if you don’t know, it’s you).
  • Look for the change. Almost all outages are caused by a change. Newly deployed code, a dependency change, customer usage changes – if you can find something that correlates there’s a good chance you found the issue.
  • Roll back. Most service teams know this, but we often fail to act in the moment. If you have isolated a change that looks like the cause, don’t wait – roll it back. Failures started after a recent deployment? Yep, roll it back.
  • Don’t panic. If the rollback didn’t work and you need to deploy new changes, don’t be pressured into skipping steps – especially staging and your regular deployment strategy. The last thing you want is more cascading errors you can’t keep track of.
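To make "look for the change" a little more concrete, here is a minimal sketch of the idea: given a log of recent changes and the time failures started, list whatever changed shortly beforehand, newest first. The `Change` record, the `suspect_changes` function, and the two-hour lookback window are all hypothetical names and values chosen for illustration; your deploy tooling will have its own shape.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """One recorded change (hypothetical shape; adapt to your own change log)."""
    when: datetime
    kind: str          # e.g. "deploy", "dependency", "config"
    description: str


def suspect_changes(changes, failures_started, lookback=timedelta(hours=2)):
    """Return changes made within `lookback` before failures began.

    Results are sorted newest first, on the assumption that the most
    recent change is usually the first rollback candidate.
    """
    window_start = failures_started - lookback
    recent = [c for c in changes if window_start <= c.when <= failures_started]
    return sorted(recent, key=lambda c: c.when, reverse=True)
```

A correlation in time is only a lead, not proof, but it gives you an ordered list of rollback candidates instead of a guess.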

Practice regularly – especially rollbacks. You should be very confident that your runbooks work and that rollbacks will be safe. Then, when the time comes, you’ll have less hesitation to act.

A special note on backups: if you don’t have a plan to test your backups (by regularly running through the restore runbook), there’s a good chance something will go wrong when you need them. A wise manager once told me, “If you never tested your backup, you don’t have a backup.”
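One simple way to test a file-based backup is to restore it somewhere scratch and compare it against the source. The sketch below assumes exactly that setup: two directories on disk, compared file by file with SHA-256. The function names are hypothetical, and database backups would need a different kind of check (for example, restoring and running queries).

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of a file, read in chunks so large files don't fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(original_dir: Path, restored_dir: Path) -> list[str]:
    """Return relative paths that are missing or differ in the restored copy.

    An empty list means every file under original_dir was restored intact.
    """
    problems = []
    for src in sorted(original_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(original_dir)
        dst = restored_dir / rel
        if not dst.is_file() or file_digest(dst) != file_digest(src):
            problems.append(rel.as_posix())
    return problems
```

Running a check like this on a schedule, as a step in the restore runbook, is what turns "we have backups" into something you have actually seen work.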

Here are some good starter links on runbooks if you need them:

Good luck!

-Gary
