Runbook Best Practices

Runbooks, when used correctly, can not only automate your processes but also help organize different responses across various situations and teams.

Runbooks constantly poll incidents for matching conditions, which enables powerful "layering" and "escalation" abilities. For example, when an incident is "escalated" from SEV2 to SEV1, the SEV1 Runbook attaches and fires SEV1-specific automation.
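As a rough illustration of this re-evaluation, here is a minimal Python sketch. It is a toy model, not FireHydrant's implementation or API: the Incident and Runbook classes, the evaluate function, and the Runbook names are all hypothetical, and on the platform this matching happens for you.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Incident:
    severity: str
    attached: List[str] = field(default_factory=list)  # names of attached Runbooks


@dataclass
class Runbook:
    name: str
    condition: Callable[[Incident], bool]  # hypothetical attach condition


def evaluate(incident: Incident, runbooks: List[Runbook]) -> List[str]:
    """Attach every Runbook whose condition now matches; return the newly attached names."""
    newly_attached = []
    for runbook in runbooks:
        if runbook.name not in incident.attached and runbook.condition(incident):
            incident.attached.append(runbook.name)
            newly_attached.append(runbook.name)
    return newly_attached


# Hypothetical Runbooks, each conditioned on one exact severity.
runbooks = [
    Runbook("SEV2 Runbook", lambda i: i.severity == "SEV2"),
    Runbook("SEV1 Runbook", lambda i: i.severity == "SEV1"),
]

incident = Incident(severity="SEV2")
first_pass = evaluate(incident, runbooks)   # only the SEV2 Runbook matches

incident.severity = "SEV1"                  # the incident is escalated
second_pass = evaluate(incident, runbooks)  # the SEV1 Runbook attaches on re-evaluation
```

In this model, the Runbook attached at SEV2 stays attached after the escalation, and the SEV1 Runbook is layered on top of it.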

If you haven't already, you can read Runbooks Basics and Runbook Conditions to learn how they work, as they play key roles in the examples presented below.

Below are examples of common paradigms we've seen FireHydrant customers use.

Common, Organization-Wide Response

Many organizations, especially smaller ones, may find value in keeping automation simple. This can mean having just a single primary Runbook that will Always attach and handle most of the necessities. The most common Runbook steps include:

The steps above help promote robust, consistent incident response regardless of the situation.

Severity-Based Response

Some organizations will have different responses based on differing severities of an incident. They may start from the lowest severity, add the most common steps, and escalate from there. For example:

The above example uses a layering strategy to intensify the response as an incident progressively becomes more severe. Most notably, it works for both new incidents that start at a specific severity and when incidents escalate and change severity.
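The layering idea can be modeled with a short Python sketch. It assumes each Runbook is conditioned on "this severity or worse", which is one possible way to configure layers and not necessarily how your conditions are set up. The severity ranking, the Runbook names, and both helper functions are hypothetical.

```python
from typing import Dict, List

# Hypothetical severity ranking: a higher number means a more severe incident.
SEVERITY_RANK: Dict[str, int] = {"SEV3": 1, "SEV2": 2, "SEV1": 3}

# Hypothetical layers: each Runbook attaches at its minimum severity or anything worse.
LAYERS: Dict[str, str] = {
    "Baseline Runbook": "SEV3",
    "SEV2 Runbook": "SEV2",
    "SEV1 Runbook": "SEV1",
}


def matching_runbooks(severity: str) -> List[str]:
    """Return every layer whose minimum severity is met by the given severity."""
    rank = SEVERITY_RANK[severity]
    return [name for name, minimum in LAYERS.items() if SEVERITY_RANK[minimum] <= rank]


def escalate(attached: List[str], new_severity: str) -> List[str]:
    """Return the already attached Runbooks plus any layers the new severity matches."""
    return attached + [n for n in matching_runbooks(new_severity) if n not in attached]


# Path 1: the incident is opened directly at SEV1.
opened_at_sev1 = matching_runbooks("SEV1")

# Path 2: the incident is opened at SEV3, then escalated step by step.
escalated = escalate(matching_runbooks("SEV3"), "SEV2")
escalated = escalate(escalated, "SEV1")
```

Under these assumptions, both paths end with the same set of attached Runbooks, which is the property that makes layering work for new and escalated incidents alike.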

NOC and Team-Based Response

Some organizations have not adopted Service Ownership and instead always have a front-line team respond to incidents first. That team will try to resolve the incident itself and may call on extra help when needed.

This strategy works well when an organization always has an initial responding team that tries to diagnose the issue. Those responders typically know which additional teams need to be called in and summon them directly.

Generally, FireHydrant recommends using Function/Service-Based response (see next section), but the NOC-type response is still very common throughout the industry and FireHydrant's Runbooks will help standardize automation and response regardless of paradigm.

Function/Service-Based Response

This paradigm is similar to the team-based one, but the focus is on functionality or service impact. The responding teams follow from that impact rather than being chosen directly.

With FireHydrant's Service Catalog, you can link together Services with their respective Functionalities. So if a ticket comes in from a customer, or an internal, non-engineering colleague notices something wrong with the website, they already know enough to declare an incident and pull in the #web-team.
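To make the lookup concrete, here is a small Python sketch of resolving an impacted functionality to the team channels that should be pulled in. The catalog contents, the service names, the #payments-team channel, and the channels_for helper are all hypothetical; this is not the Service Catalog's data model or API.

```python
from typing import Dict, List

# Hypothetical catalog data: which services back each functionality.
FUNCTIONALITY_SERVICES: Dict[str, List[str]] = {
    "Website": ["web-frontend", "cdn"],
    "Checkout": ["payments-api", "web-frontend"],
}

# Hypothetical ownership data: which team channel responds for each service.
SERVICE_OWNER_CHANNEL: Dict[str, str] = {
    "web-frontend": "#web-team",
    "cdn": "#web-team",
    "payments-api": "#payments-team",
}


def channels_for(functionality: str) -> List[str]:
    """Return the owning team channels for an impacted functionality, without duplicates."""
    channels: List[str] = []
    for service in FUNCTIONALITY_SERVICES.get(functionality, []):
        channel = SERVICE_OWNER_CHANNEL[service]
        if channel not in channels:
            channels.append(channel)
    return channels


website_channels = channels_for("Website")
checkout_channels = channels_for("Checkout")
```

The person declaring the incident only needs to name the impacted functionality; the teams fall out of the links between functionalities, services, and owners.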

Combination

Once FireHydrant users become familiar with the platform, the most common approach we see is a combination of the previous paradigms. For example:

Users will most often segregate specific automation by severity and impacted functions within the platform. They may also have dedicated Runbooks specific to teams if individual teams have different naming conventions for their channels and tickets or require specific webhooks, bookmarks, notifications, etc.

Other Tips

Keep Runbooks specific

Try to keep Runbooks trim, descriptive, and generally single-purpose, much like a well-written function in a programming language. This simplifies diagnosing issues and maintaining your Runbooks over the long run.

Align the organization

Given the flexibility of Runbooks, it helps to ensure your organization is aligned on how to organize them. Conflicting methodologies may lead to the same steps being repeated across multiple Runbooks. For example, if you have a common Runbook that all incidents and teams share, make sure team members know this when they construct their own Runbooks.
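One way to spot this kind of overlap is to list your Runbooks and their steps and look for steps that appear in more than one Runbook. The Python sketch below does this over hand-written sample data. The Runbook names, step names, and the repeated_steps helper are hypothetical, and the sketch does not read anything from FireHydrant.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical inventory: Runbook name -> the steps it contains.
RUNBOOKS: Dict[str, List[str]] = {
    "Common Runbook": ["Create incident channel", "Notify stakeholders"],
    "Web Team Runbook": ["Create incident channel", "Page web on-call"],
    "Payments Team Runbook": ["Page payments on-call"],
}


def repeated_steps(runbooks: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each step that appears in more than one Runbook to the Runbooks containing it."""
    owners: Dict[str, List[str]] = defaultdict(list)
    for name, steps in runbooks.items():
        for step in steps:
            if name not in owners[step]:
                owners[step].append(name)
    return {step: names for step, names in owners.items() if len(names) > 1}


overlap = repeated_steps(RUNBOOKS)
```

Here the overlap shows that the Web Team Runbook repeats a step the Common Runbook already provides, which is the kind of duplication worth discussing before it fires twice on a real incident.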

Test your Runbooks

All Runbooks have a Test button that allows you to execute that Runbook in isolation. This is useful for testing out new steps or changes you've just made. Test incidents have GAMEDAY severity, so they won't impact your analytics, and they will not attach any other Runbooks.

Next Steps

  • Build your Runbooks!
  • Learn more about our Service Catalog and how it can streamline your response process in conjunction with Runbooks
  • Browse the Available Runbook Steps