Incident Management: What, Why, and How?

What

Incident management in software refers to the process of identifying, responding to, and resolving unexpected events or failures that occur within a software system. These incidents can range from minor issues, such as a slow page load, to major outages that can impact business operations.

Why

Effective incident management is critical for maintaining the reliability and availability of software systems. It combines proactive measures, such as monitoring and testing, with reactive measures, such as incident response and recovery.

How

Glossary:

Anomaly: An unexpected deviation from the normal behavior of a production system.

Graduation: The process of upgrading an anomaly to an incident once it is determined to have a significant impact on the business.

Incident: An unanticipated event that requires an immediate response due to its impact on the business.

Severity: A measure of the impact an incident has on the business, typically represented by levels such as SEV0 and SEV1.

Service Playbook: A predefined set of steps to identify or resolve common issues.

Incident Lifecycle

Occurrence

This is the first phase of an anomaly or incident. The occurrence is the moment a deviating event begins to affect production, causing an anomaly that leads to an incident either immediately or eventually, if not addressed.

Detection

In this phase, the deviating event has been detected and reported, either manually or automatically, but no team has started to investigate yet.

Diagnostics & Mitigation

The on-call team first uses observability data to assess the situation. If it finds no significant business or customer impact, it closes the alert and labels it a false positive. If it identifies an impact, it graduates the anomaly to an incident, assigns the appropriate severity level, and triggers incident notifications to the relevant stakeholders.

The team then works to resolve the incident, following the established playbook or devising an alternative solution as necessary. If it needs additional assistance, it escalates the incident to bring in more resources.

Once the incident is resolved, the team continues to monitor it and communicates its status to all stakeholders.
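The triage decision in this phase (close the alert as a false positive, or graduate the anomaly to an incident with a severity) can be sketched as a small function. This is only an illustration: the `Assessment` fields, the threshold of 1000 affected customers, and the assumption that SEV0 is more severe than SEV1 are hypothetical and should be replaced by your organization's own criteria.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    SEV0 = 0  # assumed here to be the most severe level
    SEV1 = 1


@dataclass
class Assessment:
    """Hypothetical summary of what the observability data showed."""
    customers_affected: int
    revenue_at_risk: bool


def triage(assessment: Assessment) -> Optional[Severity]:
    """Return a severity if the anomaly graduates to an incident.

    Returns None when no impact is found, meaning the alert is
    closed and labeled a false positive.
    """
    if assessment.customers_affected == 0 and not assessment.revenue_at_risk:
        return None
    # Illustrative threshold only; real criteria are organization-specific.
    if assessment.revenue_at_risk or assessment.customers_affected >= 1000:
        return Severity.SEV0
    return Severity.SEV1
```

A `None` result corresponds to closing the alert, while any `Severity` result corresponds to graduation and would be followed by stakeholder notification.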

Closure

In this phase, the service should already be up and running and the incident mitigated.
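The four lifecycle phases above can be read as a simple forward-only state machine. The sketch below is one possible way to model that, assuming each phase has exactly one successor; the names `Phase`, `NEXT_PHASE`, and `advance` are illustrative and not part of any particular tool.

```python
from enum import Enum


class Phase(Enum):
    OCCURRENCE = "occurrence"
    DETECTION = "detection"
    DIAGNOSTICS_AND_MITIGATION = "diagnostics_and_mitigation"
    CLOSURE = "closure"


# Allowed forward transitions; CLOSURE has no successor.
NEXT_PHASE = {
    Phase.OCCURRENCE: Phase.DETECTION,
    Phase.DETECTION: Phase.DIAGNOSTICS_AND_MITIGATION,
    Phase.DIAGNOSTICS_AND_MITIGATION: Phase.CLOSURE,
}


def advance(current: Phase) -> Phase:
    """Move to the next lifecycle phase, or raise if already at the end."""
    if current not in NEXT_PHASE:
        raise ValueError(f"{current.name} is the final phase")
    return NEXT_PHASE[current]
```

Recording a timestamp at each call to `advance` would be one way to capture when each phase began.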

Notes on Process

Handling software incidents requires a well-defined incident management process, with clear roles and responsibilities and procedures for incident detection, triage, resolution, and reporting. A dedicated incident management team, with members who are trained and equipped to handle incidents effectively, is equally important.

Procedures for incident recovery are also needed. These include data backup and restoration, as well as testing and validation of the recovered system. Post-incident (postmortem) reviews should be conducted to identify areas for improvement and to prevent similar incidents from occurring in the future.

To continually improve the process, the following metrics should be collected meticulously:

  1. Time to Detect (TTD): Measures the time from the occurrence of a deviating event to its detection.

  2. Time to Acknowledge (TTA): Measures the time from the detection of a deviating event to the acknowledgement of the on-call notification.

  3. Time to Graduate (TTG): Measures the time from the acknowledgement of the on-call notification to the final assessment of the event impact.

  4. Time to Verify (TTV): Measures the time from the implementation of a potential remediation to the verification of effective system operation.

  5. Time to Close (TTC): Measures the time from full system functionality to the closure of all action items and final fixes.

  6. Time to Recovery (TTR): Measures the time from when the system fails to when it is fully functioning again.
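As a rough illustration, the six metrics above can be derived from a handful of timestamps recorded during an incident. The sketch below assumes a hypothetical `IncidentTimeline` record. It also assumes that "full system functionality" coincides with the verification timestamp and that the moment of failure coincides with the occurrence timestamp; adjust both if your process records them separately.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict


@dataclass
class IncidentTimeline:
    """Hypothetical timestamps captured over one incident."""
    occurred: datetime      # deviating event happens
    detected: datetime      # event is detected and reported
    acknowledged: datetime  # on-call notification acknowledged
    graduated: datetime     # final assessment of the event impact
    remediated: datetime    # potential remediation implemented
    verified: datetime      # effective system operation verified
    closed: datetime        # all action items and final fixes closed


def metrics(t: IncidentTimeline) -> Dict[str, timedelta]:
    """Compute the six metrics as durations between recorded timestamps."""
    return {
        "TTD": t.detected - t.occurred,
        "TTA": t.acknowledged - t.detected,
        "TTG": t.graduated - t.acknowledged,
        "TTV": t.verified - t.remediated,
        # Assumption: the system is fully functional at verification time.
        "TTC": t.closed - t.verified,
        # Assumption: the system fails at the moment of occurrence.
        "TTR": t.verified - t.occurred,
    }
```

Keeping the results as `timedelta` values makes it straightforward to aggregate them across incidents, for example to track a median TTD over time.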

Final note

Overall, incident management in software is a complex discipline that requires technical expertise, effective communication, and a well-defined process. With a robust incident management process in place, organizations can handle incidents effectively and minimize their impact.