Incident management refers to the actions taken to restore your services to normal operation after a disruption. It's important because incidents directly affect how customers interact with your product or service.
According to a 2019 Forrester report, incidents impact the bottom line and the reputation of organizations.
There are six major phases in managing an incident: preparation, identification, containment, eradication, recovery, and post-incident review.
Preparation covers developing the incident management framework and plan, as well as readying the organization to handle incidents. Whether an incident is minor and invisible to customers or large enough to affect them, a well-prepared incident response plan covers it.
This stage helps organizations objectively define the severity of the incident and enables teams to take appropriate action.
Teams can also run mock drills at this stage so that the knowledge gained isn't merely theoretical.
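As a rough illustration of what defining severity objectively can look like, the sketch below maps observable facts to a severity level using fixed thresholds. The level names, the two input fields, and the 50% and 5% cut-offs are assumptions made for the example, not values from any standard; a real plan would use thresholds agreed on during preparation.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    # Hypothetical severity levels; name and number them to suit your plan.
    SEV1 = "critical"
    SEV2 = "major"
    SEV3 = "minor"


@dataclass(frozen=True)
class IncidentSignal:
    customers_affected_pct: float  # share of customers affected, 0 to 100
    core_service_down: bool        # is a customer-facing core service unavailable?


def classify(signal: IncidentSignal) -> Severity:
    """Map observable facts to a severity using fixed, pre-agreed thresholds."""
    # Assumed thresholds: tune these during the preparation stage.
    if signal.core_service_down or signal.customers_affected_pct >= 50:
        return Severity.SEV1
    if signal.customers_affected_pct >= 5:
        return Severity.SEV2
    return Severity.SEV3


if __name__ == "__main__":
    print(classify(IncidentSignal(customers_affected_pct=12.0, core_service_down=False)))
```

Because the rules are written down ahead of time, two responders looking at the same facts should arrive at the same severity.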
The identification stage begins when the incident occurs and enables organizations to gauge its nature and severity.
Identification can be subdivided into detection and investigation. Detection typically happens through dedicated monitoring software or through customers or end-users.
Once the incident is detected, the next step is to learn more about it and determine who should be involved in the resolution. An action-priority matrix and a RACI matrix can help with that.
The incident is then typically logged in an ITSM tool, and the individuals identified as responsible in the RACI matrix are assigned to investigate it.
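The triage step described above can be sketched as two lookups. This is a minimal illustration that assumes the action-priority matrix ranks incidents by impact and urgency (one common layout) and that the RACI matrix is kept per service. The service name, team names, and priority labels are hypothetical, and the returned dictionary stands in for a ticket rather than any real ITSM API.

```python
# Hypothetical action-priority matrix: (impact, urgency) -> priority label.
PRIORITY_MATRIX = {
    ("high", "high"): "P1",
    ("high", "low"): "P2",
    ("low", "high"): "P3",
    ("low", "low"): "P4",
}

# Hypothetical RACI matrix, keyed by service name.
RACI = {
    "payments-api": {
        "responsible": ["payments-oncall"],
        "accountable": ["payments-eng-manager"],
        "consulted": ["database-team"],
        "informed": ["customer-support"],
    },
}


def triage(service: str, impact: str, urgency: str) -> dict:
    """Build a ticket-like record: priority from the matrix, people from RACI."""
    priority = PRIORITY_MATRIX[(impact, urgency)]
    roles = RACI[service]
    return {
        "service": service,
        "priority": priority,
        # Those marked "responsible" are assigned to investigate.
        "assignees": list(roles["responsible"]),
        # Those marked "accountable" or "informed" receive updates.
        "notify": roles["accountable"] + roles["informed"],
    }


if __name__ == "__main__":
    print(triage("payments-api", impact="high", urgency="high"))
```

Keeping both matrices as plain data means they can be reviewed and updated during preparation without touching the triage logic.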
The goal of the containment stage is twofold: limit the impact of the incident and regain control of your systems.
To this end, the affected component should be isolated from the rest of the system to limit the damage. If you have backups or redundant systems, bring them online in place of the affected ones.
Different incidents may need different strategies, so it's important to craft both a short-term and a long-term containment strategy.
Having an ITSM tool with a well-configured CMDB (configuration management database), like Freshservice, can also help you pinpoint the source of the incident and take action.
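To show how dependency data from a CMDB can guide containment, here is a small sketch that finds everything affected by a failed component. It assumes the CMDB can be exported as a simple map from each configuration item to the items it depends on; the item names are hypothetical and this is not the Freshservice API.

```python
from collections import deque

# Hypothetical CMDB extract: each item maps to the items it depends on.
DEPENDS_ON = {
    "checkout-web": ["payments-api", "cart-service"],
    "payments-api": ["payments-db"],
    "cart-service": ["cache"],
    "payments-db": [],
    "cache": [],
}


def impacted_by(failed_item: str, depends_on: dict) -> set:
    """Return every item that directly or transitively depends on failed_item."""
    # Invert the map so we can walk from a component to its dependents.
    dependents = {}
    for item, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(item)

    # Breadth-first walk outward from the failed component.
    seen = set()
    queue = deque([failed_item])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


if __name__ == "__main__":
    print(sorted(impacted_by("payments-db", DEPENDS_ON)))
```

The resulting set is a starting list of components to isolate or fail over; it is only as accurate as the dependency data recorded in the CMDB.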
The eradication stage deals with removing the cause of the incident, whether that means fixing code, deleting compromised accounts, or taking any other action that helps ensure the incident doesn't happen again.
The learnings from this stage are documented and fed into the company's incident response training program.
Owing to the varied nature of incidents, it's also possible that no eradication action is taken at this stage and the cause of the incident is instead addressed over a longer period of time.
The purpose of the recovery stage is to return the affected system to the production environment after the root cause has been identified and resolved.
This could be done by restoring systems from backups, rebuilding systems, installing patches, replacing files, and adding monitors.
In the case of larger incidents, eradication could be part of the recovery process. In such cases, a phased approach should be taken. In the short run, systems should be secured and restored so that the incident doesn't crop up again. In the long run, larger infrastructure changes can be introduced to make your systems as stable and secure as possible.
After an incident is resolved, every organization should produce a document that explains the cause of the incident and gives a detailed account of the actions taken.
Doing this ensures that incidents and their resolutions are documented, serving as a guide on how to keep the incident from cropping up again and what can be done to resolve it quickly if it does.
Once the incident is resolved, the teams and individuals identified in the identification stage should meet to discuss the incident report. Any valuable learnings or examples can be included in the training modules for new members.
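One possible shape for the incident report described above is sketched below. It records the cause, a timestamped account of the actions taken, and prevention notes, then renders them as plain text for the review meeting. The field names and output layout are assumptions for illustration, not a prescribed template.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Action:
    taken_at: datetime
    description: str


@dataclass
class IncidentReport:
    title: str
    cause: str
    actions: List[Action] = field(default_factory=list)
    prevention: List[str] = field(default_factory=list)

    def log(self, description: str, taken_at: Optional[datetime] = None) -> None:
        """Record an action; defaults to the current UTC time."""
        self.actions.append(Action(taken_at or datetime.now(timezone.utc), description))

    def render(self) -> str:
        """Produce a plain-text report with actions in chronological order."""
        lines = [f"Incident: {self.title}", f"Cause: {self.cause}", "Actions taken:"]
        for action in sorted(self.actions, key=lambda a: a.taken_at):
            lines.append(f"  {action.taken_at.isoformat()} {action.description}")
        lines.append("Prevention:")
        lines.extend(f"  {note}" for note in self.prevention)
        return "\n".join(lines)


if __name__ == "__main__":
    report = IncidentReport(title="Checkout outage", cause="Expired TLS certificate")
    report.log("Failed over to standby")
    report.log("Renewed certificate")
    report.prevention.append("Alert 30 days before certificate expiry")
    print(report.render())
```

Logging actions as they happen, rather than reconstructing them afterwards, tends to give a more accurate timeline for the review.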