A key component of every incident response plan is incident identification. This stage begins when an abnormality is detected in your systems. Broadly speaking, incident identification can be broken down into two components: incident detection and incident investigation.
Finding out whether your services are up and running can be fairly straightforward. However, if you first hear about outages from employees or, worse, customers, then you’re doing it wrong. This is where monitoring systems help: they can detect degradations and downtime before end users are affected.
Monitoring systems should be on your list of must-haves, and different types of systems cover different functions. When a monitor detects an abnormal event or an incident, it can be configured to alert specific individuals or teams. It can also be configured to pass the event on to an event correlation engine or an incident management system.
Uptime monitoring systems periodically check whether your website or any other endpoint is up or down and report back if something’s amiss. Some tools, like Freshping, also offer services such as SSL monitoring and DNS monitoring.
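A minimal sketch of a single up/down probe, using only the Python standard library. The function names `http_status` and `check_endpoint`, the five-second timeout, and the rule that any 2xx or 3xx status counts as "up" are assumptions made for illustration; this is not how Freshping or any particular product works.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if no response arrives."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 500.
        return err.code
    except OSError:
        # Covers URLError, refused connections, DNS failures, and timeouts.
        return 0


def check_endpoint(url: str, fetch: Callable[[str], int] = http_status) -> str:
    """Classify an endpoint as "up" or "down" from one probe (assumed rule)."""
    status = fetch(url)
    return "up" if 200 <= status < 400 else "down"
```

A real monitor would run a probe like this on a schedule and from several locations, and would usually require a few consecutive failures before raising an alert. The `fetch` parameter exists so the classification rule can be exercised without network access.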
Network monitoring systems track various aspects of a network and its operation, such as traffic, bandwidth utilization, and uptime. They can also monitor specific devices and other elements that are integral to the network.
Server monitoring systems provide visibility into the performance of your servers. They primarily monitor accessibility and response time, ensuring users enjoy a seamless experience.
Application performance monitoring systems provide visibility into the functioning of applications and the supporting infrastructure and networks. They track performance, availability, and the user experience. These monitors can either track real users (real-user monitoring) or replicate certain processes (synthetic monitoring).
Security monitoring systems constantly watch your services for potential security threats. They collect and analyze information to detect suspicious behavior or unauthorized system changes on your network, and help you define which types of behavior should trigger alerts.
Once an incident is detected, the next step is to get to the bottom of things.
The investigation starts with logging the incident. Here, the agent captures important information like the incident source, the services that have been affected, and the date and time at which the incident was reported.
As agents understand the nature of the incident, its effects, the source, and more, the incident is categorized to ensure events of this type can be easily sorted. A sub-category is also generally assigned to the incident. Categorization can also help you look for trends.
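The logging and categorization steps can be pictured as a small record type plus a tally per category. This is only a sketch: the `Incident` class, its field names, and the sample incidents are hypothetical and are not taken from any ITSM product.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Incident:
    source: str                    # where the report came from
    affected_services: list[str]   # services hit by the incident
    reported_at: datetime          # date and time of the report
    category: str
    sub_category: str


# Hypothetical sample data, for illustration only.
incidents = [
    Incident("uptime monitor", ["checkout-api"],
             datetime(2024, 3, 1, 9, 15, tzinfo=timezone.utc), "Network", "DNS"),
    Incident("customer email", ["login page"],
             datetime(2024, 3, 4, 14, 2, tzinfo=timezone.utc), "Application", "Authentication"),
    Incident("uptime monitor", ["storefront"],
             datetime(2024, 3, 9, 22, 40, tzinfo=timezone.utc), "Network", "DNS"),
]


def category_counts(records: list[Incident]) -> Counter:
    """Count incidents per (category, sub_category) pair to surface trends."""
    return Counter((r.category, r.sub_category) for r in records)


trend = category_counts(incidents)
print(trend.most_common(1))  # [(('Network', 'DNS'), 2)]
```

Even a tally this simple shows why consistent categories matter: a recurring pair such as Network/DNS stands out only if every agent files it the same way.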
Depending on whether the incident is minor or major, it is assigned a different level of priority. An action priority matrix helps you classify the incident by severity.
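One way to picture a priority matrix is as a lookup table. The sketch below assumes the two axes are impact and urgency, which is a common ITSM convention but is not stated in this article; the three levels and the P1 to P4 labels are likewise illustrative.

```python
# Rows are impact, columns are urgency (assumed axes).
# The P1..P4 labels are illustrative, with P1 the most severe.
PRIORITY = {
    "high":   {"high": "P1", "medium": "P2", "low": "P3"},
    "medium": {"high": "P2", "medium": "P3", "low": "P4"},
    "low":    {"high": "P3", "medium": "P4", "low": "P4"},
}


def classify(impact: str, urgency: str) -> str:
    """Look up the priority for an incident; reject unknown levels."""
    if impact not in PRIORITY or urgency not in PRIORITY[impact]:
        raise ValueError(f"unknown level: impact={impact!r}, urgency={urgency!r}")
    return PRIORITY[impact][urgency]
```

For example, under these assumed values an outage that blocks all customers (high impact, high urgency) maps to P1, while a cosmetic glitch on an internal tool (low impact, low urgency) maps to P4.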
Next, mapping the incident on a RACI matrix would help add more clarity regarding the roles of the various stakeholders involved.
RACI stands for Responsible, Accountable, Consulted, and Informed. With each incident being a unique one (hopefully), a RACI matrix is most helpful when it comes to defining roles across cross-functional teams.
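A RACI matrix can be represented as a mapping from tasks to the role each stakeholder plays. The tasks and team names below are hypothetical. The check relies on the widely used RACI guideline that each task has exactly one Accountable party.

```python
# Hypothetical tasks and stakeholders for a single incident.
raci = {
    "Diagnose root cause": {
        "SRE on-call": "R", "Engineering manager": "A",
        "Database team": "C", "Support": "I",
    },
    "Update status page": {
        "Support": "R", "Incident manager": "A",
        "SRE on-call": "C", "Engineering manager": "I",
    },
    "Roll back release": {
        "SRE on-call": "R", "Database team": "C", "Support": "I",
    },
}


def tasks_without_single_accountable(matrix: dict[str, dict[str, str]]) -> list[str]:
    """Return tasks that do not have exactly one Accountable ("A") party."""
    return [
        task for task, roles in matrix.items()
        if list(roles.values()).count("A") != 1
    ]


print(tasks_without_single_accountable(raci))  # ['Roll back release']
```

In this sample, "Roll back release" is flagged because nobody is marked Accountable, which is the kind of gap a RACI exercise is meant to expose before an incident rather than during one.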
Businesses then use ITSM software, like Freshservice, to assign agents who take ownership of the incident and are responsible for resolving it within a stipulated SLA.
That said, incident management software alone may not be enough to resolve incidents effectively. It needs to work in conjunction with a configuration management database (CMDB) to accurately pinpoint the cause.
A CMDB is a vast repository that contains information about your IT environment. Asset and configuration details are stored here along with information about the relationships between them. It can help pinpoint the exact series of changes that resulted in the incident. It can also help create an action-priority matrix to better decide on which incidents need to be prioritized.
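To illustrate how stored relationships help narrow down a cause, the sketch below models a tiny CMDB as a dependency map plus a list of recent changes per configuration item. All item names, the `DEPENDS_ON` and `RECENT_CHANGES` structures, and both helper functions are hypothetical; real CMDBs expose far richer data through their own interfaces.

```python
from collections import deque

# Hypothetical configuration items and what each one depends on.
DEPENDS_ON = {
    "storefront": ["checkout-api"],
    "checkout-api": ["payments-db", "auth-service"],
    "auth-service": [],
    "payments-db": [],
}

# Hypothetical recent changes recorded against configuration items.
RECENT_CHANGES = {
    "payments-db": ["schema migration"],
    "storefront": ["CSS update"],
}


def items_in_scope(item: str, depends_on: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk: the item itself plus everything it depends on."""
    seen = [item]
    queue = deque([item])
    while queue:
        current = queue.popleft()
        for dep in depends_on.get(current, []):
            if dep not in seen:
                seen.append(dep)
                queue.append(dep)
    return seen


def candidate_changes(item, depends_on, changes):
    """Recent changes on the affected item or any of its dependencies."""
    return {ci: changes[ci] for ci in items_in_scope(item, depends_on) if ci in changes}


print(candidate_changes("checkout-api", DEPENDS_ON, RECENT_CHANGES))
# {'payments-db': ['schema migration']}
```

Here an incident on `checkout-api` points to the schema migration on `payments-db`, while the unrelated CSS update on `storefront` is left out because `checkout-api` does not depend on it.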
Once you’re sure of the cause of the incident, it’s time to decide whether you need to communicate it to customers. In the case of small incidents that don’t affect customers, you may choose to handle them without any external communication. However, if your end customers are inconvenienced, making sure the message reaches them is of paramount importance.