A key component of every incident response plan is incident identification. This stage begins when an abnormality is detected in your systems. Broadly speaking, incident identification can be broken down into two components: incident detection and incident investigation.
Finding out whether your services are up and running can be rather straightforward. However, if information about outages comes from employees, or worse, customers, then you’re doing it wrong. This is where monitoring systems can help: they can detect degradation and downtime before they affect end users.
Monitoring systems should be on your list of must-haves, and different types of systems perform different key functions. Once these monitors detect an abnormal event or an incident, they can be configured to alert specific individuals or teams. They can also be configured to pass the event on to an event correlation engine or an incident management system.
Uptime monitoring systems periodically check whether your applications, services, or systems are up or down from the end-user perspective and report back if something’s amiss. This is often referred to as synthetic monitoring. Some tools, like Freshping, also offer services such as SSL monitoring and DNS monitoring.
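As a rough illustration of what a single synthetic check does, here is a minimal Python sketch. It is not how Freshping or any particular tool works; the function names `http_status` and `probe`, and the rule that any 2xx or 3xx status counts as "up", are assumptions made for this example.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Fetch the URL and return its HTTP status code (raises on network failure)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; the status code is still useful.
        return err.code


def probe(url: str, fetch: Callable[[str], int] = http_status) -> str:
    """Classify one synthetic check as 'up' or 'down' (hypothetical rule: 2xx/3xx is up)."""
    try:
        status = fetch(url)
    except OSError:
        # URLError, timeouts, and connection errors are all subclasses of OSError.
        return "down"
    return "up" if 200 <= status < 400 else "down"
```

A real monitor would run a check like this on a schedule and from several locations, and would usually wait for a few consecutive failures before alerting anyone. The `fetch` parameter exists only so the check can be exercised without a network connection.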
Network monitoring systems track various aspects of a network and its operation, such as traffic, bandwidth utilization, and uptime. They can also monitor specific devices and other elements that are integral to the network.
Server monitoring systems provide visibility into the performance of your servers and their operating systems. They primarily monitor availability, response time, and server health to ensure that the applications hosted on those servers are getting optimal hardware allocation.
Application performance monitoring systems provide visibility into the functioning of applications and their supporting infrastructure. They help monitor application load using different measures (e.g., transactions per second (TPS), requests per second, pages per second) along with the corresponding response times.
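To make the load and response-time measures concrete, here is a small Python sketch that summarizes one observation window. The function name `summarize_load` is hypothetical, and reporting a 95th percentile alongside the mean is an assumption of this example rather than something any specific tool prescribes.

```python
import math


def summarize_load(response_times_ms: list[float], window_seconds: float) -> dict[str, float]:
    """Summarize the requests observed during one monitoring window."""
    if not response_times_ms or window_seconds <= 0:
        raise ValueError("need at least one sample and a positive window")
    ordered = sorted(response_times_ms)
    # Nearest-rank 95th percentile: the smallest value covering 95% of samples.
    rank = math.ceil(0.95 * len(ordered))
    return {
        "requests_per_second": len(ordered) / window_seconds,
        "mean_ms": sum(ordered) / len(ordered),
        "p95_ms": ordered[rank - 1],
    }


# Four requests observed over a two-second window.
print(summarize_load([100, 200, 300, 400], window_seconds=2))
```

Tracking a high percentile as well as the mean helps surface slow requests that an average would hide.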
Security monitoring systems constantly monitor your services for potential security threats. They collect and analyse information to detect suspicious behaviour or unauthorized system changes on your network, and help to proactively mitigate potential threats before they become a crisis.
Once an incident has been detected, the next step is to get to the bottom of things.
First, the agent logs the incident, capturing important information such as the incident source, the services that have been affected, and the date and time at which the incident was reported.
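The details captured at this point can be pictured as a simple record. The sketch below is only an illustration: the class name `IncidentRecord` and its field names are assumptions, not the schema of any ITSM product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentRecord:
    """Hypothetical incident record holding the details captured at logging time."""
    source: str                   # where the report came from, e.g. a monitor name
    affected_services: list[str]  # services known to be impacted so far
    # Timestamp of the report, stored in UTC so records are comparable across teams.
    reported_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


record = IncidentRecord(source="uptime-monitor", affected_services=["checkout"])
print(record)
```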
Agents then work out the nature of the incident, its source, its impact, and other relevant information, and categorize the incident so that the right stakeholders are brought in to resolve it as early as possible.
Depending on whether the incident is a minor one or a major one, different levels of priority are assigned. An action priority matrix would help you classify the incident by severity.
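One common way to build such a matrix is to cross impact with urgency. The Python sketch below assumes that approach; the two-level scale and the labels P1 to P3 are illustrative choices, and your own matrix may use more levels or different names.

```python
# Hypothetical 2x2 matrix keyed by (impact, urgency); values are priority labels.
PRIORITY_MATRIX = {
    ("high", "high"): "P1",
    ("high", "low"): "P2",
    ("low", "high"): "P2",
    ("low", "low"): "P3",
}


def prioritize(impact: str, urgency: str) -> str:
    """Look up the priority for an incident; raises KeyError for unknown levels."""
    return PRIORITY_MATRIX[(impact.lower(), urgency.lower())]


print(prioritize("High", "Low"))
```

Keeping the matrix as data rather than nested conditionals makes it easy to review and adjust as your definition of severity changes.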
Next, mapping the incident on a RACI matrix would help add more clarity regarding the roles of the various stakeholders involved.
RACI stands for Responsible, Accountable, Consulted, and Informed. With each incident being a unique one (hopefully), a RACI matrix is most useful for defining roles across cross-functional teams.
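A RACI matrix can be written down as a simple mapping from tasks to roles. The sketch below is illustrative: the tasks and team names are made up, and the check it performs (exactly one Accountable and at least one Responsible per task) reflects a commonly used convention rather than a rule stated in this article.

```python
# Hypothetical RACI matrix: task -> {stakeholder: "R" | "A" | "C" | "I"}.
RACI = {
    "Investigate root cause": {
        "On-call engineer": "R",
        "Incident manager": "A",
        "Database team": "C",
        "Support team": "I",
    },
    "Update status page": {
        "Support team": "R",
        "Incident manager": "A",
        "On-call engineer": "C",
    },
}


def raci_problems(matrix: dict[str, dict[str, str]]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the matrix looks sane."""
    problems = []
    for task, roles in matrix.items():
        codes = list(roles.values())
        if codes.count("A") != 1:
            problems.append(f"{task}: expected exactly one Accountable, found {codes.count('A')}")
        if "R" not in codes:
            problems.append(f"{task}: nobody is Responsible")
    return problems


print(raci_problems(RACI))
```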
Businesses would then use ITSM software, like Freshservice, to assign agents who would take ownership of the incident and be responsible for its resolution within a stipulated SLA.
That said, incident management software alone may not be enough to resolve incidents effectively. It needs to work in conjunction with a configuration management database (CMDB) to accurately pinpoint the cause.
A CMDB is a vast repository that contains information about your IT environment. Asset and configuration details are stored here along with information about the relationships between them. It can help pinpoint the exact series of changes that resulted in the incident. It can also help create an action-priority matrix to better decide on which incidents need to be prioritized.
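To show why stored relationships matter, here is a minimal Python sketch that walks a dependency map to find recent changes that could explain an incident. Everything in it is hypothetical: the item names, the change notes, and the helpers `upstream_items` and `suspect_changes` are invented for illustration and do not represent any CMDB product's API.

```python
from collections import deque

# Hypothetical CMDB extract: which configuration items each item depends on.
DEPENDS_ON = {
    "checkout": ["payments-api", "web-frontend"],
    "payments-api": ["postgres-primary"],
    "web-frontend": [],
}

# Hypothetical change log keyed by configuration item.
RECENT_CHANGES = {
    "postgres-primary": ["storage volume resized"],
    "email-service": ["TLS certificate renewed"],
}


def upstream_items(item: str, depends_on: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk returning every item the given one depends on, directly or not."""
    seen = {item}
    order = []
    queue = deque([item])
    while queue:
        current = queue.popleft()
        for dep in depends_on.get(current, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order


def suspect_changes(item: str, depends_on: dict[str, list[str]],
                    changes: dict[str, list[str]]) -> dict[str, list[str]]:
    """Collect recent changes on the affected item and on everything upstream of it."""
    candidates = [item] + upstream_items(item, depends_on)
    return {ci: changes[ci] for ci in candidates if ci in changes}


print(suspect_changes("checkout", DEPENDS_ON, RECENT_CHANGES))
```

In this made-up example, the change to `email-service` is ignored because `checkout` does not depend on it, which is the kind of narrowing a relationship map makes possible.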
Once you’re sure of the cause of the incident, it’s time to decide whether you need to communicate it to customers. For small incidents that don’t affect customers, you may choose to handle them without any external communication. However, if your end customers are inconvenienced, making sure the message reaches them is of paramount importance.