An incident refers to any event that disrupts the normal functioning of your product or service. It’s an anomaly and is not a feature of normal operations. The SLA agreed between you and the end-user defines what normal operation means and what deviation is accepted. Incidents could be in the form of an outage, degraded performance, or a malfunction.
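To make "what deviation is accepted" concrete, an availability target can be turned into a downtime budget. The sketch below is only an illustration: the function names and the 99.9% figure are assumptions, not values taken from any particular SLA.

```python
from datetime import timedelta


def allowed_downtime(availability_pct: float, period: timedelta) -> timedelta:
    """Return the downtime an availability target tolerates over a period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period * (1 - availability_pct / 100)


def breaches_sla(observed: timedelta, availability_pct: float,
                 period: timedelta) -> bool:
    """True when the observed downtime exceeds the tolerated budget."""
    return observed > allowed_downtime(availability_pct, period)


# Hypothetical target: 99.9% availability over a 30-day month
# leaves a budget of roughly 43 minutes of downtime.
budget = allowed_downtime(99.9, timedelta(days=30))
print(budget)
```

Under this assumed target, an hour-long outage in a month would count as a deviation beyond what the SLA accepts, while a ten-minute blip would not.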
Incident response refers to the actions taken to ensure your services return to normalcy. It’s important as incidents have a direct impact on how customers interact with your product or service. According to a 2019 Forrester report, incidents impact the bottom line and the reputation of organizations.
To ensure services are returned to normal as soon as possible, the concept of incident response came into being. Detection, investigation, communication, and closure are the broad components of an incident response framework.
An incident response framework specifies the actions to be taken to ensure normal services are resumed. It lays the groundwork that helps ensure customers, who depend on your services, aren’t inconvenienced, and your support teams are informed and equipped to work towards a resolution. Organizations may use the framework to develop an incident plan that details the specific sequence of steps to be taken in the event of an incident.
Incident management is of particular importance in today’s world where workforces are getting increasingly distributed. Whether it’s your own employees or it’s your customers, they’re all interacting with your services from various parts of the world. Communication and collaboration become harder still, which is where a robust framework comes in to help.
Finding out whether your services are up and running can be rather straightforward. However, if the news that your service is down routinely comes from employees or, worse, customers, then you’re doing it wrong. This is where monitoring systems help: they can detect degradations and downtime before they affect end-users.
Monitoring systems should be on your list of must-haves, and different types of systems perform different key functions.
Once these monitors detect abnormal events or an incident, they can also be configured to alert specific individuals or teams. In addition, they can even be configured to pass the event on to an event correlation engine or incident management system.
Monitors can also be integrated with ITSM software, like Freshservice, or help desks like Freshdesk, ensuring that a ticket is raised and an agent is assigned to look into the issue.
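The idea of alerting specific individuals or teams can be pictured as a small routing table. This is a minimal sketch under assumed names: the services, severities, and team names are hypothetical, and real monitoring tools expose their own configuration for this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    severity: str  # e.g. "critical", "warning", "info"


# Hypothetical routing rules: (service, severity) -> teams to notify.
ROUTES = {
    ("checkout", "critical"): ["payments-oncall", "incident-manager"],
    ("checkout", "warning"): ["payments-team"],
}
# Fallback when no specific rule matches.
DEFAULT_ROUTE = ["ops-team"]


def recipients_for(alert: Alert) -> list[str]:
    """Look up who should be alerted; fall back to the default team."""
    return list(ROUTES.get((alert.service, alert.severity), DEFAULT_ROUTE))


print(recipients_for(Alert("checkout", "critical")))
print(recipients_for(Alert("search", "info")))
```

Keeping a default route means an unexpected event still reaches someone, rather than being dropped because no rule matched.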
Uptime monitoring systems periodically check whether your website or any other endpoint is up or down and report back if something’s amiss. Some tools, like Freshping, also offer services such as SSL monitoring and DNS monitoring.
Network monitoring systems track various aspects of a network and its operation, such as traffic, bandwidth utilization, and uptime. They can also monitor specific devices and other elements that are integral to the network.
Server monitoring systems provide visibility into the performance of your servers. They primarily monitor accessibility and response time, ensuring users enjoy a seamless experience.
Application monitoring systems provide visibility into the functioning of applications and the supporting infrastructure and networks. They track performance, availability, and the user experience. These monitors can either track real users (real-user monitoring) or replicate certain processes (synthetic monitoring).
Security monitoring systems constantly watch your services for potential security threats. They collect and analyze information to detect suspicious behavior or unauthorized system changes on your network, and help define which types of behavior should trigger alerts.
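The simplest of these monitors, the periodic up/down check on an endpoint, can be sketched with the Python standard library. The function names here are assumptions for illustration, and the rule that any 2xx or 3xx answer counts as "up" is a simplification; hosted tools typically add retries, multiple probe locations, and scheduling on top of this.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code the endpoint answers with."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx/5xx status code.
        return err.code


def is_up(url: str, fetch: Callable[[str], int] = http_status) -> bool:
    """Treat a 2xx/3xx answer as up; network errors and timeouts as down."""
    try:
        return 200 <= fetch(url) < 400
    except OSError:
        # URLError, connection failures, and timeouts are all OSError subclasses.
        return False
```

A scheduler (cron, or a loop with a sleep) would call `is_up` at a fixed interval and raise an alert when it returns `False`. The `fetch` parameter exists so the check can be exercised without touching the network.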
Once an incident has been detected, the next step is to get to the bottom of things. But before that, mapping the incident on a RACI matrix helps clarify the roles of the various stakeholders involved.
RACI stands for Responsible, Accountable, Consulted, and Informed. With each incident being a unique one (hopefully), a RACI matrix would help the most when it comes to defining roles between cross-functional teams.
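A RACI matrix is essentially a table of tasks against roles. The sketch below represents one as a nested dictionary and checks a commonly used convention (exactly one Accountable and at least one Responsible per task). The tasks, role names, and the `raci_problems` helper are all hypothetical.

```python
# Hypothetical matrix: task -> {role: RACI code}
RACI = {
    "Investigate cause": {
        "SRE on-call": "R",
        "Engineering manager": "A",
        "Database team": "C",
        "Support team": "I",
    },
    "Update customers": {
        "Support team": "R",
        "Incident manager": "A",
        "SRE on-call": "C",
    },
}


def raci_problems(matrix: dict[str, dict[str, str]]) -> list[str]:
    """List tasks that break the one-Accountable / some-Responsible convention."""
    problems = []
    for task, roles in matrix.items():
        codes = list(roles.values())
        accountable = codes.count("A")
        if accountable != 1:
            problems.append(
                f"{task}: expected exactly one Accountable, found {accountable}"
            )
        if codes.count("R") < 1:
            problems.append(f"{task}: no one is Responsible")
    return problems


print(raci_problems(RACI))
```

Running a check like this before an incident, rather than during one, is what makes the matrix useful for cross-functional teams.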
Businesses would then use ITSM software, like Freshservice, to assign agents to learn more about the incident. Such software helps with the following:
Here, the agent captures important information like the incident source, services that have been affected, and the date and time at which the incident was reported.
As agents understand the nature of the incident, its effects, the source, and more, the incident is categorized to ensure events of this type can be easily sorted. A sub-category is also generally assigned to the incident. Categorization can also help you look for trends.
Depending on whether the incident is a minor one or a major one, different levels of priority are assigned. This could be a high/medium/low type of classification or an L1, L2, L3… type of classification.
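One common way to arrive at a high/medium/low priority is to combine impact and urgency. The text above does not prescribe this method, so treat the sketch as an assumption: the `Level` scale, the scoring, and the thresholds are all illustrative.

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def priority(impact: Level, urgency: Level) -> str:
    """Map impact and urgency onto a high/medium/low priority.

    Assumed thresholds: the two levels are summed (range 2 to 6);
    5 or more is high, 4 is medium, anything lower is low.
    """
    score = impact + urgency
    if score >= 5:
        return "high"
    if score == 4:
        return "medium"
    return "low"


print(priority(Level.HIGH, Level.HIGH))  # a major outage needing immediate action
print(priority(Level.LOW, Level.MEDIUM))  # a minor, non-urgent issue
```

An L1, L2, L3 scheme would work the same way, with only the returned labels changed.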
That said, incident management software alone may not be enough to manage incidents effectively. It needs to work in conjunction with a configuration management database (CMDB) to accurately pinpoint the cause.
A CMDB is a vast repository that contains information about your IT environment. Asset and configuration details are stored here along with information about the relationships between them. It can help pinpoint the exact series of changes that resulted in the incident. It can also help create an action-priority matrix to better decide on which incidents need to be prioritized.
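To illustrate how stored relationships help narrow down a cause, the sketch below models a tiny CMDB as a dependency map plus a change log, then lists recent changes on anything the affected item depends on. All item names, dates, and the 24-hour window are made up for the example; a real CMDB would be queried through its own interface.

```python
from collections import deque
from datetime import datetime, timedelta

# Hypothetical relationships: configuration item -> items it depends on.
DEPENDS_ON = {
    "web-frontend": ["checkout-api"],
    "checkout-api": ["orders-db", "payment-gateway"],
    "orders-db": [],
    "payment-gateway": [],
}

# Hypothetical change log: (item, time of change, description).
CHANGES = [
    ("orders-db", datetime(2024, 1, 10, 9, 30), "schema migration"),
    ("payment-gateway", datetime(2024, 1, 3, 14, 0), "certificate renewal"),
    ("web-frontend", datetime(2024, 1, 10, 8, 0), "stylesheet update"),
]


def dependencies_of(item: str) -> set[str]:
    """Return the item plus everything it depends on, directly or indirectly."""
    seen = {item}
    queue = deque([item])
    while queue:
        current = queue.popleft()
        for dep in DEPENDS_ON.get(current, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def candidate_changes(affected: str, incident_time: datetime,
                      window: timedelta = timedelta(hours=24)):
    """Changes within the window on the affected item or its dependencies,
    newest first."""
    scope = dependencies_of(affected)
    hits = [
        change for change in CHANGES
        if change[0] in scope
        and incident_time - window <= change[1] <= incident_time
    ]
    return sorted(hits, key=lambda change: change[1], reverse=True)


for item, when, what in candidate_changes("checkout-api",
                                          datetime(2024, 1, 10, 10, 0)):
    print(item, when, what)
```

In this made-up data, an incident on `checkout-api` points to the `orders-db` schema migration, because the frontend change is not something the API depends on and the gateway change falls outside the window.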
When an incident happens, the obvious priority is to fix it. But while you race to fix the incident, you shouldn’t forget the people affected by the outage.
This is where status page software, like Freshstatus, comes into the picture. A status page is a web page that displays the status of your services and keeps end-users updated about ongoing incidents and scheduled maintenance at all times.
Many status pages let customers subscribe to receive these updates. Customers can also choose to follow a specific incident, and some status page software even allows them to subscribe to alerts for one specific service component.
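Component-level subscriptions boil down to a filter over subscribers. The following is a rough sketch with hypothetical names and addresses; it does not reflect how any particular status page product stores subscriptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Subscription:
    email: str
    component: Optional[str] = None  # None means "all components"


def notify_list(subscriptions: list[Subscription],
                affected: set[str]) -> list[str]:
    """Return the de-duplicated, sorted addresses that should hear about
    an incident touching the given components."""
    return sorted({
        sub.email
        for sub in subscriptions
        if sub.component is None or sub.component in affected
    })


# Hypothetical subscribers.
subs = [
    Subscription("ana@example.com"),              # everything
    Subscription("ben@example.com", "API"),       # API only
    Subscription("cy@example.com", "Dashboard"),  # Dashboard only
]
print(notify_list(subs, {"API"}))
```

Filtering this way means a customer who only cares about one component is not flooded with updates about the others.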
Finally, when the problem has been resolved, you can change the status of the affected services to operational and share a resolution update with your customers.
Resolving an incident does not necessarily mean the case is closed, as the resolution may not be a permanent fix. A detailed analysis needs to be carried out at this juncture to understand why the incident happened and how such incidents can be mitigated. This is followed by a Root Cause or Incident Analysis report. In the event of a major incident that could be repeated, the next step would be to find a more permanent solution.
Once such a solution has been identified, tested, and rolled out, it’s also important to understand whether your customers are happy with it. Once you’ve checked all the boxes, dotted the ‘i’s, and crossed the ‘t’s, you can move on to the final stage.
While the name “post-mortem” may seem a little morbid, this stage ensures that the loops are closed and customers are provided a detailed record of the incident. These reports are also known as major incident reviews or root cause analyses (RCA). Using them to show customers that you’re taking measures to prevent such incidents from happening again also helps build trust.
Here’s what you should cover in your incident post-mortem report: