ITSM Incident Management: Tutorial & Best Practices


Incident management has become a business-critical function in a world where users of web and mobile applications expect response times measured in milliseconds. Any IT service failure (be it a cloud outage, a misconfigured firewall, or a software bug) can directly impact business continuity, employee satisfaction, and, in the worst cases, even regulatory compliance. There are countless examples of outages that affected millions of users and could have been handled better.

An optimized incident management (IM) practice helps organizations respond to outages quickly, restore services, minimize disruption and employee dissatisfaction, and learn from each occurrence. 

This article explains the main characteristics of an effective incident management practice and how to set it up and continuously improve it, including choosing the right tooling, considering the human aspect, and measuring success. 

Summary of the key focus areas for an effective incident management practice

Focus area | Description
Incident management practice setup and continual improvement | Define and implement an organizational incident management practice that fits your organization's specific needs, and continuously improve its effectiveness over time.
Selecting the right tool for the job | Ensure that you select and utilize the right incident management tool, equipped with features that are meaningful, efficient, and effectively support your practice.
People aspects | Recognize and engage all key stakeholders involved in incident resolution, including customers, users, internal practitioners, and external suppliers.
Measuring success | Establish and track the appropriate metrics and indicators to accurately measure the effectiveness and efficiency of your incident management practice.

Incident management practice setup and continual improvement 

Setting up an effective IM practice begins with understanding that there is no one-size-fits-all solution. What works in one organization might not work in another. 

The key is to assess your current incident management maturity using established models like the CMMI or the simpler Forrester ITSM Maturity Model. You can start with the recommendations of ITIL 4's IM practice guide and align them with your business's size, complexity, needs, resources, and risk appetite. To be clear, this isn't about merely having a “service desk” but rather about establishing a structured, repeatable framework that spans the entire IT organization and handles disruptions systematically.

Basic Incident Management Flow


This practice definition should encompass the following areas.

Scope and definitions

Clearly articulate the basic definition of an incident and what should be handled via this practice. One of the best-known definitions of an incident is “an unplanned interruption to an IT service or reduction in the quality of an IT service”. But what happens if a service component fails but the service keeps running? How will that be handled? Establishing clear, comprehensive definitions helps incident management teams focus on what is relevant and important, and it prevents non-incidents from “polluting” your analytics and reporting.   

Flow design

Define the lifecycle of an incident, mapping every stage from detection through logging, categorization, prioritization, diagnosis, and resolution to closure. This should be a very detailed document that everyone involved can understand and follow. Tools like Lucidchart or even Freshservice's built-in workflow designer can help create visual flows that automatically update based on incident type.
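As a rough illustration of the lifecycle described above, the stages can be modeled as a small state machine that rejects moves the flow does not allow. This is a minimal sketch, not a prescribed design: the stage names follow the article, while the `Stage` enum, the `ALLOWED` table, the `advance` function, and the reopen path from resolved back to diagnosed are assumptions made for the example.

```python
from enum import Enum


class Stage(Enum):
    DETECTED = "detected"
    LOGGED = "logged"
    CATEGORIZED = "categorized"
    PRIORITIZED = "prioritized"
    DIAGNOSED = "diagnosed"
    RESOLVED = "resolved"
    CLOSED = "closed"


# Allowed transitions. The reopen path (RESOLVED -> DIAGNOSED) is an
# assumption for illustration, not something the article prescribes.
ALLOWED = {
    Stage.DETECTED: {Stage.LOGGED},
    Stage.LOGGED: {Stage.CATEGORIZED},
    Stage.CATEGORIZED: {Stage.PRIORITIZED},
    Stage.PRIORITIZED: {Stage.DIAGNOSED},
    Stage.DIAGNOSED: {Stage.RESOLVED},
    Stage.RESOLVED: {Stage.CLOSED, Stage.DIAGNOSED},
    Stage.CLOSED: set(),
}


def advance(current: Stage, target: Stage) -> Stage:
    """Return the target stage, or raise if the flow does not allow the move."""
    if target not in ALLOWED[current]:
        raise ValueError(f"Cannot move from {current.value} to {target.value}")
    return target
```

Keeping the transitions in one table makes it easy to review the flow with stakeholders and to adjust it per incident type.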

Don’t forget that IT is there to support the business, so make sure you identify and consider various stakeholders' needs when designing the flows. Otherwise, you will end up with great flows on paper, but you will see people ignoring the process and doing things their own way. A good idea here is to set up a service management office (SMO) with broad representation that participates in the design and improvement of the practice.

Roles and responsibilities

Define who is accountable, responsible, consulted, and informed using the RACI matrix, a simple tool that is often overlooked but highly effective. The last thing you want in the middle of an outage is to see teams fighting about who should be doing what. Consider these essential roles: 

  • Incident manager (owns the process)

  • Incident coordinator (manages major incidents)

  • Technical leads (provide expertise)

  • Communications manager (handles stakeholder updates)
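A RACI matrix can also be kept as plain data and checked automatically. The sketch below assumes two hypothetical activities and hypothetical role assignments; the only rules it enforces are the common conventions that each activity has exactly one Accountable role and at least one Responsible role.

```python
# Activities and role assignments below are hypothetical examples.
RACI = {
    "Declare major incident": {
        "Incident manager": "A",
        "Incident coordinator": "R",
        "Technical leads": "C",
        "Communications manager": "I",
    },
    "Send stakeholder updates": {
        "Incident manager": "A",
        "Communications manager": "R",
        "Incident coordinator": "C",
        "Technical leads": "C",
    },
}


def validate_raci(matrix: dict[str, dict[str, str]]) -> list[str]:
    """Return a list of problems; an empty list means the matrix is consistent."""
    problems = []
    for activity, assignments in matrix.items():
        codes = list(assignments.values())
        if codes.count("A") != 1:
            problems.append(f"{activity}: needs exactly one Accountable role")
        if codes.count("R") < 1:
            problems.append(f"{activity}: needs at least one Responsible role")
    return problems


print(validate_raci(RACI))
```

Running a check like this whenever the matrix changes helps catch activities that nobody owns before an outage exposes the gap.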

Categorization and prioritization

It is ideal to have a clear system for categorizing incidents (e.g., by affected service, technical component, etc.) and prioritizing them based on impact and urgency. Each company should have its own definitions for what constitutes high/medium/low impact or priority. For example, a high-impact incident could be considered a production system down, while high urgency could mean there is no available workaround. 

Modern tools like Freshservice can automatically assign roles based on incident type and even suggest the best resolver based on historical performance.

A common pitfall is defining business impact based solely on the number of affected users. Imagine two incidents: one affects a single user and a single printer, but that printer sits at the shipping gate, and trucks cannot leave the factory because paperwork is missing; the other leaves 10 users in the marketing department unable to print, with no real urgency. Ranked by user count, the marketing incident would come first, even though the shipping gate incident clearly has the greater business impact. Lastly, make sure you build a priority upgrade/downgrade mechanism so that an incorrectly set priority can be adjusted.
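One common way to combine impact and urgency is a lookup matrix. The sketch below is illustrative only: the three levels, the P1 to P4 labels, and every cell of `PRIORITY_MATRIX` are assumptions that each organization would replace with its own definitions. It applies the matrix to the two printer incidents from the example above.

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical 3x3 matrix: (impact, urgency) -> priority label, P1 = most urgent.
PRIORITY_MATRIX = {
    (Level.HIGH, Level.HIGH): "P1",
    (Level.HIGH, Level.MEDIUM): "P2",
    (Level.MEDIUM, Level.HIGH): "P2",
    (Level.HIGH, Level.LOW): "P3",
    (Level.MEDIUM, Level.MEDIUM): "P3",
    (Level.LOW, Level.HIGH): "P3",
    (Level.MEDIUM, Level.LOW): "P4",
    (Level.LOW, Level.MEDIUM): "P4",
    (Level.LOW, Level.LOW): "P4",
}


def priority(impact: Level, urgency: Level) -> str:
    """Look up the priority label for an impact/urgency pair."""
    return PRIORITY_MATRIX[(impact, urgency)]


# Shipping gate: one user, but trucks cannot leave and there is no workaround.
shipping_gate = priority(Level.HIGH, Level.HIGH)
# Marketing: ten users, but limited business effect and no real urgency.
marketing = priority(Level.LOW, Level.LOW)
print(shipping_gate, marketing)
```

Note that the number of affected users does not appear anywhere in the lookup; impact is judged by business effect, which is the point of the example.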

Escalation paths

Define clear escalation paths and criteria for incidents that cannot be resolved within the specified timeframes or that are more complex. Consider both functional escalation (from L1 to L2 or L3) and hierarchical escalation if a higher authority is required. Tools like Freshservice can help with automating escalations.
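To make the two escalation types concrete, here is a minimal sketch of a time-based escalation rule. The time limits in `ESCALATE_AFTER`, the three support levels, and the `next_step` function are all hypothetical; a real ITSM tool would express the same idea through its own automation rules.

```python
from datetime import datetime, timedelta

# Hypothetical time limits per priority before escalation is triggered.
ESCALATE_AFTER = {
    "P1": timedelta(minutes=30),
    "P2": timedelta(hours=2),
    "P3": timedelta(hours=8),
    "P4": timedelta(hours=24),
}
SUPPORT_LEVELS = ["L1", "L2", "L3"]


def next_step(priority: str, level: str, assigned_at: datetime, now: datetime) -> str:
    """Decide what to do with an unresolved incident at time `now`."""
    if now - assigned_at < ESCALATE_AFTER[priority]:
        return f"stay at {level}"
    index = SUPPORT_LEVELS.index(level)
    if index + 1 < len(SUPPORT_LEVELS):
        return f"functional escalation to {SUPPORT_LEVELS[index + 1]}"
    # No higher support level exists, so involve management instead.
    return "hierarchical escalation to incident manager"


start = datetime(2024, 1, 1, 9, 0)
print(next_step("P1", "L1", start, start + timedelta(minutes=45)))
```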

Communication strategy

This should include communications to internal teams, updates to users and employees, as well as reporting to senior management. Sometimes, depending on the business impact, crisis communication mechanisms need to be put in place, so make sure you link to them. 

Real-time communication is essential. Set up incident war rooms for critical incidents (Teams/Slack, message boards, mails, etc.) and make sure the incident record is constantly updated so that everyone knows what has been done, what is being done, and what the plan is. This helps reinforce trust in the resolution team. There are solutions that offer out-of-the-box integrations with Slack/MS Teams for real-time agent notification about ticket updates. 

Lessons learned

After each incident, and especially after each critical incident, a post-incident review (PIR) is essential. Do not confuse this with a root cause analysis; the PIR focuses on what could have been done better during the incident (e.g., a contact number was wrong and a team could not be engaged, the flow was not followed, the initial priority definition was incorrect). Tools like Freshservice can use AI to automate PIRs.

Continual improvement

Once the practice has been implemented, it will require continual improvement and tweaking. Everything around us evolves, and the IM practice needs to keep up. The ITIL continual improvement model is a good start. Consider giving the service management office (SMO) ownership of this work: using PIRs, feedback loops, analytics data, and technology updates, the SMO can ensure that the practice keeps up with the ever-evolving organizational landscape.

ITIL continual improvement model


Selecting the right tool for the job

A critical enabler for an incident management practice is having the right tools for the job. The right tool can speed up incident resolution, enhance communication, and provide actionable insights. 

Key features to look for include the following:

  • Overall IT service management (ITSM) capabilities: A single repository for all incidents, problems, changes, requests, and configuration items (CIs) prevents silos and provides a holistic view that is useful when trying to resolve incidents. Access to information like which CIs are affected, when a CI was changed, what similar incidents have been reported in the past, and what problems have been linked to this incident can provide a treasure trove of information to resolution teams. Ideally, look for a tool that goes beyond basic ticketing capabilities.

  • Knowledge bases (KBs): Resolvers need quick and easy access to relevant knowledge articles in order to troubleshoot and resolve incidents in the shortest time frame possible. Integration between the ITSM and KB tooling is critical; otherwise, teams will waste time searching. Top-tier tools go one step further than mere integration and can use AI or other methods to suggest the most relevant articles or even help with writing articles. When dealing with hundreds of incidents daily, the time and productivity savings add up quickly.

  • Workflow automation: Routine tasks like ticket creation or closure, incident assignment, automatic notifications, and automatic escalations using a predefined set of criteria are ideally handled by the tooling you select. This capability will reduce manual efforts and ultimately speed up incident resolution. 

  • Monitoring and alerting: What’s better than fixing an incident quickly? Stopping it from happening in the first place. And when an incident does occur, the best way to minimize the time needed to restore service is to be alerted before users start reporting it. Monitoring and alerting tools can draw the attention of the support teams and significantly improve incident prevention or resolution by analyzing and grouping events. When selecting a tool for handling incident management, keep an eye out for built-in monitoring and alerting capabilities or integrations with tools that provide them.

  • Reporting and dashboarding: Insights are crucial for identifying trends and bottlenecks, measuring performance, and driving continual improvement. These capabilities allow you to make informed decisions and increase the chances of making the right decision at the optimal moment.

  • Communication and collaboration features: Built-in communication channels (e.g. individual or group chats, email integrations) allow teams to collaborate effectively and ease communication with stakeholders. A tool that allows you to use all the relevant communication channels is paramount. 

  • Flexible integration capabilities: Depending on the size of the organization, a bundle of different tools can be used in various parts of the organization. When selecting the right tool for incident management, you should also keep an eye out for broader integration capabilities. Integrations with chat tools, live translation services, vendor ticketing tools, or other enterprise systems can simplify the workload for resolution teams. 

  • Self-service and mobile accessibility: Empowering users with the ability to report incidents anytime and from any device improves employee satisfaction and reduces the load on the service desk. Resolution teams can also benefit from being able to update incidents from anywhere.
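The event grouping mentioned under monitoring and alerting can be sketched in a few lines. This is a simplified illustration, not how any particular monitoring product works: it assumes each event is a `(timestamp, ci)` pair and groups events for the same configuration item that arrive within a hypothetical five-minute window of the previous one, so a burst of alerts becomes a single candidate incident.

```python
from datetime import datetime, timedelta


def group_events(events, window=timedelta(minutes=5)):
    """Group (timestamp, ci) events: same CI within `window` of the group's last event."""
    groups = []
    open_groups = {}  # ci -> most recent group for that CI
    for ts, ci in sorted(events):
        group = open_groups.get(ci)
        if group is not None and ts - group["last_seen"] <= window:
            # Same CI, still inside the window: fold into the existing group.
            group["count"] += 1
            group["last_seen"] = ts
        else:
            # New CI, or the previous burst is over: start a new group.
            group = {"ci": ci, "first_seen": ts, "last_seen": ts, "count": 1}
            groups.append(group)
            open_groups[ci] = group
    return groups


base = datetime(2024, 1, 1, 9, 0)
events = [
    (base, "db-01"),
    (base + timedelta(minutes=1), "db-01"),
    (base + timedelta(minutes=2), "web-01"),
    (base + timedelta(minutes=20), "db-01"),
]
for g in group_events(events):
    print(g["ci"], g["count"])
```

The window size is the main tuning knob: too short and one outage floods the queue with tickets, too long and unrelated failures get merged.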

People aspects

While the right processes and tools are fundamental, in the end, everything depends on the humans involved. We work across different cultures, industries, and markets with different social and political factors. There are multiple types of stakeholders involved in every incident, so recognizing these roles and providing them the right support is key to effective incident resolution.

People are the glue that holds everything together


The design of the incident management practice and the selection of tooling should take into account the needs and expectations of each category:

  • Users/employees are the reason you are in business; they purchase and consume the IT services, so their perspective is critical. Empowering them via the right self-service capabilities and providing clear, concise and timely communication helps build trust and manage employee dissatisfaction.

  • Practitioners (IT staff) are all the internal IT teams that are involved in incident resolution, such as service desk personnel, technical resolution teams, and incident managers. Their focus is the technical resolution, but all the work follows the workflows mapped in the tooling and relies on how quickly they can access information. 

  • IT leadership provides direction to the organization, but they require the right mix of reports and dashboards to empower them to make informed decisions. 

  • Third-party vendors are relied upon by many organizations to provide critical services (e.g., cloud providers, software vendors, support). When an incident requires involving a supplier, you need to ensure that you have the right framework in place: 

    • Contracts and agreements: Defining ways of working and responsibilities of each party, in advance, streamlines incident resolution. This includes aligning incident workflows and tooling. 

    • Communication channels: Knowing who to contact at the right time. 

Apart from recognizing each stakeholder group and their specific needs, you also need to foster the right culture and mindset inside your support organization. This includes encouraging a focus on value among all support staff; everything they do, whether operational work or projects, needs to tie back to value for the business.

Measuring success

This part might seem straightforward: Get some common metrics, and that’s it. But it’s not that simple or easy. Effective measurement is an art. Let’s explore why. 

Usually, when managing staff, processes, tools, or services, organizations set targets and rely on measurements to tell them if they are heading in the right direction or not. But this misses a crucial component: the human component in dealing with objectives. This disconnect can lead to the dreaded watermelon effect, where we see metrics as green even though from the user perspective, they are red. 

Watermelon effect of monitoring metrics


Goodhart’s Law states that “when a measure becomes a target, it ceases to be a good measure” and is a good indicator of why this happens. 

Human nature is like this: Once targets are set, people will find ways to keep the metrics green, while in reality, the users don’t see it the same way. For example, everyone has seen cases where an agent assigns a ticket back to a user just to stop the SLA clock and keep the target green. 

So, how can you ensure that you have the right measurements in place? Like in the incident definition, there is no one-size-fits-all solution. You have to experiment and see what works best. Ideally, you need to measure the same aspect from several perspectives to make sure you have the right view on it. 
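Measuring the same aspect from two perspectives can be as simple as computing resolution time twice: once as the SLA clock sees it (excluding paused time) and once as the user experiences it (wall-clock time). The sketch below is an illustration under stated assumptions: the `resolution_times` function, the pause intervals, and the four-hour target are all hypothetical.

```python
from datetime import datetime, timedelta


def resolution_times(opened, resolved, pauses):
    """Return (sla_clock, user_wait) for one incident.

    `pauses` is a list of (start, end) intervals during which the SLA clock
    was stopped, e.g. while the ticket sat in a "waiting on user" status.
    """
    user_wait = resolved - opened
    paused = sum((end - start for start, end in pauses), timedelta())
    return user_wait - paused, user_wait


opened = datetime(2024, 1, 1, 9, 0)
resolved = datetime(2024, 1, 1, 17, 0)
pauses = [(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 16, 0))]

sla_clock, user_wait = resolution_times(opened, resolved, pauses)
target = timedelta(hours=4)  # hypothetical resolution target
print("SLA view:", "green" if sla_clock <= target else "red")
print("User view:", "green" if user_wait <= target else "red")
```

In this made-up case the SLA clock shows two hours against a four-hour target while the user waited eight hours, which is the watermelon effect in miniature. Reporting both numbers side by side makes the gap visible.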

There are three basic questions that you always need to ask yourself:  

  • What needs to be measured?

  • Why do we measure it? 

  • How do we measure it? 

The ultimate goal of measuring is not to report numbers but to drive informed decisions. 

Having said this, there are key metrics employed widely in the industry that can be used as a starting point for measuring the performance of your resolution teams (keeping in mind Goodhart’s Law and the three questions above): 

  • SLA adherence: The percentage of incidents resolved within the SLA timeframe

  • Mean time to restore (MTTR): The average time from when an incident is logged until service is restored

  • First contact resolution (FCR): The percentage of incidents resolved at first contact

  • Resolved at Level 1 (RAL1): The percentage of incidents resolved at Level 1

  • Employee satisfaction (ESAT): The percentage of users who are satisfied with the services

  • Average first response time (AFRT): The average time needed to pick up tickets
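Most of the metrics above can be derived from a handful of fields on each incident record. The following sketch assumes a hypothetical `Incident` record and field names; real tools expose similar data under their own schemas. It also assumes MTTR is measured from opening to resolution and that FCR means exactly one contact. ESAT is left out because it comes from surveys rather than ticket data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    opened: datetime
    first_response: datetime
    resolved: datetime
    sla_target: timedelta
    contacts: int          # interactions needed before resolution
    resolved_level: str    # "L1", "L2", ...


def summarize(incidents):
    """Compute SLA adherence, MTTR, FCR, RAL1, and AFRT for a non-empty list."""
    n = len(incidents)
    restore = [i.resolved - i.opened for i in incidents]
    respond = [i.first_response - i.opened for i in incidents]
    return {
        "sla_adherence_pct": 100 * sum(r <= i.sla_target for r, i in zip(restore, incidents)) / n,
        "mttr": sum(restore, timedelta()) / n,
        "fcr_pct": 100 * sum(i.contacts == 1 for i in incidents) / n,
        "ral1_pct": 100 * sum(i.resolved_level == "L1" for i in incidents) / n,
        "afrt": sum(respond, timedelta()) / n,
    }


t0 = datetime(2024, 1, 1, 9, 0)
sample = [
    Incident(t0, t0 + timedelta(minutes=10), t0 + timedelta(hours=1), timedelta(hours=4), 1, "L1"),
    Incident(t0, t0 + timedelta(minutes=30), t0 + timedelta(hours=5), timedelta(hours=4), 3, "L2"),
]
print(summarize(sample))
```

Averages hide outliers, so in practice it is worth looking at medians or percentiles alongside these figures before drawing conclusions.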

For more best practice ideas around metrics, have a look at this benchmarking report.


Conclusion

An effective incident management practice minimizes business disruption, preserves reputation, protects revenue streams, and enhances employee satisfaction. It is a measure of the organization’s ability to withstand the disruptive effects of IT outages. 

Ready to stop simply reacting to incidents and start leading the charge towards better IT resilience? It all begins with a fresh look at your current processes and how they stack up against your actual business goals.

Explore how Freshservice, with its intuitive AI and robust automation, can help you move beyond just fixing problems to truly achieving strategic service excellence.