How to avoid performance drop in ML-based production systems

Welcome to this blog series on machine learning operations (MLOps), dedicated to maintaining ML performance in a production system. The series covers the various ML operations a productionised ML-based system needs in order to maintain its performance. In Part I, we discuss the possible reasons why ML performance drops in a live system and how to systematically overcome them. In subsequent posts, we will discuss how we designed the production ML workflow of Virtual Agent, incorporating these concepts while serving live customers.

Does your ML system’s performance fall when it goes live?

Oftentimes, our machine learning models and systems work very well in development environments but fail to maintain that level of performance once they are integrated with live production systems.

This is mainly due to two reasons: 

  1. The evaluation dataset used during the development phase was not diverse enough or did not generalize to the overall unseen population.
  2. Assumptions made about the data and domain concepts have changed in the live system over time.

The first scenario relates to model development quality: ML engineers or data scientists need to perform various data quality checks to avoid it. These checks are run on any dataset before it is used for training or evaluation, and they are needed for a reliable and complete model-building process.
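As a rough illustration of what such checks can look like (a minimal sketch with hypothetical column names, not our actual pipeline), a few lightweight pandas checks can flag duplicates, missing labels, and heavy class imbalance before a dataset is accepted for training or evaluation:

```python
import pandas as pd

def basic_data_checks(df: pd.DataFrame, text_col: str = "query", label_col: str = "intent") -> list:
    """Flag common data quality issues before a dataset is used for training or evaluation.
    Column names here are illustrative; adapt them to your own schema."""
    issues = []

    # Exact duplicate queries can leak between training and evaluation splits.
    n_dupes = int(df.duplicated(subset=[text_col]).sum())
    if n_dupes:
        issues.append(f"{n_dupes} duplicate queries")

    # Rows with missing text or labels are unusable for supervised training.
    n_missing = int(df[[text_col, label_col]].isna().any(axis=1).sum())
    if n_missing:
        issues.append(f"{n_missing} rows with missing text or label")

    # A heavily skewed label distribution suggests the data is not representative.
    label_share = df[label_col].value_counts(normalize=True)
    if not label_share.empty and label_share.iloc[0] > 0.5:
        issues.append(f"label '{label_share.index[0]}' covers {label_share.iloc[0]:.0%} of the data")

    return issues
```

Any flagged issue would send the dataset back for cleaning or augmentation before model building proceeds.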

On the other hand, the second scenario relates to the dynamic nature of the production system. This dynamism may come in the form of new users, new customer account domains, new product requirements, and changing user expectations of the system. This is especially the case in products such as Virtual Agent, where inputs to the system are completely unstructured natural-language queries. With no constraints on what users may ask of the system, it faces many concept and data drift situations.

How do we handle drifts in the Data Science workflow of Virtual Agent?

We’ve faced situations where models that were highly performant in the development and staging phases received negative feedback from users when first deployed in production. On inspecting the root causes, we identified a need to systematically divide the scope of monitoring, and then to feed additions and improvements back into the system based on users’ collective feedback.

Below is a screenshot of the drop in Virtual Agent’s helpfulness week over week: it started at 80-85% and fell to 65% after two months. This is the point at which the ML system triggers root cause analysis (RCA), followed by model updates.
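As a minimal sketch of how such a trigger can be wired up (the feedback fields and the 75% baseline are illustrative assumptions, not our production values), weekly helpfulness can be computed from positive and negative feedback counts, and any week that falls below the baseline is flagged for RCA:

```python
from dataclasses import dataclass

@dataclass
class WeeklyFeedback:
    week: str
    positive: int   # responses users marked as helpful
    negative: int   # responses users marked as not helpful

def weeks_needing_rca(history: list, baseline: float = 0.75) -> list:
    """Return the weeks whose helpfulness rate dropped below the baseline.
    The 0.75 baseline is illustrative; in practice it would be tuned per deployment."""
    flagged = []
    for wk in history:
        total = wk.positive + wk.negative
        helpfulness = wk.positive / total if total else 0.0
        if helpfulness < baseline:
            flagged.append(wk.week)
    return flagged

# Example: helpfulness sliding from ~83% to 65% over roughly two months.
history = [
    WeeklyFeedback("W1", positive=83, negative=17),
    WeeklyFeedback("W5", positive=78, negative=22),
    WeeklyFeedback("W9", positive=65, negative=35),
]
print(weeks_needing_rca(history))  # ['W9'] with the default baseline
```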

Below are the two major scenarios that helped us analyse the root causes of the performance drop.

Example scenario 1:

VA context – Trained on IT conversations related to access and troubleshooting
User’s context – Needs to create a service request using the catalog item “ask your HR”
User query – ask your HR
VA response – I did not understand you, please raise a ticket.
User feedback – Negative

Example scenario 2:

VA context – Trained on well-formed natural-language queries
User’s context – Uses VA as a search engine by sending keywords
User query – chargebee
VA response – I did not understand you, please raise a ticket.
User feedback – Negative

The above examples reflect scenarios where an ML system has to continuously monitor, learn, and adapt to its users’ behavior. To continuously support our customers, we divide our processes into atomic, modular components across the various phases of the ML workflow.

Our pre-model deployment phase: In the first phase of the model-building cycle, we started by collecting business requirements and transforming them into technical feature requirements. This was followed by data exploration and mining to identify the various domains, skills, and labels that could be mapped to our system’s (Virtual Agent’s) requirements.

The data pipeline step started with selecting the right training data based on three key properties – diversity, representativeness, and unbiasedness. The selected data was then transformed into a format accepted by the training module.
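A hedged sketch of what checking representativeness and preparing the training format might look like (the column names, the traffic sample, and the stratified split are assumptions for illustration, not our actual pipeline):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def representativeness_gap(train_df: pd.DataFrame, traffic_df: pd.DataFrame, label_col: str = "intent") -> float:
    """Total variation distance between the label distribution of candidate training data
    and a sample of recent production traffic (0 = identical, 1 = disjoint)."""
    p = train_df[label_col].value_counts(normalize=True)
    q = traffic_df[label_col].value_counts(normalize=True)
    labels = p.index.union(q.index)
    return 0.5 * sum(abs(p.get(label, 0.0) - q.get(label, 0.0)) for label in labels)

def to_training_format(df: pd.DataFrame, text_col: str = "query", label_col: str = "intent"):
    """Stratified train/validation split so both splits keep the selected label balance."""
    return train_test_split(
        df[[text_col, label_col]], test_size=0.2,
        stratify=df[label_col], random_state=42,
    )
```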

Once a model was built and selected based on multiple developer and business metrics, it was deployed in staging, verified in an integrated system, and then deployed in production systems to make it available to customers.

Model Monitoring in production

We started by collecting user feedback on Virtual Agent’s predictions and evaluating the helpfulness of the system, since we needed to track our production model’s performance on live streaming data. We divided the monitoring task into two paths based on the kind of problem and how frequently the cycle repeats.

1. Handling Model Drift: A frequently occurring periodic process

An ML model is only as good as the data it is trained on. In the earlier phases of our ML system (especially before and just after the pilot phase), models were built with limited representative data. As new data and users became part of the system, we observed multiple query variations and domains crop up. The model could not generalize to these unseen patterns, resulting in a drop in model performance.

During Virtual Agent’s first model-building phase, models were trained on structured, well-formed queries, but at serving time, short and under-specified queries produced unexpected outcomes. To keep up with these new query patterns and maintain the ML system’s performance and helpfulness, the system needed to identify them and feed them into the retraining module periodically.
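One simple way to catch this kind of drift (a sketch, not our exact implementation) is to compare a per-query statistic such as token count or model confidence between a training-time reference window and a recent serving window, and to queue low-confidence queries for labeling and the next retraining run:

```python
from scipy.stats import ks_2samp

def detect_drift(reference: list, live: list, p_threshold: float = 0.01) -> bool:
    """Two-sample Kolmogorov-Smirnov test on a per-query statistic
    (e.g., token count or model confidence). A small p-value means the live
    distribution differs significantly from the training-time reference."""
    result = ks_2samp(reference, live)
    return result.pvalue < p_threshold

def retraining_candidates(queries, confidences, min_confidence: float = 0.5):
    """Queue low-confidence queries (e.g., short keyword searches like 'chargebee')
    for labeling and the next periodic retraining run. The threshold is illustrative."""
    return [q for q, c in zip(queries, confidences) if c < min_confidence]

# Reference: token counts of training queries; live: token counts of last week's traffic.
ref_lengths = [8.0, 9.0, 7.0, 10.0, 8.0, 9.0]
live_lengths = [1.0, 2.0, 1.0, 3.0, 2.0, 1.0]
if detect_drift(ref_lengths, live_lengths):
    print("Query-length distribution drifted; trigger the shorter retraining cycle.")
```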

In the diagram below, this task is illustrated by the shorter cycle, which starts at the “Model Monitoring” component and ends at the “Model Training” component. This cycle is usually less time-consuming (after root cause analysis) and is activated periodically, whenever drifts are identified in the ML system.

2. Handling Concept Drift: A comparatively rare, long, and more time-consuming process

Since we operate in the ITSM domain, the majority of our requests are IT-based. With that as the priority, we shipped our first model trained on the IT domain. Over time, however, users sent queries the system did not recognize. On analyzing the root cause, we observed a domain drift in our query distribution: in some cases, the domains users drew from differed from the definitions and assumptions behind the data we had used to build the model.

To address this, we carried out a thorough analysis of historical data to identify the next major domain to add to the system and increase its coverage. We also added negative samples to avoid false positives. After revisiting the initial data design, merging in the new domain skills, and retraining our models, we were finally able to accommodate these changes in the system.
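As a rough sketch of the kind of historical analysis described above (TF-IDF and k-means here are stand-ins, not necessarily what we use), queries the model could not handle can be clustered so that large, coherent clusters point to a candidate domain – such as HR – to add next:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def candidate_domains(unrecognized_queries: list, n_clusters: int = 5, top_terms: int = 3):
    """Cluster unhandled queries and summarize each cluster by its highest-weighted terms;
    large, coherent clusters hint at a missing domain. Assumes there are at least
    n_clusters queries to cluster."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(unrecognized_queries)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(X)

    terms = vectorizer.get_feature_names_out()
    summaries = []
    for center in km.cluster_centers_:
        top = center.argsort()[::-1][:top_terms]
        summaries.append([terms[i] for i in top])
    return summaries  # e.g., [['hr', 'leave', 'payroll'], ...]
```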

In the diagram below, this process is described by the longer cycle, which starts at the “Model Monitoring” component and ends at the “Data Science” component. This cycle is usually more time-consuming and is activated rarely, typically once the model has become stable in its query coverage.

In this blog, we focused on a few challenges we faced and overcame after deploying ML models in a live production system integrated with Freshservice. Want to know more about our ML processes and technical stack? Watch this space for the next blog, where we explain in detail how these processes are connected and work together in a real ML production system (Virtual Agent).