How we modeled our ML systems to achieve higher performance

Primary author: Vini Dixit was a Staff Data Scientist in the Virtual Agent (Freddy) team at Freshworks. She’s passionate about solving AI problems, building reusable ML systems, and documenting her learnings. She’s also the author of the first part of this blog series, which talks about avoiding performance drops in ML-based production systems. When she’s not coding, Vini usually spends her time cycling and trekking. You can reach out to her on LinkedIn or mail her at: vini01.dixit@gmail.com


Our previous blog dealt with the challenges we face as machine learning (ML) engineers and data scientists after deploying our models in production. This blog elaborates on our processes and looks at our system at a deeper level, showing how we achieve automation and scale across model building, continuous model updates, and monitoring in a live, dynamic ML production system.

Here is a snapshot of the Virtual Agent product workflow for data science processes. It is an in-depth flow diagram of all the major interacting components: building ML models, serving live users in the system, collecting feedback, continuous monitoring, and updating models.

[Figure: Virtual Agent ML systems workflow]

Each continuous process is described as a sequential phase; our overall workflow has a total of five phases (including the training loop).

A typical ML lifecycle, in its initial and mostly offline phase, consists of data collection and preparation, model training, model selection, and deployment. Once the model goes live, a few more steps get added, such as loading it on the appropriate infrastructure/machine/host so users can consume it; the model then predicts labels for new incoming queries. We log these predictions to observe ML performance.

On top of these steps, we added a few more to achieve continuous monitoring and to update our models automatically and at scale. These steps collect new and corrected labels through an integration with Label Studio (explained in a section below), which are stored in Baikal (the Freshworks data lake) tables. This data is then used to monitor drift points (described in part 1) through a Zeppelin dashboard, and finally the trainable data is used to retrain the models. We'll explain these steps in detail below.

Phase 1 – ETL data, model building, and version control

Model building in an ML lifecycle is driven by ML experiments. We built more than 100 ML experiments, with more than 150 runs on average per experiment, to find the best possible combination of data, hyperparameter configuration, model, and metrics while considering 170 evaluation criteria.

To scale this model-building process, we built big data pipelines for ETL (data extraction, transformation, and loading) integrated with the Baikal data lake. We use the Databricks ML platform to run our experiments and build the models, which lets us use high-end AWS compute machines for fast model building. The Databricks system is integrated with dedicated data science S3 buckets that store our ML experiments so we can reproduce them in the future.
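For illustration, here is a minimal sketch of what one such ETL step can look like in a Databricks/PySpark notebook. The table names (`baikal.va_user_queries`, `ds_experiments.va_training_snapshot`) and columns are hypothetical placeholders, not our real Baikal schemas.

```python
# A minimal ETL sketch, assuming a Databricks/PySpark environment; table and
# column names are hypothetical placeholders, not the real Baikal schemas.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("va-etl-sketch").getOrCreate()

# Extract: raw user queries landed in the data lake.
raw = spark.table("baikal.va_user_queries")

# Transform: basic cleaning and de-duplication before model building.
training_data = (
    raw.filter(F.col("query_text").isNotNull())
       .withColumn("query_text", F.lower(F.trim(F.col("query_text"))))
       .dropDuplicates(["query_text"])
       .select("query_id", "query_text", "label")
)

# Load: persist a snapshot as a Delta table so the experiment can be reproduced.
(training_data.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("ds_experiments.va_training_snapshot"))
```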

While building ML experiments at scale, it was essential to version each experiment's input variables and its output, i.e., the model and its metrics. It was often a challenge to reproduce the exact model hyperparameters, training data, and test data that had produced the optimal metrics. Especially after a few weeks or months, when we need to update an existing production model, it is critical to know its earlier training and evaluation parameters in order to extend it.

Below is a screenshot of Managed MLflow from the ML experiments in Virtual Agent, showing how each model is tied to an experiment_tag. Each experiment_tag is associated with training code, training data, test data, results, and models.

This is how the ML model lifecycle becomes completely reproducible at any point in time.

[Figure: Managed MLflow experiment tracking in Virtual Agent]

To version our ML experiments, we integrated MLflow to log experiment metrics and parameters, and we used Databricks Delta tables to version the data artifacts of each experiment in our model-building pipeline. This setup helped us reproduce the ML experiment environment in the form of data, code, model, results, and metrics.
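To make the versioning idea concrete, here is a hedged sketch of how a single run could log its experiment_tag, parameters, metrics, and model with MLflow alongside the Delta version of the data it trained on. The tag, parameter, and metric names and the scikit-learn model are illustrative assumptions, not our exact production code.

```python
# A hedged sketch of per-run versioning with MLflow; names and the model
# choice are illustrative assumptions.
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score


def train_and_log(X_train, y_train, X_test, y_test, experiment_tag, data_version):
    with mlflow.start_run():
        # Tie the run to an experiment_tag and the Delta version of its
        # training snapshot so data, code, model, and metrics stay linked.
        mlflow.set_tag("experiment_tag", experiment_tag)
        mlflow.log_param("data_version", data_version)
        mlflow.log_param("C", 1.0)

        model = LogisticRegression(C=1.0, max_iter=1000)
        model.fit(X_train, y_train)

        f1 = f1_score(y_test, model.predict(X_test), average="weighted")
        mlflow.log_metric("weighted_f1", f1)

        # The serialized model is stored as a run artifact for later serving.
        mlflow.sklearn.log_model(model, artifact_path="model")
    return model
```

Because the Delta table tracks versions, the exact training snapshot referenced by `data_version` can later be read back with Delta time travel when the model needs to be extended.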

Phase 2 – Model serving, predictions, and customer feedback logging

After extensive ML experiments, configurations and parameters had to be chosen based on multiple selection criteria. These criteria include, but are not limited to, model development metrics, the precision/recall tradeoff at each ML model-serving component, accuracy vs. system latency, error tolerance, ranking and relevance quality, and deflection/helpfulness of the responses.

After testing through multiple iterations, the final model, chosen from the hundreds of models generated by our ML runs, was deployed using Amazon ECS.
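As a rough illustration of that selection step, the sketch below uses MLflow's run-search API to rank candidate runs by a quality metric and filter on a latency budget. The experiment name, metric names, and thresholds are assumptions for illustration, not our actual selection criteria.

```python
# A rough sketch of picking a final model out of many MLflow runs; the
# experiment name, metric names, and thresholds are assumptions.
import mlflow

runs = mlflow.search_runs(
    experiment_names=["virtual-agent-intent-model"],   # hypothetical
    filter_string="metrics.weighted_f1 > 0.8",
    order_by=["metrics.weighted_f1 DESC"],
)

# Keep candidates that also meet a latency budget, then take the best one.
candidates = runs[runs["metrics.p95_latency_ms"] < 200]
best_run_id = candidates.iloc[0]["run_id"]
best_model = mlflow.pyfunc.load_model(f"runs:/{best_run_id}/model")
```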

Virtual Agent is integrated with the Freshservice platform, which handles all user interface activity and passes it back to Virtual Agent. Freshservice collects the feedback users give on Virtual Agent's responses. We log this feedback, along with the system's predictions, in our database for further use.
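The sketch below shows the general shape of this prediction-and-feedback logging as a small Flask service. The real deployment is a containerized service on Amazon ECS, and the routes, payload fields, `model_predict` stub, and `log_to_db` helper here are hypothetical.

```python
# A minimal sketch of the prediction-and-feedback logging loop; routes,
# payload fields, and helpers are hypothetical placeholders.
from datetime import datetime, timezone

from flask import Flask, jsonify, request

app = Flask(__name__)


def model_predict(query: str):
    """Stand-in for the deployed intent model (assumed interface)."""
    return "reset_password", 0.92


def log_to_db(record: dict) -> None:
    """Placeholder for persisting a prediction/feedback record to the database."""
    print(record)


@app.route("/predict", methods=["POST"])
def predict():
    query = request.json["query"]
    label, confidence = model_predict(query)
    log_to_db({
        "event": "prediction",
        "query": query,
        "predicted_label": label,
        "confidence": confidence,
        "ts": datetime.now(timezone.utc).isoformat(),
    })
    return jsonify({"label": label, "confidence": confidence})


@app.route("/feedback", methods=["POST"])
def feedback():
    # Freshservice relays the user's reaction to a Virtual Agent response here.
    log_to_db({
        "event": "feedback",
        **request.json,
        "ts": datetime.now(timezone.utc).isoformat(),
    })
    return jsonify({"status": "ok"})
```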

Phase 3 – Extending feedback labels and corrections

To get detailed feedback on every model prediction, it was essential to add some new labels for continuous post-production analysis and deployments. These collective labels help in an in-depth evaluation of the system's current state, and from them we generated multiple useful performance metrics, such as helpfulness, coverage, preciseness, new-domain %, and irrelevant queries. We gather these labels continuously by integrating our system database table, via microservices, with Label Studio, an open-source labeling platform.
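As an illustration of that integration, here is a hedged sketch that pushes predictions pulled from a database extract into Label Studio as pre-annotated tasks using the label-studio-sdk package. The URL, API key, project id, row contents, and labeling-config names (intent/query) are placeholders rather than our production configuration.

```python
# A hedged sketch using the label-studio-sdk package; connection details,
# project id, and labeling-config names are placeholders.
from label_studio_sdk import Client

ls = Client(url="https://label-studio.example.internal", api_key="REDACTED")
project = ls.get_project(42)  # hypothetical project id

# Rows a microservice pulled from the system database table (illustrative).
rows = [
    {"query_id": 101, "query_text": "vpn not connecting", "predicted_label": "network_issue"},
]

# Each task carries the query plus the model's prediction as a pre-annotation,
# so reviewers only correct the label and add the extended ones.
project.import_tasks([
    {
        "data": {"query": r["query_text"], "query_id": r["query_id"]},
        "predictions": [{
            "result": [{
                "from_name": "intent",
                "to_name": "query",
                "type": "choices",
                "value": {"choices": [r["predicted_label"]]},
            }]
        }],
    }
    for r in rows
])
```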

Phase 4 & 5 – Monitoring drifts and model updates

After receiving the extended feedback with corrected expected labels, we analyze ML performance on production data. This analysis runs over continuous time windows to identify the exact point at which a change in data, user behavior, etc., occurred and metrics dropped.
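One simple way to express this windowed check, purely as a sketch, is to compare each window's accuracy on the corrected labels against a trailing baseline and flag the windows where the drop exceeds a tolerance. The column names, weekly resampling, trailing baseline, and 5-point tolerance below are illustrative assumptions rather than our exact monitoring logic.

```python
# A minimal sketch of windowed drift detection on corrected labels; the
# column names, window sizes, and tolerance are illustrative assumptions.
import pandas as pd


def find_drift_points(df: pd.DataFrame, tolerance: float = 0.05) -> pd.DataFrame:
    """df: one row per query with timestamp, predicted_label, corrected_label."""
    df = df.assign(correct=(df["predicted_label"] == df["corrected_label"]))
    weekly = (
        df.set_index("timestamp")
          .resample("W")["correct"]
          .mean()
          .rename("accuracy")
          .to_frame()
    )
    # Compare each week against the trailing 4-week baseline (previous weeks only).
    weekly["baseline"] = weekly["accuracy"].shift(1).rolling(4, min_periods=1).mean()
    weekly["is_drift_point"] = (weekly["baseline"] - weekly["accuracy"]) > tolerance
    return weekly[weekly["is_drift_point"]]
```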

If the required changes fall within the existing scope of the data and model, the new query patterns are quickly incorporated into the model. Otherwise, after data exploration and analysis, they are scheduled for the next model release based on business and product requirements and priorities.

Once these changes are identified technically, we include them in our training data or adjust the model design during the next model development phase.

Conclusion

Building models, monitoring them, and maintaining the performance of ML systems is not easy in production systems that serve critical customer accounts every day. We have seen added complexity in ML systems in the form of unending ML cycles, additional tasks and processes for data and model quality checks, constant monitoring of the system for drifts, and the sensitivity of models to “unseen” data, especially in the early stages. After incorporating these processes into our system, we raised ML performance by 30%. This increase took us a little above where we had started, and it helped keep the performance graph non-decreasing.

ML models are only as good as the data they get; their ability to represent the unseen population grows incrementally with time. There is always room for improvement in the initial phases of an ML system: that is when the most fluctuation is observed, between a model's first production launch and the point at which the system is thoroughly familiar with all the critical corner cases and can finally generalize. To scale these necessary operations, it is essential to carefully design each system component.

Contributing author: Geetanjali Gubbewad is a Senior Machine Learning Engineer building and scaling cool stuff at Freddy. Optimizing things, including her daily chores, is her hobby. She’s probably planning her next trip as you read this.