How we used NLP to get sentiment from emails for predicting deal scores
Email sentiment prediction is a well-known problem in machine learning. But generating business value from those emails requires equal focus on data engineering and on deploying the model in the context of the product it will serve. In this blog, we share our approach to the various stages of deploying an ML model in production. The following topics are covered:
- Modeling the problem statement
- Exploratory data analysis
- Feature preprocessing
- Model training
- Deploying the model
- Room for further improvement
Modeling the problem statement
The primary users of Freshsales are sales agents. To pitch the product, they reach out to points of contact at other businesses, inquire about their requirements, request a meeting for a demo, and so on. These contacts, who are potential customers, then study the product, gather their requirements, and reply with their findings. These conversations become a powerful signal of the direction in which a deal is heading and can help us predict the chances of winning or losing it.
Thus, our problem statement for business is to predict the probability of winning or losing the deal based on email conversations between agents and contacts.
Translating the problem statement for the ML model
To model this, we need to understand the following entities: agent, contact, deal, and email conversation.
What happens on the business side
The “Agent” gets to know about a company called ABC Corp and the email address of its HR. They add this information as a “Contact” on Freshsales (contact3). The agent then creates a “Deal” (deal3) on Freshsales and associates contact3 with deal3.
After creating the deal and associating it with the contact, they send an email, from either Freshsales or any other email client, asking to schedule a demo. contact3 sees the email, and after a few conversations, decides to purchase the plan.
Then, the agent marks the deal as won on their Freshsales account.
What happens at the backend
Contact3 is added to the contacts table with its email address. Deal3 gets created in the deals table with its current state, Open. Along with that, the contact-deal-association table registers deal3 as associated with contact3.
Then, when emails are exchanged, they are registered in the email-conversations table against the Contact, not against the Deal. When the deal closes, its state is changed from Open to Won.
Now that we have developed an understanding of this process, let’s define the objectives of our ML models. We trained two models. The first model (the L1 model) predicts the sentiment of a single email. The second model (the L2 model) takes the sentiments of a series of emails and predicts the probability of winning the deal.
How to get labeled data for training
There is no direct mapping between conversations and the related deal, but we can relate them by joining the email-conversations table with the contact-deal-association table. For closed deals, we then have the outcome, Won or Lost, which becomes our label.
The following table displays the data:
deal_id | conversation_id | text | speaker | label |
---|---|---|---|---|
1 | 1 | Does Wednesday 5 PM work for you? | agent | 0 |
1 | 2 | Sorry, I am out of office next week. | customer | 0 |
2 | 3 | Can you share the SSO link? | customer | 1 |
2 | 4 | We are still evaluating the product. | customer | 1 |
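The following is a minimal sketch of this join in pandas, using toy rows that mirror the table above; the table and column names are assumptions about the backend schema.

```python
import pandas as pd

# Toy stand-ins for the backend tables (hypothetical column names).
emails = pd.DataFrame({
    "conversation_id": [1, 2, 3, 4],
    "contact_id": [10, 10, 20, 20],
    "text": [
        "Does Wednesday 5 PM work for you?",
        "Sorry, I am out of office next week.",
        "Can you share the SSO link?",
        "We are still evaluating the product.",
    ],
    "speaker": ["agent", "customer", "customer", "customer"],
})
contact_deal_association = pd.DataFrame({"contact_id": [10, 20], "deal_id": [1, 2]})
deals = pd.DataFrame({"deal_id": [1, 2], "state": ["Lost", "Won"]})

# Join email-conversations -> contact-deal-association -> deals,
# keep only closed deals, and turn the outcome into the label.
labeled = (
    emails.merge(contact_deal_association, on="contact_id")
          .merge(deals, on="deal_id")
          .query("state in ['Won', 'Lost']")
          .assign(label=lambda df: (df["state"] == "Won").astype(int))
          [["deal_id", "conversation_id", "text", "speaker", "label"]]
)
print(labeled)
```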
Exploratory data analysis (EDA)
Language
We tried to understand the language distribution across accounts to decide whether we needed a multilingual language model. For each email, we get the probability of the top language using the langdetect module. Then, for each language, we average these probabilities across all the emails, grouped by account.
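The following is a rough sketch of this analysis using langdetect, with toy emails and hypothetical account ids.

```python
import pandas as pd
from langdetect import DetectorFactory, detect_langs

DetectorFactory.seed = 0  # make langdetect deterministic

# Toy emails with hypothetical account ids.
emails = pd.DataFrame({
    "account_id": [1, 1, 2],
    "text": [
        "Can we schedule a demo next week?",
        "Sure, no problem, Wednesday works for us.",
        "Perfeito, seguiremos assim.",
    ],
})

def language_probs(text):
    # Probability of each detected language for one email.
    return {lang.lang: lang.prob for lang in detect_langs(text)}

probs = emails["text"].apply(language_probs).apply(pd.Series).fillna(0.0)

# Average per-language probability across all the emails of an account.
per_account = probs.groupby(emails["account_id"]).mean()
print(per_account)
```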
- Out of 1166 accounts, 359 accounts have an average English probability of less than 0.9.
- These accounts contribute to 20% of emails and 25% of deals.
- The distribution of languages in these 359 accounts is as shown in the following plot:
This analysis suggests that a multilingual model should be preferred.
Length of sentences
The following table displays how some of these emails look:
Serial no. | Text | n_words |
---|---|---|
0 | Sounds good | 2 |
1 | Perfeito seguiremos assim | 3 |
2 | Sure no problem | 3 |
3 | Perfect | 1 |
5 | Can we schedule a demo for a senior team member that was unable to attend our first meeting she is available friday from am pm eastern time and is cc would on this email | 34 |
6 | I know we signed up week ago however we have really not had time to test it out fully we started testing days ago fully would it possible to extend the trial for us again | 35 |
What should be the maximum token length for training the model? To decide this, we look at statistics on the number of words in emails. 90% of emails have fewer than 170 words, and 95% have fewer than 300 words.
After removing the top 5% as outliers, the average number of words is 54. All the statistics are similar for both classes.
Another interesting thing to note is that the average number of words from customer emails is 28, whereas, for agents, it is 48.
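The following is a small sketch of how these statistics can be computed, reusing the labeled DataFrame from the earlier join sketch (any per-email DataFrame with text and speaker columns would do).

```python
# Word counts per email, computed on the `labeled` DataFrame built earlier.
n_words = labeled["text"].str.split().str.len()

# Percentiles used to pick a maximum sequence length for the model.
print(n_words.quantile([0.90, 0.95]))

# Average length after trimming the top 5% of emails as outliers.
print(n_words[n_words <= n_words.quantile(0.95)].mean())

# Average length split by sender (customer vs. agent).
print(n_words.groupby(labeled["speaker"]).mean())
```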
Some numbers related to the dataset are illustrated in the following table.
Conversations with label 1 | 660K |
Conversations with label 0 | 1246K |
Deals with label 1 | 85.8K |
Deals with label 0 | 120K |
Number of accounts | 1166 |
Data cleaning, feature preprocessing, feature engineering
The win rate of an account is the ratio of won deals to its total number of deals. We removed data from accounts with a win rate > 0.9 or < 0.1.
We also removed emails with fewer than 4 words; most of these consist of phrases like “approved,” “okay, thanks,” etc.
Each deal must have at least one email from a customer, and for every deal we consider emails only up to the last customer email. Since emails from agents usually tend to sound more positive, we found that including all the agents’ emails inflates the predicted probability of winning the deal. Hence, agent emails sent after the last customer email are not considered.
To clean the email text, we use an in-house BERT-based module that identifies the parts of emails containing signatures, salutations, and disclaimer text and removes them.
We added a token denoting the email’s sender: __cust__ if the sender is a customer and __agent__ if it is an agent. Every email is prepended with the corresponding token.
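The following is a condensed sketch of these filtering and tagging steps, assuming a per-email DataFrame with the column names shown in the docstring; the in-house signature/salutation stripping is omitted.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """One row per email with (assumed) columns:
    account_id, deal_id, sent_at, speaker ('agent'/'customer'), text, label."""
    # Drop accounts with extreme win rates (> 0.9 or < 0.1), computed per deal.
    deal_level = df.drop_duplicates("deal_id")
    win_rate = deal_level.groupby("account_id")["label"].mean()
    keep = win_rate[(win_rate >= 0.1) & (win_rate <= 0.9)].index
    df = df[df["account_id"].isin(keep)]

    # Drop very short emails (fewer than 4 words).
    df = df[df["text"].str.split().str.len() >= 4]

    # For every deal, keep emails only up to the last customer email;
    # deals without any customer email are dropped entirely.
    def up_to_last_customer(group: pd.DataFrame) -> pd.DataFrame:
        group = group.sort_values("sent_at")
        customer_pos = (group["speaker"] == "customer").to_numpy().nonzero()[0]
        if len(customer_pos) == 0:
            return group.iloc[0:0]
        return group.iloc[: customer_pos[-1] + 1]

    df = df.groupby("deal_id", group_keys=False).apply(up_to_last_customer)

    # Prepend a token identifying the sender of each email.
    sender_token = df["speaker"].map({"customer": "__cust__", "agent": "__agent__"})
    return df.assign(text=sender_token + " " + df["text"])
```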
We also tried adding the account ID before the text, assuming that the model would learn to take the account into context when predicting sentiment. But this experiment did not improve the score much.
Model training
For training the conversation-level sentiment model, we used the Trainer API. This allowed us to easily experiment with various pre-trained NLP models, configure TensorBoard for monitoring, and use MLflow for logging parameters and metrics. We also benefited from features like checkpoint saving, evaluation every few steps, and early stopping.
Most importantly, we used multi-GPU training, which the API handles internally via data parallelism.
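The following is a minimal sketch of such a training setup, assuming the Trainer API here refers to Hugging Face Transformers. The toy dataset and several hyperparameters (eval/save steps, early-stopping patience) are hypothetical; epochs, batch size, and sequence length mirror the numbers reported below.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

checkpoint = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Toy stand-in for the prepared conversation-level dataset.
raw = Dataset.from_dict({
    "text": ["__cust__ Can you share the SSO link?",
             "__agent__ Does Wednesday 5 PM work for you?"],
    "label": [1, 0],
})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
)

args = TrainingArguments(
    output_dir="sentiment-l1",
    num_train_epochs=6,
    per_device_train_batch_size=128,
    evaluation_strategy="steps",      # evaluate every `eval_steps`
    eval_steps=500,                   # hypothetical
    save_steps=500,
    load_best_model_at_end=True,      # required by early stopping
    metric_for_best_model="loss",
    report_to=["tensorboard", "mlflow"],
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    eval_dataset=tokenized,           # a held-out split in practice
    tokenizer=tokenizer,              # provides dynamic padding
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()                       # uses all visible GPUs via data parallelism
```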
For the initial setup, we started by training distilbert-base-uncased on English-language accounts only. Then we trained a model that could also give good results for other languages. For this, we experimented with xlm-roberta and distilbert-base-multilingual-cased, taking the last 5 agent and last 5 customer emails for all the deals. We found that xlm-roberta gave slightly better results, so we finally trained it using the last 20 agent and last 20 customer emails.
We used a p3.8xlarge instance for training with 4 Tesla V100 GPUs. With a sequence length of 128 and per device batch size of 128, we trained for 6 epochs, which took 7.5 hours on a dataset of 1.05M conversations.
Once we had trained a conversation-level model, we needed to introduce another model using predicted sentiment values of a series of conversations to predict the probability of winning the deal. However, using the same dataset to train both models will cause the second model to overfit. To avoid this issue, we trained the first model 5 times, holding out 20% of the data each time.
Consider a dataset of 10 conversation ids. To get scores for ids 0 and 1, exclude them while training the first model. To get scores for ids 2 and 3, train the first model again, this time excluding those ids, and so on:
- training_ids: [2, 3, 4, 5, 6, 7, 8, 9] val_ids: [0, 1]
- training_ids: [0, 1, 4, 5, 6, 7, 8, 9] val_ids: [2, 3]
- training_ids: [0, 1, 2, 3, 6, 7, 8, 9] val_ids: [4, 5]
- training_ids: [0, 1, 2, 3, 4, 5, 8, 9] val_ids: [6, 7]
- training_ids: [0, 1, 2, 3, 4, 5, 6, 7] val_ids: [8, 9]
Now use the predicted scores of these val_ids to train the second model.
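This is the standard out-of-fold prediction scheme; a small sketch with scikit-learn's KFold reproduces the splits above (the L1 training and scoring calls are only indicated in comments, as hypothetical wrappers around the Trainer code shown earlier).

```python
import numpy as np
from sklearn.model_selection import KFold

conversation_ids = np.arange(10)
oof_scores = np.full(len(conversation_ids), np.nan)

# 5 folds, no shuffling, reproducing the splits listed above.
for train_idx, val_idx in KFold(n_splits=5, shuffle=False).split(conversation_ids):
    print("training_ids:", train_idx.tolist(), "val_ids:", val_idx.tolist())
    # Train the L1 model on train_idx only, then score val_idx with it, e.g.
    #   oof_scores[val_idx] = l1_predict(l1_train(train_idx), val_idx)
    # where l1_train / l1_predict are hypothetical helpers.

# oof_scores then holds, for every conversation, a sentiment score from a model
# that never saw that conversation during training; these scores feed the L2 model.
```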
The second model is an XGBoost model. We take an array of the last 5 sentiment scores of agent emails and the last 5 sentiment scores of customer emails. The following table displays the data:
deal_id | agent_scores | customer_scores | label |
---|---|---|---|
1 | [0.3, 0.1, 0.8, 0.2, NaN] | [0.4, 0.5, 0.3, 0.7, 0.1] | 1 |
2 | [0.5, 0.2, NaN, NaN, NaN] | [0.1, 0.2, 0.2, NaN, NaN] | 0 |
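The following is a minimal sketch of training this second model on such arrays, using toy rows that mirror the table above; the hyperparameters are hypothetical. XGBoost handles the NaN entries (deals with fewer than 5 agent or customer emails) natively.

```python
import numpy as np
import xgboost as xgb

# Feature matrix: last 5 agent sentiment scores followed by last 5 customer
# sentiment scores per deal; missing slots stay as NaN.
X = np.array([
    [0.3, 0.1, 0.8, 0.2, np.nan, 0.4, 0.5, 0.3, 0.7, 0.1],
    [0.5, 0.2, np.nan, np.nan, np.nan, 0.1, 0.2, 0.2, np.nan, np.nan],
])
y = np.array([1, 0])

l2_model = xgb.XGBClassifier(n_estimators=100, max_depth=3)  # hypothetical settings
l2_model.fit(X, y)

# Probability of winning the deal for each sequence of sentiment scores.
print(l2_model.predict_proba(X)[:, 1])
```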
Metrics
- L1 model (xlm-roberta): AUC 0.72, accuracy 0.67
- L2 model (XGBoost): AUC 0.84, accuracy 0.77
Deploying the model
We used ONNX optimization to reduce the latency of the L1 model. The chart shows the improvement in inference time with and without ONNX optimization. The AUC of the quantized and optimized ONNX model came to 0.8393, which is a negligible difference.
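The post does not spell out the tooling used for the ONNX conversion; one possible route, sketched here with Hugging Face Optimum and ONNX Runtime dynamic quantization (paths and quantization config are assumptions), looks like this:

```python
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig
from transformers import AutoTokenizer, pipeline

checkpoint = "sentiment-l1"  # hypothetical path to the fine-tuned L1 model

# Export the fine-tuned model to ONNX.
ort_model = ORTModelForSequenceClassification.from_pretrained(checkpoint, export=True)
ort_model.save_pretrained("sentiment-l1-onnx")

# Apply dynamic (post-training) quantization with ONNX Runtime.
quantizer = ORTQuantizer.from_pretrained(ort_model)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="sentiment-l1-onnx-quantized", quantization_config=qconfig)

# Serve the quantized model through the usual pipeline interface.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
quantized = ORTModelForSequenceClassification.from_pretrained("sentiment-l1-onnx-quantized")
classifier = pipeline("text-classification", model=quantized, tokenizer=tokenizer)
print(classifier("__cust__ We are still evaluating the product."))
```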
The throughput of the deployed model, i.e., how many requests it can process in a given amount of time, is approximately 53 requests per second.
The xlm-roberta model mentioned above and the XGBoost model are deployed using SageMaker. We receive the email payloads through a Kafka stream, and the consumer sends requests to the SageMaker endpoint.
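A rough sketch of that serving path, assuming kafka-python on the consumer side and boto3 for the endpoint call; the topic, field, and endpoint names are hypothetical.

```python
import json

import boto3
from kafka import KafkaConsumer

# Hypothetical topic, broker, and payload shape.
consumer = KafkaConsumer(
    "email-conversations",
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
runtime = boto3.client("sagemaker-runtime")

for message in consumer:
    payload = {"text": message.value["text"], "speaker": message.value["speaker"]}
    response = runtime.invoke_endpoint(
        EndpointName="email-sentiment-l1",   # hypothetical endpoint name
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    sentiment = json.loads(response["Body"].read())
    # The L1 score is then combined with earlier scores for the deal and sent
    # to the XGBoost (L2) endpoint to update the deal's win probability.
```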
Room for further improvement
Emails like out-of-office responses and meeting invites are not indicative of sentiment. We plan to remove such emails in the next version.
Our colleagues Srivatsa Narasimha and Chetan Bhat have designed an architecture called CAPE-NET to classify email threads, which can be explored for our use case.
For the L2 model, we have simply used the last 5 agent and the last 5 customer sentiment scores. We plan to experiment with the arrangement of these scores and add more features.
Note of Thanks: I would like to thank Suvrat Hiran, Srivatsa Narasimha, and Chetan Bhat for their guidance throughout the project.