Decoding explainable AI using phrase extraction and multi-task learning

Explainable AI consists of tools and frameworks that enable developers to understand and interpret predictions generated by machine learning (ML) models.

Interpretable insights are insights that users can easily understand. As predictive models become more and more “cutting edge,” they also behave more and more like “black boxes,” and users find it increasingly difficult to understand the reasoning behind their predictions. Consequently, there has been significant research effort devoted just to interpreting the predictions of state-of-the-art machine learning models.

Explainable AI helps translate numerical predictions into human-readable, interpretable formats. In our use case, one way to apply explainable AI to deal conversion scores is to provide supporting reasons for why a score is high or low. Agents can easily understand these human-readable reasons and use them to decide on a future course of action to improve the health of the deal.

In the process of improving explainability, we realized that a lot more information can be extracted from the interactions between the customer and the agent to augment the deal scores. Emails, chat messages, and voice transcripts are a treasure trove of context and information. For this project, we restricted ourselves to email interactions only, and we would like to extend this to chat messages and calls in the future. 

In the following scenario, you can see an excerpt from a real interaction between a sales agent and a customer. This particular deal has a score of 37 and an “at-risk” tag. In the interaction, the customer is unhappy with some features of the service desk product (“It seems that we lose the access to Employee onboarding when we do so,” “reluctant to move across to it if we have to re-create this feature ourselves”) and with the lack of an immediate response from the agent (“Disappointed that once again I’m having to chase for an update”).

Now, if we could extract these key action points (a particular product feature the customer likes or dislikes, or something that describes how the customer is feeling at that moment) and display them along with the deal score, the agent would better understand the deal score and gain better visibility into the customer's behavior.

[Figure: Explainable AI conversation]

Consequently, we focused on developing a phrase extraction model as part of the explainable AI feature for deal insights.

A background in Freshsales

Freshsales, our CRM product, has various AI-based sales assistant features to help the sales agent during negotiations with customers to increase the chances of deal closure, thus increasing sales and revenue. Deal Insights, an AI feature, helps the agent make informed decisions when dealing with a potential customer.

[Figure: Freddy AI insights]

Sentiment classification

As part of the set of signals that go into the deal insights model, we include email sentiment signals predicted by a separate sentiment classification model, which predicts a deal score based solely on the most recent emails. The model is trained on the “Won” or “Lost” flags of the deals those emails belong to.

Explainable AI features

Although sentiment scores (whether a deal is likely to be won or lost based on the sentiment of an email) are included as signals to predict the deal score, these interactions can also play an essential role in improving explainability.

Explainable AI: Phrase extraction

Our methodology involves a supervised technique where we fine-tune the pre-trained BERT model using a golden dataset of emails tagged with key phrases by a dedicated tagging team.

We used DistilBERT, a small, fast, cheap, and light transformer model trained by distilling the base BERT model. It has 40% fewer parameters than bert-base-uncased and runs 60% faster while preserving over 95% of BERT’s performance as measured on various language understanding tasks.

We cast phrase extraction as a span boundary detection problem and trained a DistilBERT model to learn the start and end indices of the key phrase in each email. This turns the extraction problem into a classification problem over token positions: the predicted start and end indices are compared against the actual start and end indices of the tagged phrase to calculate a cross-entropy loss, and the sum of the start loss and end loss is then backpropagated.
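To make the setup concrete, here is a minimal sketch, assuming the tagged phrase occurs verbatim in the email, of converting a phrase into start and end token labels via a Hugging Face fast tokenizer’s offset mapping. The function name and defaults are illustrative, not our production labeling code:

```python
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

def phrase_to_span_labels(email: str, phrase: str, max_len: int = 128):
    """Return (start_token_idx, end_token_idx) of `phrase` inside `email`."""
    char_start = email.find(phrase)          # assumes the phrase occurs verbatim
    char_end = char_start + len(phrase)
    enc = tokenizer(email, truncation=True, max_length=max_len,
                    return_offsets_mapping=True)
    start_idx = end_idx = 0                  # index 0 ([CLS]) doubles as "no phrase"
    for i, (s, e) in enumerate(enc["offset_mapping"]):
        if s <= char_start < e:
            start_idx = i
        if s < char_end <= e:
            end_idx = i
    return start_idx, end_idx

print(phrase_to_span_labels(
    "Disappointed that once again I'm having to chase for an update.",
    "having to chase for an update"))
```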

[Figure: Explainable AI phrase extraction]

During training, emails are passed into the DistilBERT model, which produces fixed-dimension (768) embeddings for all the tokens. Specifically, it provides two types of embeddings (illustrated in the sketch after the list): one embedding for each token in the sequence, and a pooled embedding that represents the entire email.

  • Sequence embedding of shape (batch_size × sequence_length × 768)
  • Pooled embedding ([CLS] token) of shape (batch_size × 768)
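
A minimal sketch of obtaining these two embeddings with the public distilbert-base-uncased checkpoint (DistilBERT has no separate pooler, so the embedding of the first token, [CLS], is taken as the pooled representation):

```python
import torch
from transformers import DistilBertModel, DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

emails = ["Disappointed that once again I'm having to chase for an update."]
batch = tokenizer(emails, padding="max_length", truncation=True,
                  max_length=128, return_tensors="pt")

out = model(**batch)  # gradients kept so the later fine-tuning sketches can backpropagate

sequence_embedding = out.last_hidden_state   # (batch_size, sequence_length, 768)
pooled_embedding = sequence_embedding[:, 0]  # [CLS] token: (batch_size, 768)
print(sequence_embedding.shape, pooled_embedding.shape)
```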

Let n be the sequence length (the number of tokens in an email) and h be the token-wise output of the DistilBERT layer, of shape batch_size × sequence_length × 768.

Sequence length is a hyperparameter that we set based on the length of the majority of the emails in the training set.
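One simple way to pick that value is from the distribution of email lengths; here train_emails is a hypothetical list of training emails, the tokenizer comes from the sketch above, and the 95th percentile is an illustrative cutoff:

```python
import numpy as np

# Cover the bulk of emails without padding everything to the longest one.
token_counts = [len(tokenizer.tokenize(e)) for e in train_emails]
max_seq_len = int(np.percentile(token_counts, 95))
```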

We take this sequence embedding and pass it through a linear layer. The linear layer squishes the 768-dimensional token representation into a 2-dimensional one: the first dimension is meant for identifying the start token index and the second for the end token index. So, if W is the weight matrix of the linear layer (768 × 2), the output embedding of the linear layer is h_l = h · W, of shape batch_size × sequence_length × 2. This embedding is then split into two parts of shape batch_size × sequence_length × 1, denoted h_s and h_e, which feed the start and end token classifiers respectively.

[Figure: How start and end token classifiers work]
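In code, this boils down to a single linear head applied token-wise. A sketch reusing sequence_embedding from above; note that nn.Linear also adds a bias term, which the equation omits:

```python
import torch.nn as nn

span_head = nn.Linear(768, 2)    # weight matrix W of shape 768 x 2 (plus bias)

h = sequence_embedding           # (batch_size, sequence_length, 768)
h_l = span_head(h)               # h · W -> (batch_size, sequence_length, 2)
h_s, h_e = h_l.split(1, dim=-1)  # two parts, each (batch_size, sequence_length, 1)
start_logits = h_s.squeeze(-1)   # (batch_size, sequence_length)
end_logits = h_e.squeeze(-1)     # (batch_size, sequence_length)
```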

Start and end tokens

A softmax layer then normalizes the logits into a probability distribution over all the words in the sequence.

The logit value of every word in the sequence (n tokens) is compared against the actual starting position of the tagged key phrase using a cross-entropy loss function. A similar process identifies the end token, using a separate end token classifier.
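A sketch of the two losses. PyTorch’s F.cross_entropy applies the softmax internally, so it matches the softmax-then-cross-entropy description above; the gold positions here are illustrative:

```python
import torch
import torch.nn.functional as F

# Illustrative gold labels: token indices of the tagged phrase boundaries.
start_positions = torch.tensor([7])
end_positions = torch.tensor([12])

# Each loss compares a distribution over all n token positions
# with the true boundary position.
start_loss = F.cross_entropy(start_logits, start_positions)
end_loss = F.cross_entropy(end_logits, end_positions)
extraction_loss = start_loss + end_loss  # the total span loss that is backpropagated
```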

[Figure: Explainable AI softmax graph]

The total loss is backpropagated to fine-tune the weights of the two classifiers and DistilBERT.

Multi-task training in sentiment classification and keyphrase extraction

To understand multi-task training, please refer to one of our earlier blogs.

Learning to identify the essential parts of an email during extraction might also improve classification performance, as the model can now focus on those significant parts of the email when predicting deal closure. The reverse also holds: the extracted phrases become more related to the deal closure outcome, which makes the displayed phrases more understandable.

[Figure: Explainable AI keyphrase extraction]

Intuitively, the information embedded in the output of the DistilBERT model is shared across both the classification and the phrase extraction tasks, so patterns learned from one task become accessible to the other and vice versa. The total loss is a linear combination of the classification and extraction losses: L(total) = alpha × L(classification) + beta × L(phrase), where alpha and beta are weights that control the contribution of the two losses. The DistilBERT weights are thus fine-tuned by learning from the errors of both tasks.
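Putting the pieces together, here is a sketch of the joint objective. The classification head and deal_labels are illustrative placeholders, and alpha and beta are taken from the best-performing setting in the results below:

```python
# Hypothetical Won/Lost classification head over the pooled [CLS] embedding.
classifier = nn.Linear(768, 2)
class_logits = classifier(pooled_embedding)   # (batch_size, 2)
deal_labels = torch.tensor([1])               # illustrative Won(1)/Lost(0) flag
classification_loss = F.cross_entropy(class_logits, deal_labels)

alpha, beta = 0.6, 0.4                        # best-performing weights (see table below)
total_loss = alpha * classification_loss + beta * extraction_loss
total_loss.backward()  # errors from both tasks fine-tune the shared DistilBERT
```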

[Figure: Freddy AI deal prediction]

This is how the final architecture incorporates the multi-task training pipeline. The email’s sentiment score is used as a signal in the deal insights model, which produces the final deal score. At the same time, the pipeline outputs the extracted phrase, which is displayed as part of the explainable AI features: the deal tag and the reason for the deal tag, as shown earlier.

Performance metrics 

Jaccard score for phrase extraction: We used the Jaccard score to evaluate the quality of the phrases predicted by our model. The Jaccard score is the ratio of the number of tokens common to two strings to the number of unique tokens across the two strings (a small implementation sketch follows the definition below):

Jaccard(A, B) = |A ∩ B| / |A ∪ B|, where:

  • Set A – set of unique tokens of string 1
  • Set B – set of unique tokens of string 2
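
Under this definition, a minimal Python implementation might look like the following; the example strings are taken from the comparison table later in this post:

```python
# Token-level Jaccard score between a predicted and a tagged phrase.
def jaccard(str1: str, str2: str) -> float:
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    if not a and not b:
        return 1.0  # two empty phrases count as a perfect match
    return len(a & b) / len(a | b)

# 4 shared tokens, 6 unique tokens overall -> 4 / 6 ≈ 0.67
print(jaccard("Adding another agent today",
              "Adding another agent today to Freshdesk"))
```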

AUC-ROC for sentiment classification: We employed the AUC-ROC curve to measure the predictive power of our model.
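For reference, a minimal sketch of computing AUC-ROC with scikit-learn; the labels and scores below are illustrative, not our data:

```python
from sklearn.metrics import roc_auc_score

y_true = [1, 0, 1, 1, 0]                  # illustrative Won(1)/Lost(0) flags
y_score = [0.81, 0.22, 0.65, 0.47, 0.35]  # illustrative predicted probabilities
print(roc_auc_score(y_true, y_score))
```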

The following tables provide details on our test set of 35K emails:

Separate phrase extraction and sentiment classification models | Values
Sentiment classification AUC | 0.79
Phrase extraction average Jaccard score | 0.67

Joint training of the sentiment and extraction models with varying alpha and beta:

Beta | Alpha | Extraction Jaccard score | Classification AUC
0.9 | 0.1 | 0.67 | 0.58
0.8 | 0.2 | 0.69 | 0.61
0.7 | 0.3 | 0.69 | 0.65
0.6 | 0.4 | 0.70 | 0.73
0.5 | 0.5 | 0.71 | 0.81
0.4 | 0.6 | 0.70 | 0.83
0.3 | 0.7 | 0.63 | 0.81
0.2 | 0.8 | 0.60 | 0.81
0.1 | 0.9 | 0.55 | 0.79

At alpha = 0.6 and beta = 0.4, the Jaccard score and classification AUC are higher by 4.4% and 5%, respectively, than the individual model metrics. As the phrase extraction performance gradually increases, the corresponding sentiment classification performance also improves, although beyond a certain point too large a contribution from one loss is detrimental to the other task. This shows that by sharing the DistilBERT model parameters, the phrase extraction pipeline helps improve classification performance and vice versa.

Comparison of phrases between the separate model and the multi-task model

Email | Separate model | Multi-task model
I just accepted the morning meeting. We are at a person company looking for a strong CRM plus support platform. | We are at a person company looking for a strong CRM | Looking for a strong CRM plus service and support platform
My apologies for the delay in getting this processed! We have been waiting for grant funding, which can be a timely process! I will keep you posted, but you should be receiving additional information and a PO within a week. | Waiting for grant funding | Should be receiving additional information and a PO within the week.
I will be adding another agent today to Freshdesk, which will bring the total to agents. | Adding another agent today | Adding another agent today to Freshdesk
Thank you for letting me know. I appreciate your prompt response. | Appreciate your prompt response | Appreciate your prompt response

You can see that the multi-task model’s predicted phrases are a bit more logical and nuanced compared to those of the separate model. 

Here are a few more examples where the model does not predict anything when there are no important points in the email.

Email | Predicted phrase
CA has accepted this invitation. Freshcaller q & a, when fri sep : pm - sat sep, : am indian standard time - kolkata | (none)
This pharmacy is now bought by another chain called "XXXX" that is its name not number; management changes | (none)

Conclusion

Treating phrase extraction as a supervised span detection task helps extract relevant, coherent, and domain-specific phrases compared with unsupervised techniques.

Multi-task learning helps improve both the classification and extraction tasks. The results show that the shared-parameter BERT model learns from both tasks, has a complementary effect on performance, and outperforms the individual models.

As far as future improvements are concerned, we are working on a multiple-phrase extraction model that will extract several key points carrying different contexts from an email.

 

Contributing author: Chetan Bhat is a data science leader with many years of data-driven problem-solving experience across industries, job functions, and complex problems. He currently drives data science (Freddy) for Freshworks CRM, which encompasses a host of in-product machine learning capabilities powered by deep learning, deep natural language processing (NLP), representation learning, and explainable AI.