How to finetune a pre-trained model for semantic search

A pre-trained model in natural language processing (NLP) is a saved neural network that was previously trained on a large dataset.

While neural network-based models offer state-of-the-art accuracy in document search, out of the box they may not deliver the same performance on industry documents. This is primarily because domain-specific queries and documents differ from the open-source datasets these models are trained on.

Building on our previous insights on crafting a monolithic search service, this blog post demonstrates a method to finetune the pre-trained search model on a custom domain with limited data.

Why we use pre-trained models for search

Traditionally, search involved extracting linguistic features such as intent, keywords, etc., from the user query and matching them with weighted scoring. Developing these proprietary algorithms requires considerable effort and domain expertise. But with the advent of transformer-based large language models (LLMs), crafting these linguistic features is no longer necessary. There are now LLMs trained to compute semantic similarity between a query and a document.
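As an illustration, here is a minimal sketch of how a pre-trained bi-encoder scores documents against a query. It assumes the sentence-transformers library; the model name and texts are illustrative, not our production setup.

```python
# Minimal sketch: scoring documents against a query with a pre-trained bi-encoder.
# The model checkpoint and texts are illustrative examples.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any pre-trained embedding model

query = "How do I get VPN access?"
documents = [
    "VPN Access Request: raise this ticket to get VPN credentials.",
    "Annual leave policy for full-time employees.",
]

# Encode the query and documents into dense vectors and rank by cosine similarity.
query_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(documents, convert_to_tensor=True)
scores = util.cos_sim(query_emb, doc_embs)[0]

for doc, score in sorted(zip(documents, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {doc}")
```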

For instance, Google replaced its traditional search with a transformer-based model in 2020. Besides reducing the effort spent on extracting linguistic features, these models can fetch documents that are semantically as well as lexically similar. Thus, modern search engines rely on trained LLMs to fetch relevant documents for a user query. After adopting a transformer-based model for Freshservice search, we saw a 20% improvement in search performance and a simpler search system.

Limitations of pre-trained models

While these pre-trained models are powerful, they have certain limitations:

  • Domain drift: They are trained on open-source datasets, usually crawled from news and wiki articles, which may not represent the industry document space. Thus, they don’t understand terms like “freshservice,” which is an entity specific to Freshworks and its customers.
  • Improving models: No machine learning (ML) model is 100% accurate. This presents the challenge of improving model performance for our customers without having a large labeled dataset to train on. 

To solve the challenges of domain drift and improving models, we have outlined an approach that relies on preparing queries from the documents of the ITSM domain and then training our model on these queries for domain understanding. This process can be replicated across all domains.

Contrastive Training

We use contrastive learning to finetune the pre-trained model to our domain. 

For this technique, we need training data in the form of triplets <query, positive_sample, negative_sample>, where query is the user query, positive_sample is the text of a relevant document, and negative_sample is the text of a document dissimilar to the query. Since no such labeled triplets existed for our domain, we rely on the techniques described in the following sections to get the training data required to finetune this model.
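To make the setup concrete, here is a hedged sketch of contrastive finetuning on such triplets, assuming the sentence-transformers library and its TripletLoss; the base model and the toy triplet are illustrative, not our exact production configuration.

```python
# Sketch: contrastive finetuning on <query, positive_sample, negative_sample> triplets.
# Base model and triplets are illustrative placeholders.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")

# Each training sample is a triplet: (query, relevant document, dissimilar document).
train_examples = [
    InputExample(texts=[
        "Need pycharm access",                    # query
        "IntelliJ Access: request IDE licenses",  # positive_sample
        "Annual leave policy for employees",      # negative_sample
    ]),
    # ... thousands more triplets, generated as described in the next sections
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model)

# Finetune: pull queries toward their positives and push them away from negatives.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
model.save("finetuned-search-model")
```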

Generating Queries using T5

Information Retrieval (IR) datasets such as MS MARCO map documents to relevant user queries. Traditionally, the queries are used as search input to fetch relevant documents from large corpora. What if we could reverse this mapping and train a model to generate a query given a document? Sentence Transformers does exactly that on top of the T5 model. We use this pre-trained model to generate queries on the documents of our domain. We usually select a sample of the most-used documents (~5,000) and get ~15,000 queries. This finetuning leads to an improvement of 8% in precision (refer to search metrics).
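For reference, here is a minimal sketch of this query-generation step, assuming a publicly available doc2query-style checkpoint (BeIR/query-gen-msmarco-t5-base-v1); any similar T5 query-generation model would work the same way.

```python
# Sketch: generating synthetic queries from a document with a T5 query-generation model.
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "BeIR/query-gen-msmarco-t5-base-v1"  # assumption: any doc2query-style T5 checkpoint
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

document = "VPN Access Request: use this service item to get VPN credentials for remote work."

inputs = tokenizer(document, return_tensors="pt", truncation=True, max_length=384)
# Sample ~3 queries per document; with ~5,000 documents this yields roughly 15,000 queries.
outputs = model.generate(
    **inputs, max_length=64, do_sample=True, top_p=0.95, num_return_sequences=3
)
for out in outputs:
    print(tokenizer.decode(out, skip_special_tokens=True))
```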

Persisting Problems

  1. Lack of an explicit link between the query and the relevant service request (SR) or solution article (SA): Traditional IR queries are information-seeking or navigational, for instance, “What is the leave policy?” or “How to get VPN access?” But in ITSM, many queries relate to troubleshooting or employee needs; our queries usually look like “My VPN is not working” or “Need pycharm access”, and the relevant response to the latter may be an SR titled “IntelliJ Access”. The link between the query and the service request is thus not present in the text but in the domain understanding, and the T5 model cannot generate such queries.
  2. Relevance assigned to words that are not important in our domain: In a typical IR system, words like “access” or “request” are assigned a higher weightage, so when a query contains these terms, documents that also contain them get a higher retrieval score. But in ITSM, roughly 25% of all SRs contain these terms, leading to a lot of false positives. For instance, for a query with only one relevant SR, the pre-trained model would assign high scores to two other, unrelated SRs as well, simply because of the presence of words like “access” or “request.”

We solve these two problems in the following ways: 

  1. Handcrafting queries: We realized that to match queries and documents with low textual overlap but high domain-specific semantic similarity, we would need curated training samples. However, we did not have good-quality labeled data. Furthermore, a contrastive training approach needs good negatives in the training dataset. So we identified a set of topics that had high-frequency queries and were failing with our pre-trained model. For these topics, we hand-crafted queries and added the wrong documents returned by our pre-trained model as negatives. We created ~10 queries per topic to achieve optimal performance.
  2. Creating templates for queries: As previously mentioned, 25% of our SRs contain the word “request” or “access”. These SRs are usually titled like “Software Access Request” or “VPN Access.” Further, we observed that user queries for these SRs also follow a common pattern, such as “Need VPN access” or “photoshop request”. Thus, for this set of SRs, we created a template out of the query patterns and generated queries based on the SR content and the template (see the sketch after this list). From the set of 5,000 documents (SRs and SAs) we selected for generating training data, we took the 25% of SRs that contained these words and generated ~1,200 queries. Adding these queries taught our model to place less importance on the words “request” or “access” and more importance on entities such as “photoshop.”
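Below is a hedged sketch of the template-based query generation described in item 2. The templates, SR titles, and the entity-extraction rule are illustrative stand-ins for our production lists.

```python
# Sketch: generating queries for "access"/"request"-style SRs from hand-written templates.
# Templates, SR titles and the entity-extraction rule are illustrative examples.
import itertools
import re

templates = [
    "Need {entity} access",
    "{entity} request",
    "How do I get {entity}?",
    "Requesting access to {entity}",
]

sr_titles = ["Software Access Request", "VPN Access", "Photoshop Request"]

def extract_entity(title: str) -> str:
    # Drop the generic words so the query focuses on the entity, e.g. "photoshop".
    return re.sub(r"\b(access|request)\b", "", title, flags=re.IGNORECASE).strip().lower()

generated = []
for title, template in itertools.product(sr_titles, templates):
    entity = extract_entity(title)
    if entity:  # skip titles that contain only the generic words
        generated.append((template.format(entity=entity), title))

for query, sr in generated:
    print(f"{query!r}  ->  {sr}")
```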

We present the outcome of adding these queries to our training set in the following table. The validation set consists of real-world queries with corresponding correctly labeled documents, created by annotators. We measure the precision and recall for retrieval as well as the hit rate for the top 3 results.

Approach | Precision @ 1 | Precision @ 3 | Recall @ 1 | Recall @ 3 | F1 @ 1 | F1 @ 3
t5 generated | +1.05% | +0.55% | +0.77% | -0.05% | +0.73% | +0.32%
t5 + templates | +1.60% | +1.25% | +1.27% | +1.23% | +1.41% | +1.26%
t5 + handcrafted | +1.10% | +1.58% | +0.98% | +1.55% | +1.47% | +1.59%
t5 + templates + handcrafted | +3.30% | +1.78% | +2.46% | +2.04% | +2.34% | +1.91%
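For clarity, here is a small sketch of how the retrieval metrics reported above can be computed against the labeled validation set; the document titles below are made up for illustration.

```python
# Sketch: precision@k, recall@k and hit rate for a single query.
def precision_at_k(retrieved, relevant, k):
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    return sum(1 for doc in retrieved[:k] if doc in relevant) / max(len(relevant), 1)

def hit_rate_at_k(retrieved, relevant, k):
    return 1.0 if any(doc in relevant for doc in retrieved[:k]) else 0.0

# Example: the model returned three documents, one of which is labeled relevant.
retrieved = ["SR: VPN Access", "SR: Software Request", "SA: Leave Policy"]
relevant = {"SR: VPN Access"}
print(precision_at_k(retrieved, relevant, 3))  # ~0.33
print(recall_at_k(retrieved, relevant, 3))     # 1.0
print(hit_rate_at_k(retrieved, relevant, 3))   # 1.0
```

In practice, these per-query values are averaged over all queries in the validation set.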

Using the method explained above, we gained two major advantages:

  1. A performance boost on queries that were previously outside the pre-trained model’s domain. We now have a method for making targeted improvements to our model’s performance.
  2. Improved score separation between true positives (TP) and false positives (FP). The scores above are computed with a score threshold of 0. In production, however, we apply a threshold on the score to curtail false positives at the cost of some recall. With this training method, TP scores increased and FP scores decreased, allowing us to curtail FPs with less loss of recall. Refer to the following graphs for the score distributions: the left graph represents the pre-trained model and the right one the model trained on all types of queries. The pink scores are false positives and the green ones are true positives. In the left (pre-trained) graph, the FP scores start to peak at 1.3 – 1.45, whereas in the right (fine-tuned) graph, the FP scores peak at 1.2 – 1.35.

[Figure: score distributions of true and false positives for the pre-trained model (left) and the fine-tuned model (right)]

Conclusion

In this blog, we present a domain adaptation method to fine-tune a pre-trained search model. We highlight the importance of writing good-quality training samples to achieve the desired model output. The approach presented above can be translated to any domain in order to improve the understanding of specific words and topics. Stay tuned for the blog on how we shipped this model to a search system with over a million documents.

Primary author: Sugam Garg was a Senior Data Scientist in the Virtual Agent (Freddy) team at Freshworks. He’s passionate about solving AI problems, reading about the latest technologies, and sharing his knowledge. When he’s not coding, Sugam usually spends his time watching football and traveling. You can reach out to him on LinkedIn or mail him at: sugam110795@gmail.com