The smart guide to (de)dupe dirty data

If you are a James, John, or Robert, you have among the most common names in the United States, making up a demographic larger than the populations of New Hampshire, Hawaii and Idaho combined. If you are a James Smith, you have the most common first and last names in the United States. And you are a nightmare.

Names such as James, Mary and David clog databases everywhere, contributing to a $2.5-3 trillion a year problem. That’s for US companies alone. Along with Bob for Robert, Liz for Elizabeth, and Bill for William, these names add to enormous pools of dirty data, or data that are incomplete, incorrect, outdated, or redundant.

Such data, particularly duplicate or redundant data, pose a significant challenge for companies. Not only do they mislead teams—you don’t want multiple sales agents calling Linda to pitch the same product—but also cost companies millions of dollars for storage space they don’t need.

Data redundancy posed a huge problem for Freshworks as well. Out of 3 million customer leads for Freshworks about a year ago, about 200,000 were potential duplicates. Across 100 million leads for all Freshworks customers, there were potentially about 30 million duplicate pairs.

Freshsales, our AI-powered lead scoring and customer relationship management tool, had a de-duplication, or dedupe, tool to weed out such duplicates. But that tool could check only for duplicates bearing exactly the same data, such as names spelled the same way, and it struggled with common names such as John and Smith. Only 1% of the suggestions it made for merging duplicate entries were accurate.

Freshsales needed a smarter solution.

In January 2019, Swaminathan Padmanabhan, director of Data Science at Freshsales, tasked my three-member team, newly created in Bangalore, with building a smart dedupe tool that could be deployed at scale. In under five months, we had a machine-learning-powered tool boasting a merge rate of 20%—a significant jump from the earlier 1%.

Our proprietary dedupe tool can analyze multiple fields and ferret out duplicates from more than 200 million records in real-time. It is smart enough to proactively detect duplicate entries even with names like John and Richard by comparing against fields such as phone numbers and company names.

Equipped with this advanced dedupe feature, Freddy, our AI-powered assistant, constantly learns from the data stored in Freshsales to detect and report duplicate leads and contacts, helping users maintain clear and accurate data. Now, more than 10,000 records are either deleted or merged with existing records in Freshsales every week.

A ground-up challenge

In today’s technology-driven economy, most industries depend on databases to streamline their operations. Inevitably, then, the quality of that data has significant implications for their business. The ideal database, unfortunately, doesn’t exist.

Database quality is often compromised by errors such as spelling mistakes and typos as well as the use of multiple conventions. Two records could exist for Justin Smith, with an incorrect Smyth entered in one of those. Last names could be entered as first names. International Business Machines could be entered as IBM, the University of Madras as Madras University, and Nexon America as Nexon Co. Ltd. Phone numbers come in all shapes.

The result: the annoying problem of duplicate data and a bloated pipeline.

Duplicates: A human problem

Our task was to build a deduplication feature for Freshsales that, on an elementary level, could:

  1. Find potential duplicates for a particular record and inform the user in real-time;
  2. For a particular record, find other people from the same company;
  3. Help sales teams organize their database.

Most dedupe systems are restricted to matching records only on exact field values such as name, company name, email address, and phone number. This often surfaces incorrect matches when the values are common. Smith is a common name in the US and Canada, and a match on Smith could pull thousands of wrong records as potential duplicates.

We needed a completely different approach. Ground-up.

Objectives and challenges

We decided on these primary objectives to begin with:

  • Build a robust dedupe tool that would not just scoop out more duplicates in the system but also do it far more accurately;
  • The tool should be capable of identifying potential duplicates as soon as a record is entered or created in the system;
  • It should be able to detect duplicates irrespective of spelling errors, typos, empty fields, punctuation, field mismatches, abbreviations, and variations in how phone numbers are entered;
  • The tool should be able to continuously improve based on user feedback;
  • The new dedupe tool should be built for scale to keep pace with Freshsales’ rapid growth and work for hundreds of millions of records. (Comparing every record smartly against every other record is an O(n²) problem.)

We faced several implementation challenges in building such a system for Freshsales.

  • We had to pre-compute duplicates from more than 100 million records. To do this, we generated over 70 million duplicates and stored them in the system.
  • Our code had to be written in such a way that it could work on top of Spark as well as native Python. We used Spark to compute duplicates for existing records and native Python for real-time deduplication.
  • We had to build a single machine-learning tool that would work across different customer systems. Building a model that could generalize well was not easy.

Three-pronged approach

Most dedupe tools find duplicate records when there is an exact string match. Also, they tend to become slow as the number of records in the database increases. We proposed an integrated system that could search and deduplicate records in real-time as well as run distributed jobs to find and store duplicates for existing records in the database. We also needed to ensure that the increasing number of records did not slow the tool.

Phase I: Setting the stage

Cleaning fields: In the first phase of the project, our focus was on cleaning and transliterating record fields for accurate comparison during deduplication. Punctuation, numbers, and salutations were removed, and the strings were lower-cased. We also removed words such as company, Pvt, Ltd, corp, enterprise, and LLC; common email domains such as aol.net, google.com, and yahoo.com; and common email usernames such as sales, support, and admin.
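A rough sketch of that cleaning step in Python. The stop-word lists here are illustrative stand-ins; the production lists are longer:

```python
import re

# Hypothetical stop lists for illustration; real lists are more extensive.
COMPANY_WORDS = {"company", "pvt", "ltd", "corp", "enterprise", "llc", "inc"}
COMMON_EMAIL_USERS = {"sales", "support", "admin"}
COMMON_DOMAINS = {"aol.net", "google.com", "yahoo.com"}
SALUTATIONS = {"mr", "mrs", "ms", "dr"}

def clean_name(raw: str) -> str:
    """Lower-case, strip punctuation and digits, drop salutations."""
    s = re.sub(r"[^\w\s]|\d", " ", raw.lower())
    return " ".join(w for w in s.split() if w not in SALUTATIONS)

def clean_company(raw: str) -> str:
    """Lower-case, strip punctuation, drop generic company words."""
    s = re.sub(r"[^\w\s]", " ", raw.lower())
    return " ".join(w for w in s.split() if w not in COMPANY_WORDS)

def clean_email(raw: str) -> str:
    """Blank out generic usernames and common free-mail domains."""
    user, _, domain = raw.lower().partition("@")
    user = "" if user in COMMON_EMAIL_USERS else user
    domain = "" if domain in COMMON_DOMAINS else domain
    return f"{user}@{domain}".strip("@")
```

With this, "Mr. John Smith!" and "john smith" normalize to the same string, and generic parts of an email no longer dominate a comparison.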

Transliteration involves converting a word from the alphabet of one language to another. We detected non-English text by the presence of non-ASCII Unicode characters in the strings and used a one-to-one mapping for Latin and Cyrillic characters. For example, акмарал тулегенова is transliterated to Akmaral Tulegenova, allowing the model to recognize that both represent the same person.
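A minimal sketch of that transliteration, assuming a hand-built character map. Only a handful of Cyrillic characters are shown; a full table would cover the whole alphabet:

```python
# Deliberately tiny Cyrillic-to-Latin map for illustration only;
# a production table covers the full alphabet.
CYRILLIC_TO_LATIN = {
    "а": "a", "в": "v", "г": "g", "е": "e", "к": "k", "л": "l",
    "м": "m", "н": "n", "о": "o", "р": "r", "т": "t", "у": "u",
}

def has_non_ascii(text: str) -> bool:
    # Detect possibly non-English input by the presence of non-ASCII chars.
    return any(ord(ch) > 127 for ch in text)

def transliterate(text: str) -> str:
    # Map each character one-to-one if it appears in the table; keep others.
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in text.lower())
```

After this step, акмарал тулегенова and Akmaral Tulegenova collapse to the same lower-cased Latin string and can be compared directly.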

Generate blocks: Scaling deduplication to work for millions of records is challenging. Searching through every pair of records in a database for duplicates would drastically increase the computational cost and time. Enter blocks. We created groups of records, or blocks, based on common values so the tool can search for duplicates only within those blocks. The theory behind this method is that any two duplicate records would have at least one common value between them. Blocks also help solve the problem of the tool incorrectly identifying records that have similar entries as non-duplicates.

Tokenize: The simplest way to create a block is to tokenize records and group all the records belonging to a token. Tokens can be created based on fields such as name and email address as well as on parts of every word—two letters, three letters, or n letters of every word. These are called n-grams. For example, if the token is ‘Smith’, then any record that contains ‘Smith’ after tokenization will be grouped under this token.

We built tokens based on first names and last names after lower-casing them and removing any salutations and punctuation. So Mr. John and john would have the same token—john.

We also based tokens on the first five letters of names. This was to ensure that names with the same starting patterns would be captured under the same block. E.g. freshworks and freshwrk.

Tokens also took into account the last six digits of work and mobile numbers so state and country codes at the beginning would not affect block formation. So +91 (987)-654-3210 and 9876543210 would be in the same block.

And finally, for company names, we removed words such as company, Inc, corp, Llc, Ltd, and Pvt. So Apple Inc. and apple pvt ltd would be under the same block.
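Putting the token rules above together, block-key generation might look like the following sketch. The field names and key prefixes are hypothetical, not the production schema:

```python
import re

COMPANY_STOP = {"company", "inc", "corp", "llc", "ltd", "pvt"}

def block_keys(record: dict) -> set:
    """Generate block keys for one record (illustrative schema)."""
    keys = set()
    # Name tokens: full words plus first-five-letter prefixes.
    name = re.sub(r"[^\w\s]", "", record.get("name", "").lower())
    for word in name.split():
        keys.add(f"name:{word}")
        keys.add(f"name5:{word[:5]}")
    # Phone token: last six digits, so country/state codes don't matter.
    phone = re.sub(r"\D", "", record.get("phone", ""))
    if len(phone) >= 6:
        keys.add(f"phone:{phone[-6:]}")
    # Company token: drop generic suffixes like Inc, Pvt, Ltd.
    company = re.sub(r"[^\w\s]", "", record.get("company", "").lower())
    words = [w for w in company.split() if w not in COMPANY_STOP]
    if words:
        keys.add("company:" + " ".join(words))
    return keys
```

Two records for the same person then share at least one key even when the name, phone format, or company suffix differs.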

Block cutoff: Consider the block created from the token ‘john’. Because that is such a common name, the block might end up with hundreds of records that are similar but not real duplicates. That would be counterproductive, defeating the very purpose of creating blocks to reduce the search effort.

Our approach was to ignore such blocks, reducing the search space while delivering more accurate matches. If two records are true duplicates, they are likely to share a company name or phone number, so they would still fall into other common blocks.

We decided to create a 50-record threshold for each block, allowing for genuine leads with the ‘john’ problem to be grouped in a block using tokens that might not be too common.
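A sketch of block construction with that cutoff. The 50-record threshold comes from the text; the key function is whatever tokenizer is in use:

```python
from collections import defaultdict

BLOCK_CUTOFF = 50  # blocks larger than this are ignored

def build_blocks(records: dict, key_fn, cutoff: int = BLOCK_CUTOFF) -> dict:
    """Group record ids by block key, dropping overly common blocks.

    `records` maps record id -> record; `key_fn` yields block keys
    for one record (hypothetical interface for this sketch).
    """
    blocks = defaultdict(set)
    for rec_id, record in records.items():
        for key in key_fn(record):
            blocks[key].add(rec_id)
    # Discard blocks over the cutoff: too common to be informative.
    return {k: ids for k, ids in blocks.items() if len(ids) <= cutoff}
```

A genuine John duplicate is still found through its less common keys (phone digits, company name), even after the oversized ‘john’ block is dropped.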

Smart block selection: Even after implementing block cutoffs, we found that some large blocks had a lower probability of containing record pairs that were duplicates. So, as an additional measure, we employed a technique to choose records that have more blocks in common to identify probable duplicates.

Finally, to get the best of both methods, we combined them (an ensemble technique), selecting blocks from both to create record pairs with a very high likelihood of being duplicates. The pairs were then scored by a machine-learning model to obtain the probability of each being a duplicate.
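One way to sketch the common-blocks idea: count how many blocks each candidate pair shares, and keep pairs above a threshold. The threshold value here is an assumption for illustration:

```python
from collections import Counter
from itertools import combinations

def candidate_pairs(blocks: dict, min_common: int = 2) -> list:
    """Keep record pairs that co-occur in at least `min_common` blocks.

    Pairs sharing more blocks (name AND phone AND company) are more
    likely true duplicates than pairs sharing a single common token.
    """
    counts = Counter()
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            counts[(a, b)] += 1
    return [pair for pair, n in counts.items() if n >= min_common]
```

The surviving pairs are what get handed to the scoring model in the next phase.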

The Dedupe project

Phase II: Training the model

We employed a machine-learning-based algorithm capable of learning the non-linearity of patterns and the thresholds of features to identify a pair of records as duplicate or not. In particular, we used a random forest model to capture the non-linearity as well as to solve the problem of overfitting. The model is not too specific to the patterns in our training data but can distinguish records in a much more generic way.

To solve the problem of lacking tagged data, we used a semi-active learning technique in which we first tagged the data based on whether the email addresses of two records matched. Once the model was trained on this data, it was used to predict on a different test set. That set was manually curated and fused with the training data to retrain the model. The aim was to feed patterns of both duplicates and non-duplicates into the model.

Building a training and validation dataset: We lacked a golden dataset. The problem was initially unsupervised; we made it supervised by introducing a tagging heuristic, a self-learning mechanism. The rule: if the email addresses of two records matched, tag them as a duplicate pair; otherwise, tag them as non-duplicates.

The issue with this heuristic is that among pairs of records with different email addresses, some are very dissimilar while others are very similar, so the differentiation the model learns is an easy one with a visible distinction. Predictions for records that differ only slightly yet are still non-duplicates would not be as expected. Hence, we decided to apply the tagging to pairs generated within the blocks, which solves the issue to an extent.
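The email-matching heuristic used as a weak label can be sketched as:

```python
def heuristic_label(rec_a: dict, rec_b: dict) -> int:
    """Weak label: 1 (duplicate) if both records have the same
    non-empty email address, else 0 (non-duplicate)."""
    ea = rec_a.get("email", "").strip().lower()
    eb = rec_b.get("email", "").strip().lower()
    return int(bool(ea) and ea == eb)
```

Applied to pairs drawn from within blocks, this yields an initial labeled training set without any manual tagging.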

Random forest model: Given the mix of data types, this model captures non-linear patterns better than, say, a logistic regression model, and is less prone to overfitting.
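An illustrative random forest setup on pair-similarity features, with scikit-learn shown as one plausible implementation choice. The feature set and the two training pairs below are made up for the example:

```python
from difflib import SequenceMatcher
from sklearn.ensemble import RandomForestClassifier

def pair_features(a: dict, b: dict) -> list:
    """Similarity features for one record pair (illustrative set)."""
    sim = lambda x, y: SequenceMatcher(None, x, y).ratio()
    return [
        sim(a["name"], b["name"]),              # fuzzy name similarity
        sim(a["company"], b["company"]),        # fuzzy company similarity
        float(a["phone"][-6:] == b["phone"][-6:]),  # last-six-digit match
    ]

# Tiny synthetic training set: one duplicate pair, one non-duplicate pair.
dup = ({"name": "john smith", "company": "apple", "phone": "9876543210"},
       {"name": "jon smith", "company": "apple", "phone": "09876543210"})
non = ({"name": "john smith", "company": "apple", "phone": "9876543210"},
       {"name": "mary jones", "company": "nexon", "phone": "1234567890"})

X = [pair_features(*dup), pair_features(*non)]
y = [1, 0]
model = RandomForestClassifier(n_estimators=50, bootstrap=False,
                               random_state=0).fit(X, y)
```

The trained model then emits a duplicate probability per candidate pair via `predict_proba`, which is the score stored on graph edges later.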

Recurring training phase: We trained the model with the tagging mechanism, tested it on a different set of data, identified pairs of records where the predictions went haywire because of the tagging, corrected the tags, passed these records to the training dataset, and finally retrained the model.

Tagging the data based on a heuristic could produce pairs of records that are duplicates but tagged as the opposite. When the model is trained on such a dataset, the weights it learns for the different features and the patterns it picks up might not be correct. Hence, the predictions could go wrong for some test datasets.

So, in active learning, we passed a batch of test data to a colleague who looked at the pairs of records and tagged them. This removed the dependency on the heuristic. When this human-curated data was fed into the model and it was retrained, the weights learned on the features became more reliable.

Validating the model:

AUC-ROC Curve: We employed this evaluation metric to measure the predictive power of our model.

Average rank: To ensure duplicates are shown above non-duplicates even if their probability scores are low, we measured the average rank of a true duplicate against a record.

Average False Positive Rate: Showing a non-duplicate as a duplicate would defeat the purpose of the system. So we measured the false positive rate and tried to keep it in control by adjusting the threshold on duplicate scores.
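The AUC and false-positive-rate checks can be computed along these lines. The scores and the 0.5 threshold below are toy values for illustration:

```python
from sklearn.metrics import roc_auc_score

def false_positive_rate(y_true, scores, threshold):
    """Share of true non-duplicates scored above the duplicate threshold."""
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    negatives = sum(1 for t in y_true if t == 0)
    return fp / negatives if negatives else 0.0

# Toy evaluation set: 1 = true duplicate pair, 0 = non-duplicate pair.
y_true = [1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.6, 0.2, 0.1]

auc = roc_auc_score(y_true, scores)                      # ranking quality
fpr = false_positive_rate(y_true, scores, threshold=0.5)  # threshold check
```

Raising the threshold trades recall for a lower false positive rate, which is the knob mentioned above for keeping wrong suggestions in control.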

Manual validation: We maintained a list of cases of duplicates and non-duplicates that we expect our model to identify. This list was curated based on business needs.

The solution: What our new dedupe system can do

Phase III: Efficiency in storage and access

Now that we had identified the duplicates, we needed to make them accessible to the user. We had to store the data efficiently and surface the duplicates quickly on request. To achieve this, we decided to store duplicates in a graph, which also let us fetch the duplicates of a record in O(1) time.

Storing in a graph: A vertex represents a record that has one or more duplicates. An edge between two vertices stores the score predicted by the model, which represents how strong a duplicate pair they are.

Surfacing duplicates: To fetch the duplicates of a particular record, the graph is queried with that record’s ID to get the connected vertices. Because vertices are uniquely identifiable, the duplicates of a particular record are fetched in constant time. The edge ID encodes the IDs of the two vertices it connects, so duplicate records can be fetched from the edge ID itself without retrieving the connected vertex. In suggesting duplicates to users, the model ranks the results in descending order of relevance, with the topmost record being the most likely duplicate.
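A minimal in-memory sketch of this graph layout; a production system would use an actual graph store, but the access pattern is the same:

```python
class DuplicateGraph:
    """Adjacency-map sketch of the duplicate graph."""

    def __init__(self):
        self.adj = {}  # record_id -> {neighbor_id: duplicate score}

    def add_duplicate_pair(self, a, b, score):
        """Store an undirected edge carrying the model's score."""
        self.adj.setdefault(a, {})[b] = score
        self.adj.setdefault(b, {})[a] = score

    def edge_id(self, a, b):
        # Encode both vertex ids so they are recoverable from the edge alone.
        return f"{min(a, b)}--{max(a, b)}"

    def duplicates_of(self, record_id):
        """O(1) adjacency lookup; neighbors ranked best match first."""
        neighbors = self.adj.get(record_id, {})
        return sorted(neighbors.items(), key=lambda kv: kv[1], reverse=True)
```

Querying by record ID returns the ranked duplicate list directly, which is what the UI renders when a user opens a lead.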

Merge: This allows users to move the values of a duplicate lead or contact under the primary record and remove the duplicate.

Here’s how this works: Consider an example where there are multiple leads that bear the names Robert Andrews, Bob Andrews, and Rob andrews. When a user opens one of these leads, say, Robert Andrews, Freshsales detects and displays matching records as duplicates. The user can then merge the duplicate records, with Robert Andrews as the primary record.

User feedback: Users can tag suggestions as non-duplicates. The model also takes in pairs that users merge, treating them as confirmed duplicates. This feedback dataset is used to retrain the model.

Phase Next

Now that we have a base deduplication system in place, we see great potential to extend it to be a more powerful tool. Future versions should be able to customize suggestions based on customer requirements and feedback.

For one, we want to build smart connections that can help in identifying people working in the same organization, thus increasing discoverability of leads. Also, different customers have different needs. Say, if a customer wants to only see duplicates based on spelling mistakes and not phonetic mismatches, the dedupe tool should be able to do that. We are currently exploring methods to offer users this flexibility.

We will keep you posted!

Co-authors: Bharathi Balasubramaniam and Srivatsa Narasimha

Writing by: Feroze Jamal