Text Deduplication in SQL
Data deduplication is essential when importing similar data from different sources. Different providers store data differently, and several variations (both correct and incorrect) exist in the English language for names of people, companies, and entities in general. Deduplication is often made easier if there is a lot of other information associated with the data because it gives you several things to compare to identify a dupe (such as birthday for people, location for company, etc.). When you’re trying to identify duplicate names only, things get a bit tricky.
For my purposes, I’m trying to deduplicate lists of musical artists that I’ve gathered from many different websites. A quick-and-dirty method of comparing unequal but similar strings is to strip out any special (i.e. non-alphanumerical) characters and then compare them. This is a fantastic technique because it is a (relatively) quick operation and yields very good results.
SELECT a1.artistId, a1.artist, a2.artistId, a2.artist FROM artists a1 INNER JOIN artists a2 ON REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(LOWER(a1.artist),'the ',''),'a ',''),'.',''),'& ',''),'and ',''),'-','') LIKE REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(LOWER(a2.artist),'the ',''),'a ',''),'.',''),'& ',''),'and ',''),'-','');
Note that I also remove articles because they are often omitted, and the word “and” because it could be interchanged with ”&“.
Now this technique finds dupes in a lot of cases, but not all. If one source omits an artist’s nickname (“Damian Marley” vs “Damian ‘Jr. Gong’ Marley”), or a prefix in their name (“Lauryn Hill” vs “Ms. Lauryn Hill” vs “Miss Lauryn Hill”), this technique will not identify the duplicates.
Inspired by this post, I decided to build a table of rolling hashes and use the number of hits for different artists to identify duplicates. Rolling hashes move a window of fixed length through a string to generate small hashes for comparison. Using a window size of 4 (sort of arbitrarily chosen), I could break up the name “Lauryn Hill” like so (converting to lower case to aid comparison):
lauryn hill [laur] [aury] [uryn] [ryn ] [yn h] [n hi] [ hil] [hill]
To do this for the entire collection of artists, I can loop through the table until my hash window reaches the end of the longest string (using ”row_count()” to determine when that happens):
DECLARE windowSize INT; DECLARE idx INT; SET windowSize = 4; SET idx = 1; INSERT INTO hashes(artistId,hash) SELECT artistId, artist FROM artists WHERE CHAR_LENGTH(artist) < (windowSize); INSERT INTO hashes(artistId,hash) SELECT artistId, substr(artist,idx,windowSize) FROM artists WHERE CHAR_LENGTH(substr(artist,idx,windowSize)) >= (windowSize); while ROW_COUNT() > 0 do SET idx = idx + 1; INSERT INTO hashes(artistId,hash) SELECT artistId, substr(artist,idx,windowSize) FROM artists WHERE CHAR_LENGTH(substr(artist,idx,windowSize)) >= (windowSize); END while;
An important part of deduplication is determining how likely something is actually a dupe. I figured that the more hashes an artist matches with another artist, the more likely it was a dupe. Using this, I can order my matches by that difference so I can deal with the most likely dupes first.
INSERT INTO matches(artistId1,artistId2,matches) SELECT h1.artistId, h2.artistId, COUNT(1) AS matches FROM hashes h1 INNER JOIN hashes h2 ON h1.hash = h2.hash AND h1.artistId < h2.artistId WHERE NOT EXISTS ( SELECT 1 FROM falsePositives fp WHERE fp.artistId1 = h1.artistId AND fp.artistId2 = h2.artistId ) GROUP BY h1.artistId,h2.artistId HAVING COUNT(1) > 1; SELECT m.artistId1, a1.artist AS artist1, m.artistId2, a2.artist AS artist2, matches, (SELECT COUNT(1) FROM hashes h WHERE h.artistId = m.artistId1) - matches + (SELECT COUNT(1) FROM hashes h WHERE h.artistId = m.artistId2) - matches AS diff FROM matches m INNER JOIN artists a1 ON a1.artistId = m.artistId1 INNER JOIN artists a2 ON a2.artistId = m.artistId2 ORDER BY diff ASC;
A couple notes on performance. For an artist table size of around 1300 rows, my hashes table had around 20k rows. This will obviously vary depending on your chosen hash window size and the length of names you’re comparing. Additionally, indexing your hashing table is extremely important. An index on the “hash” column was critical because you’re essentially doing n^2 comparisons when you join on itself, but I also put an index on “artistId” to make the aggregate in the final query faster.
This code in MySQL stored procedures, as well as the schemas for the tables used, can be found on GitHub.
← Mapping the Snapchat Data Leak