Background: I have a list that contains 13,000 records of human names, some of them are duplicates and I want to find out the similar ones to do the manual duplication process.
For an array like:
["jeff","Jeff","mandy","king","queen"]
What would be an efficient way to get:
[["jeff","Jeff"]]
Explanation ["jeff","Jeff"] since their Levenshtein distance is 1(which can be variable like 3).
/*
Working but a slow solution
*/
function extractSimilarNames(uniqueNames) {
let similarNamesGroup = [];
for (let i = 0; i < uniqueNames.length; i++) {
//compare with the rest of the array
const currentName = uniqueNames[i];
let suspiciousNames = [];
for (let j = i + 1; j < uniqueNames.length; j++) {
const matchingName = uniqueNames[j];
if (isInLevenshteinRange(currentName, matchingName, 1)) {
suspiciousNames.push(matchingName);
removeElementFromArray(uniqueNames, matchingName);
removeElementFromArray(uniqueNames, currentName);
i--;
j--;
}
}
if (suspiciousNames.length > 0) {
suspiciousNames.push(currentName);
}
}
return similarNamesGroup;
}
I want to find the similarity via Levenshtein distance, not only the lower/uppercase similarity
I already find one of the fastest Levenshtein implementation but it still takes me to 35 mins to get the result of 13000 items list.
removeElementFromArrayfunction is killing your performance because it mutates the array that you're traversing. Remove the 4 lines aftersuspiciousNames.push(matchingName);and test the performance usingconsole.timeandconsole.timeEnd, preferably on a smaller array to begin with.["Jeff", "eff", "effl"]? Also, are you only interested in a Levenshtein distance of 1 or could it be variable?