3

For my current project in C#, I am tasked with fetching customer details from a data source, 'cleansing' said customers (making sure the name is capitalised correctly, mobile formatted correctly, etc.), and then finding duplicate contacts and grouping them together. After grouping, the data is sent to another source where it handles merging duplicate customers.

I have successfully fetched and cleansed customers, but am now stuck on how I should be finding and grouping duplicates. Duplicates are defined as customer objects that have either:

  • the same mobile, OR
  • the same email

For instance, take the following dummy data (which is stored in a List<Customer> and each object is a Customer object):

Object 1
FName: Taylor
LName: Doe
Email: (empty)
Mobile: 0400111222

Object 2
FName: (empty)
LName: Doe
Email: [email protected]
Mobile: (empty)

Object 3
FName: John
LName: Smith
Email: [email protected]
Mobile: 0400999888

Object 4
FName: Taylor
LName: Doe
Email: [email protected]
Mobile: 0411222333

Object 5
FName: Joh
LName: Smith
Email: [email protected]
Mobile: 0400999887

Object 6
FName: (empty)
LName: (empty)
Email: [email protected]
Mobile: 0400111222

Object 7
FName: Taylor
LName: Doe
Email: [email protected]
Mobile: 0411222333

Object 8
FName: Jane
LName: Johnson
Email: [email protected]
Mobile: 0400789789

The algorithm should then group objects 1, 2, 4, 6 and 7 together as there is a common email OR common mobile that links these together. Objects 3 and 5 should be grouped together, as there is a common email, and object 8 had no duplicates found.

All other parameters are ignored, such as first name and last name, as from looking at the data, it is quite common that customers misspell their details. Originally, the system in which this data is stored did not force customers to enter information, so some contacts have missing names, emails or mobiles (which is why we are looking at mobile OR email, instead of mobile AND email). At some point this was updated to ensure both fields are populated, but on older customer contacts, the data is still missing.

What is the best way to approach this? It's important to note that the initial pull of data from the source will generate around 46,000+ customers which need to be sorted through.

The end format of the grouping is currently undefined - whether it's a list of list of objects, hashset, dictionary, etc. it doesn't really matter. Whatever algorithm ends up working we can adjust other processes to read the result.

Based on some quick research, I think hierarchical clustering may be an option... Could anyone potentially provide some insight as where it might be best to start? Just feeling a bit lost as to which direction to look in.

7
  • codeproject.com/articles/… / visualstudiomagazine.com/articles/2023/12/01/… Commented Oct 27 at 4:40
  • "The algorithm should then group objects 1, 2, 4, 6 and 7 together as there is a common email OR common mobile" not true for 1 and 2 as far as I can see? Commented Oct 27 at 6:28
  • "Email: (empty)" means there is the string (empty) or just "" (the empty string)? Commented Oct 27 at 6:31
  • @Fildor 1 and 6 have same mobile, 6 and 7 have same email, 7 and 4 have same mobile, 4 and 2 have same email. thus objects number 1, 2, 4, 6, 7 are in one group. Commented Oct 27 at 6:33
  • @CDnX Exactly. That should raise an eyebrow shouldn't it? Commented Oct 27 at 6:35

3 Answers 3

3

Fun problem! Here's another solution:

var byMobile = customers.Where(c => c.Mobile != "").ToLookup(c => c.Mobile);
var byEmail = customers.Where(c => c.Email != "").ToLookup(c => c.Email);

var groups = new List<HashSet<Customer>>();

foreach (var customer in customers) {
    if (groups.Any(g => g.Contains(customer)))
       continue; // skip customers which have already been grouped

    var group = new HashSet<Customer> { customer };

    FillGroupWithMatching(customer, group);

    groups.Add(group);
}

void FillGroupWithMatching(Customer customer, HashSet<Customer> group) {
    foreach (var c in byMobile[customer.Mobile].Concat(byEmail[customer.Email])) {
        if (group.Add(c)) // only recurse when c is new
            FillGroupWithMatching(c, group);
    }
}

https://dotnetfiddle.net/dvI9Wk

Not sure how it performs with 40k customers, but I believe the bulk of the work is done by the ToLookup() calls. Assuming only small groups of customers with the same email/mobile, the recursive looping shouldn't be a problem performance-wise or cause any stackoverflows.

However, the check groups.Any(g => g.Contains(customer)) will progressively get slower the more groups there are. It can be made O(1) by introducing a bool flag on Customer (e.g. HasBeenGrouped) which is set to true when a customer was added to a group (i.e. when group.Add(c) returns true).

Sign up to request clarification or add additional context in comments.

1 Comment

I tested your code with some edge cases and stack got overflown on group with 6000+ entities. It seems it wouldn't be matter on questioner's dataset tho.
3

With this naive code and some magic of LINQ AsParallel(), I successfully grouped 100000 random entities in about 4 minutes of time and 350 MB of memories, on my laptop.

I think your dataset is not big enough to require data science wizardary.

using System.Text;

string[] emailProviders = ["gmail.com", "outlook.com", "icloud.com", "yahoo.com", "company.mail"];
char[] validChars = [.. "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"];

Random rand = new();

IEnumerable<string> GetRandEmails()
{
    StringBuilder sb = new();

    while (true)
    {
        sb.Clear();
        foreach (var ch in Enumerable.Repeat(validChars, rand.Next(10, 20)).Select(arr => arr[rand.Next(arr.Length)]))
            sb.Append(ch);
        sb.Append('@');
        sb.Append(emailProviders[rand.Next(emailProviders.Length)]);
        yield return sb.ToString();
    }
}

string[] emails = [.. GetRandEmails().Take(1000000)];
string[] phones = [.. Enumerable.Repeat("0123456789".ToArray(), 362880).Select(s => { rand.Shuffle(s); return new string(s); })];

(string?, string?, string)[] data = [.. Enumerable.Repeat(string.Empty, 100000)
    .Select(_ => rand.Next(5))
    .Select(i => (
        i == 0 ? null : emails[rand.Next(emails.Length)],
        i == 1 ? null : phones[rand.Next(phones.Length)],
        new string([.. Enumerable.Repeat(validChars, rand.Next(10, 20)).Select(arr => arr[rand.Next(arr.Length)])])
    ))];

Dictionary<string, HashSet<string>> emailToPhoneDict = [];
Dictionary<string, HashSet<string>> emailToNameDict = [];

Dictionary<string, HashSet<string>> noEmailData = [];

foreach (var datum in data)
{
    if (datum.Item1 is null)
    {
        if (noEmailData.TryGetValue(datum.Item2!, out HashSet<string>? noEmailNameSet))
            noEmailNameSet.Add(datum.Item3);
        else
            noEmailData[datum.Item2!] = [datum.Item3];

        continue;
    }

    if (datum.Item2 is not null)
    {
        if (emailToPhoneDict.TryGetValue(datum.Item1, out HashSet<string>? emailPhoneSet))
            emailPhoneSet.Add(datum.Item2);
        else
            emailToPhoneDict[datum.Item1] = [datum.Item2];
    }

    if (emailToNameDict.TryGetValue(datum.Item1, out HashSet<string>? emailNameSet))
        emailNameSet.Add(datum.Item3);
    else
        emailToNameDict[datum.Item1] = [datum.Item3];
}

HashSet<(HashSet<string>, HashSet<string>)> groups = [];

foreach (var kv in emailToPhoneDict)
{
    (HashSet<string>, HashSet<string>) targetGroup;

    var groupsToBeMerged = groups.AsParallel().Where(g => g.Item2.Overlaps(kv.Value)).ToArray();

    if (groupsToBeMerged.Length == 0)
    {
        groups.Add(([kv.Key], kv.Value));
        continue;
    }
    else if (groupsToBeMerged.Length == 1)
        targetGroup = groupsToBeMerged.Single();
    else
    {
        groups.ExceptWith(groupsToBeMerged);

        targetGroup = ([], []);
        foreach (var grp in groupsToBeMerged)
        {
            targetGroup.Item1.UnionWith(grp.Item1);
            targetGroup.Item2.UnionWith(grp.Item2);

            grp.Item1.Clear();
            grp.Item2.Clear();
        }

        groups.Add(targetGroup);
    }

    targetGroup.Item1.Add(kv.Key);
    targetGroup.Item2.UnionWith(kv.Value);
    kv.Value.Clear();
}

emailToPhoneDict.Clear();

var result = groups.Select(g => new { Emails = g.Item1, Phones = g.Item2, Names = new HashSet<string>() }).ToList();

foreach (var kv in emailToNameDict)
{
    var g = result.AsParallel().Where(r => r.Emails.Contains(kv.Key)).SingleOrDefault();
    if (g is not null)
    {
        g.Names.UnionWith(kv.Value);
        kv.Value.Clear();
    }
    else
        result.Add(new { Emails = new HashSet<string>([kv.Key]), Phones = new HashSet<string>(), Names = kv.Value });
}

emailToNameDict.Clear();

foreach (var kv in noEmailData)
{
    var g = result.AsParallel().Where(r => r.Phones.Contains(kv.Key)).SingleOrDefault();
    if (g is not null)
    {
        g.Names.UnionWith(kv.Value);
        kv.Value.Clear();
    }
    else
        result.Add(new { Emails = new HashSet<string>(), Phones = new HashSet<string>([kv.Key]), Names = kv.Value });
}

noEmailData.Clear();

return;

3 Comments

Might be good to add some explanation to the naive code
Have you tried that on OP's input data and got the expected groupings?
yup. it works.
0

With double lookup idea of Good Night Nerd Pride, I wrote new solution which is way way faster than previous entries, with much simpler code(than my previous one).

I tried 1,000,000 random entities(which is x10 more entities than previous code) and it ran successfully in 3m40s of time and 580MB of memory. It is faster than my previous code in a level of exponent.

using System.Text;

string[] emailProviders = ["gmail.com", "outlook.com", "icloud.com", "yahoo.com", "company.mail"];
char[] validChars = [.. "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"];

Random rand = new();

IEnumerable<string> GetRandEmails()
{
    StringBuilder sb = new();

    while (true)
    {
        sb.Clear();
        foreach (var ch in Enumerable.Repeat(validChars, rand.Next(10, 20)).Select(arr => arr[rand.Next(arr.Length)]))
            sb.Append(ch);
        sb.Append('@');
        sb.Append(emailProviders[rand.Next(emailProviders.Length)]);
        yield return sb.ToString();
    }
}

string[] emails = [.. GetRandEmails().Take(1000000)];
string[] phones = [.. Enumerable.Repeat("0123456789".ToArray(), 362880).Select(s => { rand.Shuffle(s); return new string(s); })];
Customer[] data = [.. Enumerable.Repeat(string.Empty, 1000000)
    .Select(_ => rand.Next(5))
    .Select(i => new Customer(
        i == 0 ? null : emails[rand.Next(emails.Length)],
        i == 1 ? null : phones[rand.Next(phones.Length)],
        new string([.. Enumerable.Repeat(validChars, rand.Next(10, 20)).Select(arr => arr[rand.Next(arr.Length)])])
    ))];

/*
Customer[] data = [
    new(null,"0400111222","Taylor Doe"),
    new("[email protected]",null,"Doe"),
    new("[email protected]","0400999888","John Smith"),
    new("[email protected]","0411222333","Taylor Doe"),
    new("[email protected]","0400999887","Joh Smith"),
    new("[email protected]","0400111222",""),
    new("[email protected]","0411222333","Taylor Doe"),
    new("[email protected]","0400789789","Jane Johnson"),
];
*/

HashSet<Customer> dataSet = [.. data];

var emailLookUp = data.Where(d => d.Email is not null).ToLookup(d => d.Email!);
var phoneLookUp = data.Where(d => d.Phone is not null).ToLookup(d => d.Phone!);

List<HashSet<Customer>> groups = [];

while(dataSet.Count > 0)
{
    var d = dataSet.First();
    dataSet.Remove(d);

    HashSet<Customer> group = [d];
    groups.Add(group);

    HashSet<string> groupEmails = [];
    if (d.Email is not null)
        groupEmails.Add(d.Email);

    HashSet<string> groupPhones = [];
    if (d.Phone is not null)
        groupPhones.Add(d.Phone);

    List<string> emailsNextIter = [.. groupEmails];
    List<string> phonesNextIter = [.. groupPhones];

    while (emailsNextIter.Count > 0 || phonesNextIter.Count > 0)
    {
        List<string> newEmails = [];
        List<string> newPhones = [];

        foreach (string email in emailsNextIter)
        {
            foreach (Customer customer in emailLookUp[email])
            {
                if (dataSet.Remove(customer) && group.Add(customer) && customer.Phone is not null && groupPhones.Add(customer.Phone))
                    newPhones.Add(customer.Phone);
            }
        }

        foreach (string phone in phonesNextIter)
        {
            foreach (Customer customer in phoneLookUp[phone])
            {
                if (dataSet.Remove(customer) && group.Add(customer) && customer.Email is not null && groupEmails.Add(customer.Email))
                    newEmails.Add(customer.Email);
            }
        }

        emailsNextIter = newEmails;
        phonesNextIter = newPhones;
    }
}

return;

readonly record struct Customer(string? Email, string? Phone, string Name);

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.