For my current project in C#, I am tasked with fetching customer details from a data source, 'cleansing' them (making sure names are capitalised correctly, mobile numbers are formatted correctly, etc.), and then finding duplicate contacts and grouping them together. After grouping, the data is sent to another source, which handles merging the duplicate customers.
I have successfully fetched and cleansed customers, but am now stuck on how I should be finding and grouping duplicates. Duplicates are defined as customer objects that have either:
- the same mobile, OR
- the same email
For instance, take the following dummy data (stored in a List<Customer>, where each object is a Customer):
Object 1
FName: Taylor
LName: Doe
Email: (empty)
Mobile: 0400111222
Object 2
FName: (empty)
LName: Doe
Email: [email protected]
Mobile: (empty)
Object 3
FName: John
LName: Smith
Email: [email protected]
Mobile: 0400999888
Object 4
FName: Taylor
LName: Doe
Email: [email protected]
Mobile: 0411222333
Object 5
FName: Joh
LName: Smith
Email: [email protected]
Mobile: 0400999887
Object 6
FName: (empty)
LName: (empty)
Email: [email protected]
Mobile: 0400111222
Object 7
FName: Taylor
LName: Doe
Email: [email protected]
Mobile: 0411222333
Object 8
FName: Jane
LName: Johnson
Email: [email protected]
Mobile: 0400789789
The algorithm should then group objects 1, 2, 4, 6 and 7 together, as a common email OR a common mobile links them (directly or through another object in the group). Objects 3 and 5 should be grouped together, as they share an email, and object 8 has no duplicates.
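One way to read the grouping rule above is as a connected-components problem: each customer is a node, and two customers are linked when they share a non-empty email or a non-empty mobile. The following is a minimal sketch of that idea using a union-find (disjoint-set) structure, not a definitive implementation. The Customer record shape, the Id field, the DuplicateGrouper name and the case-insensitive email comparison are all assumptions made for illustration; the real Customer class is not shown here. It also assumes "(empty)" means null, empty or whitespace, and that records need C# 9 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape: the real Customer class is not shown in the question.
public record Customer(int Id, string FName, string LName, string Email, string Mobile);

public static class DuplicateGrouper
{
    // Groups customers that are linked, directly or transitively, by a shared
    // non-empty email or mobile (connected components via union-find).
    public static List<List<Customer>> Group(IReadOnlyList<Customer> customers)
    {
        // parent[i] == i means index i is the root of its group.
        var parent = Enumerable.Range(0, customers.Count).ToArray();

        int Find(int i)
        {
            while (parent[i] != i)
            {
                parent[i] = parent[parent[i]]; // path halving keeps chains short
                i = parent[i];
            }
            return i;
        }

        void Union(int a, int b) => parent[Find(a)] = Find(b);

        // First index seen for each key. The "E:" / "M:" prefixes keep emails
        // and mobiles in separate namespaces. Case-insensitive matching is an
        // assumption about how emails should compare.
        var firstSeen = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);

        for (int i = 0; i < customers.Count; i++)
        {
            var keys = new[] { ("E:", customers[i].Email), ("M:", customers[i].Mobile) };
            foreach (var (prefix, value) in keys)
            {
                // An empty field must never link two customers together.
                if (string.IsNullOrWhiteSpace(value)) continue;

                var key = prefix + value.Trim();
                if (firstSeen.TryGetValue(key, out int other)) Union(i, other);
                else firstSeen[key] = i;
            }
        }

        // Collect indices by root; groups appear in order of their first member.
        return Enumerable.Range(0, customers.Count)
            .GroupBy(i => Find(i))
            .Select(g => g.Select(i => customers[i]).ToList())
            .ToList();
    }
}
```

Calling DuplicateGrouper.Group(customers) on the cleansed list returns a list of lists, where a list with a single element is a customer with no duplicates. Each customer is visited once and each key costs a dictionary lookup plus a near-constant-time union, so the work grows roughly linearly with the number of customers.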
All other parameters are ignored, such as first name and last name, as from looking at the data, it is quite common that customers misspell their details. Originally, the system in which this data is stored did not force customers to enter information, so some contacts have missing names, emails or mobiles (which is why we are looking at mobile OR email, instead of mobile AND email). At some point this was updated to ensure both fields are populated, but on older customer contacts, the data is still missing.
What is the best way to approach this? It's important to note that the initial pull of data from the source will generate around 46,000+ customers which need to be sorted through.
The end format of the grouping is currently undefined. Whether it's a list of lists of objects, a hash set, a dictionary, etc. doesn't really matter. Whatever algorithm ends up working, we can adjust other processes to read the result.
Based on some quick research, I think hierarchical clustering may be an option... Could anyone provide some insight as to where it might be best to start? I'm feeling a bit lost as to which direction to look in.