
I converted an Excel file into a CSV file. The file contains over 100k records. I want to search the full-name column and return duplicate rows: if two full names match, the program should return the entire rows of the duplicates. I started with code that returns a list of full names, but that's about it.

I've listed the code that I have now below:

public static void readCells()
{
    var dictionary = new Dictionary<string, int>();

    Console.WriteLine("started");
    var readText = File.ReadAllLines(path);

    foreach (var s in readText)
    {
        var values = s.Split(',');
        var fullName = values[3];
        if (!dictionary.ContainsKey(fullName))
        {
            dictionary.Add(fullName, 1);
        }
        else
        {
            dictionary[fullName] += 1;
        }
        Console.WriteLine("Full Name Is: " + fullName);
    }
}
  • I don't see a question in any of this. Commented Dec 15, 2017 at 16:21
  • Can you extend this to a minimal reproducible example by including a CSV sample? Commented Dec 15, 2017 at 16:49

3 Answers


I changed the dictionary to use the full name as its key:

public static void readCells()
{
    // Key: the full name; value: every complete row that carries that name.
    var dictionary = new Dictionary<string, List<List<string>>>();

    Console.WriteLine("started");
    var readText = File.ReadAllLines(path);

    foreach (var s in readText)
    {
        List<string> values = s.Split(',').ToList();
        string fullName = values[3];
        if (!dictionary.ContainsKey(fullName))
        {
            dictionary.Add(fullName, new List<List<string>> { values });
        }
        else
        {
            dictionary[fullName].Add(values);
        }
        Console.WriteLine("Full Name Is: " + fullName);
    }

    // Entries whose list holds more than one row are the duplicates.
    var duplicatedValues = dictionary.Where(entry => entry.Value.Count > 1);
}
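With the full name as key and every matching row stored in the value list, the duplicates are simply the entries whose list holds more than one row. A minimal, self-contained sketch of that last step (the class name and sample rows are made up for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class DuplicateRowDemo
{
    static void Main()
    {
        // Hypothetical sample of the structure built above: full name -> rows.
        var dictionary = new Dictionary<string, List<List<string>>>
        {
            ["Tom Jones"] = new List<List<string>>
            {
                new List<string> { "1", "x", "y", "Tom Jones" },
                new List<string> { "2", "p", "q", "Tom Jones" },
            },
            ["Ann Lee"] = new List<List<string>>
            {
                new List<string> { "3", "r", "s", "Ann Lee" },
            },
        };

        // A name is duplicated when more than one row shares it.
        foreach (var entry in dictionary.Where(e => e.Value.Count > 1))
        {
            Console.WriteLine("Duplicate full name: " + entry.Key);
            foreach (var row in entry.Value)
                Console.WriteLine("  " + string.Join(",", row));
        }
    }
}
```

Only "Tom Jones" is printed here, since "Ann Lee" appears in a single row.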



I've found that Microsoft's built-in TextFieldParser (which you can use from C# despite it living in the Microsoft.VisualBasic.FileIO namespace) simplifies reading and parsing CSV files.

Using this type, your readCells() method can be reworked into the following helper class (with a small ReadAllFields() extension method):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Microsoft.VisualBasic.FileIO;

public static class TextFieldParserExtensions
{
    public static List<IGrouping<string, string[]>> ReadCellsWithDuplicatedCellValues(string path, int keyCellIndex, int nRowsToSkip = 0)
    {
        using (var stream = File.OpenRead(path))
        using (var parser = new TextFieldParser(stream))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(new string[] { "," });
            var values = parser.ReadAllFields()
                // If your CSV file contains header row(s) you can skip them by passing a value for nRowsToSkip
                .Skip(nRowsToSkip)
                .GroupBy(row => row.ElementAtOrDefault(keyCellIndex))
                .Where(g => g.Count() > 1)
                .ToList();
            return values;
        }
    }

    public static IEnumerable<string[]> ReadAllFields(this TextFieldParser parser)
    {
        if (parser == null)
            throw new ArgumentNullException(nameof(parser));
        while (!parser.EndOfData)
            yield return parser.ReadFields();
    }
}

Which you would call like:

var groups = TextFieldParserExtensions.ReadCellsWithDuplicatedCellValues(path, 3);

Notes:

  • TextFieldParser correctly handles cells with escaped, embedded commas which s.Split(new Char[] { ',' }) will not.

  • Since your CSV file has over 100k records I adopted a streaming strategy to avoid the intermediate string[] readText memory allocation.
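Each IGrouping in the returned list carries the duplicated full name as its Key and the complete rows as its elements. A small standalone sketch of consuming such groups (the in-memory rows below stand in for the parsed CSV; the class name is made up):

```csharp
using System;
using System.Linq;

class GroupConsumerDemo
{
    static void Main()
    {
        // In-memory rows standing in for the parsed CSV; column 3 holds the full name.
        string[][] rows =
        {
            new[] { "1", "x", "y", "Tom Jones" },
            new[] { "2", "p", "q", "Tom Jones" },
            new[] { "3", "r", "s", "Ann Lee" },
        };

        // Same shape as the method's return value: groups of duplicated rows.
        var groups = rows.GroupBy(r => r.ElementAtOrDefault(3))
                         .Where(g => g.Count() > 1)
                         .ToList();

        foreach (var group in groups)
        {
            Console.WriteLine("Duplicate full name: " + group.Key);
            foreach (var row in group)
                Console.WriteLine("  " + string.Join(",", row));
        }
    }
}
```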



You can try out Cinchoo ETL - an open-source library to parse CSV files and identify duplicates in a few lines of code.

Sample CSV file (EmpDuplicates.csv) below

Id,Name
1,Tom
2,Mark
3,Lou
3,Lou
4,Austin
4,Austin
4,Austin

Here is how you can parse and identify the duplicate records

using (var parser = new ChoCSVReader("EmpDuplicates.csv").WithFirstLineHeader())
{
    foreach (dynamic c in parser.GroupBy(r => r.Id).Where(g => g.Count() > 1).Select(g => g.FirstOrDefault()))
        Console.WriteLine(c.DumpAsJson());
}

Output:

{
  "Id": 3,
  "Name": "Lou"
}
{
  "Id": 4,
  "Name": "Austin"
}

Hope this helps.

For more detailed usage of this library, visit CodeProject article at https://www.codeproject.com/Articles/1145337/Cinchoo-ETL-CSV-Reader
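Note that the Select(g => g.FirstOrDefault()) above prints one representative per duplicate group. If you want every duplicated row, flatten the groups with SelectMany instead; a plain-LINQ sketch of that idea (no Cinchoo dependency, sample data mirrors EmpDuplicates.csv, class name made up):

```csharp
using System;
using System.Linq;

class FlattenDuplicatesDemo
{
    static void Main()
    {
        // Sample records mirroring EmpDuplicates.csv.
        var records = new[]
        {
            (Id: 1, Name: "Tom"), (Id: 2, Name: "Mark"),
            (Id: 3, Name: "Lou"), (Id: 3, Name: "Lou"),
            (Id: 4, Name: "Austin"), (Id: 4, Name: "Austin"), (Id: 4, Name: "Austin"),
        };

        // SelectMany flattens each duplicate group back into its individual rows.
        var duplicates = records.GroupBy(r => r.Id)
                                .Where(g => g.Count() > 1)
                                .SelectMany(g => g);

        foreach (var r in duplicates)
            Console.WriteLine($"{r.Id},{r.Name}");
        // Prints 3,Lou twice followed by 4,Austin three times.
    }
}
```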

