
I converted an Excel file into a CSV file. The file contains over 100k records. I want to search the full-name column and return duplicate rows: if two full names match, the program should return the entire rows of the duplicates. I started with code that returns a list of full names, but that's about it.

I've listed the code that I have now below:

public static void readCells()
{
    var dictionary = new Dictionary<string, int>();

    Console.WriteLine("started");
    var readText = File.ReadAllLines(path);

    foreach (var s in readText)
    {
        var values = s.Split(',');
        var fullName = values[3];
        if (!dictionary.ContainsKey(fullName))
        {
            dictionary.Add(fullName, 1);
        }
        else
        {
            dictionary[fullName] += 1;
        }
        Console.WriteLine("Full Name Is: " + fullName);
    }
}
  • I don't see a question in any of this. Commented Dec 15, 2017 at 16:21
  • Can you extend this to a minimal reproducible example by including a CSV sample? Commented Dec 15, 2017 at 16:49

3 Answers


I changed the dictionary to use the full name as its key:

public static void readCells()
{
    // Key: the full name; value: every complete row that carries that name.
    var dictionary = new Dictionary<string, List<List<string>>>();

    Console.WriteLine("started");
    var readText = File.ReadAllLines(path);

    foreach (var s in readText)
    {
        List<string> values = s.Split(',').ToList();
        string fullName = values[3];
        if (!dictionary.ContainsKey(fullName))
        {
            dictionary.Add(fullName, new List<List<string>> { values });
        }
        else
        {
            dictionary[fullName].Add(values);
        }
        Console.WriteLine("Full Name Is: " + fullName);
    }

    // Entries whose list holds more than one row are the duplicates.
    var duplicatedValues = dictionary.Where(entry => entry.Value.Count > 1);
}
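With the full name as key and every matching row stored in the value list, the duplicates are simply the entries whose list holds more than one row. A minimal, self-contained sketch of that last step (the class name and sample rows are made up for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class DuplicateRowDemo
{
    static void Main()
    {
        // Hypothetical sample of the structure built above: full name -> rows.
        var dictionary = new Dictionary<string, List<List<string>>>
        {
            ["Tom Jones"] = new List<List<string>>
            {
                new List<string> { "1", "x", "y", "Tom Jones" },
                new List<string> { "2", "p", "q", "Tom Jones" },
            },
            ["Ann Lee"] = new List<List<string>>
            {
                new List<string> { "3", "r", "s", "Ann Lee" },
            },
        };

        // A name is duplicated when more than one row shares it.
        foreach (var entry in dictionary.Where(e => e.Value.Count > 1))
        {
            Console.WriteLine("Duplicate full name: " + entry.Key);
            foreach (var row in entry.Value)
                Console.WriteLine("  " + string.Join(",", row));
        }
    }
}
```

Only "Tom Jones" is printed here, since "Ann Lee" appears in a single row.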



I've found that Microsoft's built-in TextFieldParser (which you can use from C# despite it living in the Microsoft.VisualBasic.FileIO namespace) simplifies reading and parsing CSV files.

Using this type, your readCells() method can be reworked into the following helper class (with a small ReadAllFields() extension method):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Microsoft.VisualBasic.FileIO;

public static class TextFieldParserExtensions
{
    public static List<IGrouping<string, string[]>> ReadCellsWithDuplicatedCellValues(string path, int keyCellIndex, int nRowsToSkip = 0)
    {
        using (var stream = File.OpenRead(path))
        using (var parser = new TextFieldParser(stream))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(new string[] { "," });
            var values = parser.ReadAllFields()
                // If your CSV file contains header row(s) you can skip them by passing a value for nRowsToSkip
                .Skip(nRowsToSkip)
                .GroupBy(row => row.ElementAtOrDefault(keyCellIndex))
                .Where(g => g.Count() > 1)
                .ToList();
            return values;
        }
    }

    public static IEnumerable<string[]> ReadAllFields(this TextFieldParser parser)
    {
        if (parser == null)
            throw new ArgumentNullException(nameof(parser));
        while (!parser.EndOfData)
            yield return parser.ReadFields();
    }
}

Which you would call like:

var groups = TextFieldParserExtensions.ReadCellsWithDuplicatedCellValues(path, 3);

Notes:

  • TextFieldParser correctly handles cells with escaped, embedded commas which s.Split(new Char[] { ',' }) will not.

  • Since your CSV file has over 100k records I adopted a streaming strategy to avoid the intermediate string[] readText memory allocation.
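Each IGrouping in the returned list carries the duplicated full name as its Key and the complete rows as its elements. A small standalone sketch of consuming such groups (the in-memory rows below stand in for the parsed CSV; the class name is made up):

```csharp
using System;
using System.Linq;

class GroupConsumerDemo
{
    static void Main()
    {
        // In-memory rows standing in for the parsed CSV; column 3 holds the full name.
        string[][] rows =
        {
            new[] { "1", "x", "y", "Tom Jones" },
            new[] { "2", "p", "q", "Tom Jones" },
            new[] { "3", "r", "s", "Ann Lee" },
        };

        // Same shape as the method's return value: groups of duplicated rows.
        var groups = rows.GroupBy(r => r.ElementAtOrDefault(3))
                         .Where(g => g.Count() > 1)
                         .ToList();

        foreach (var group in groups)
        {
            Console.WriteLine("Duplicate full name: " + group.Key);
            foreach (var row in group)
                Console.WriteLine("  " + string.Join(",", row));
        }
    }
}
```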



You can try out Cinchoo ETL - an open-source library to parse CSV files and identify duplicates in a few lines of code.

Sample CSV file (EmpDuplicates.csv) below

Id,Name
1,Tom
2,Mark
3,Lou
3,Lou
4,Austin
4,Austin
4,Austin

Here is how you can parse and identify the duplicate records

using (var parser = new ChoCSVReader("EmpDuplicates.csv").WithFirstLineHeader())
{
    foreach (dynamic c in parser.GroupBy(r => r.Id).Where(g => g.Count() > 1).Select(g => g.FirstOrDefault()))
        Console.WriteLine(c.DumpAsJson());
}

Output:

{
  "Id": 3,
  "Name": "Lou"
}
{
  "Id": 4,
  "Name": "Austin"
}

Hope this helps.

For more detailed usage of this library, visit CodeProject article at https://www.codeproject.com/Articles/1145337/Cinchoo-ETL-CSV-Reader
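Note that the Select(g => g.FirstOrDefault()) above prints one representative per duplicate group. If you want every duplicated row, flatten the groups with SelectMany instead; a plain-LINQ sketch of that idea (no Cinchoo dependency, sample data mirrors EmpDuplicates.csv, class name made up):

```csharp
using System;
using System.Linq;

class FlattenDuplicatesDemo
{
    static void Main()
    {
        // Sample records mirroring EmpDuplicates.csv.
        var records = new[]
        {
            (Id: 1, Name: "Tom"), (Id: 2, Name: "Mark"),
            (Id: 3, Name: "Lou"), (Id: 3, Name: "Lou"),
            (Id: 4, Name: "Austin"), (Id: 4, Name: "Austin"), (Id: 4, Name: "Austin"),
        };

        // SelectMany flattens each duplicate group back into its individual rows.
        var duplicates = records.GroupBy(r => r.Id)
                                .Where(g => g.Count() > 1)
                                .SelectMany(g => g);

        foreach (var r in duplicates)
            Console.WriteLine($"{r.Id},{r.Name}");
        // Prints 3,Lou twice followed by 4,Austin three times.
    }
}
```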

