c# Remove duplicates algorithm LINQ

Question

I have a CSV file with two columns: Student Names, Address.

However, the student names column can contain duplicates. When it does, I need to create a new file holding only the duplicated student names and addresses, and keep going until no file contains a duplicated student name.

For example:

Student Names   Address
John            5 West st.
David           42 Alan st.
John            22 Dees st.
Smith           2 King st.
David           77 Jack st.
John            33 King st.

Should be divided into 3 files like so:

1st File:

Student Names   Address
John            5 West st.
David           42 Alan st.
Smith           2 King st.

2nd File:

Student Names   Address
John            22 Dees st.
David           77 Jack st.

3rd File:

Student Names   Address
John            33 King st.

My plan was to load the file into a DataTable and build a dictionary of Student Names -> Address. However, a Dictionary will not work because the keys are NOT unique. My next idea was to build a list of student names, find the duplicates from there, create a DataTable, and write a file from that. I feel like this is more complicated than it needs to be. I'm pretty sure there must be an easier way in LINQ. Could you help me out or give me some pointers?

Thanks.

You're looking for a Lookup<TKey, TElement>

– Tim Schmelter, Commented Sep 18, 2015 at 15:24
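To illustrate the comment above, here is a minimal sketch of how `ToLookup` could group the addresses by name. The `LookupSketch` class, the `GroupAddresses` method, and the tuple-shaped input rows are hypothetical and not part of the original thread; they only assume the CSV has already been parsed into name/address pairs.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class LookupSketch
{
    // Groups addresses under each student name. Unlike a Dictionary,
    // a lookup can hold several values for the same key.
    public static ILookup<string, string> GroupAddresses(
        IEnumerable<(string Name, string Address)> rows)
    {
        return rows.ToLookup(r => r.Name, r => r.Address);
    }
}
```

Indexing the lookup with a name that is not present returns an empty sequence rather than throwing, which can simplify the code that later walks the groups.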

Sachin Kainth · Accepted Answer · 2015-09-18 15:22:55Z

2

The Dictionary approach is actually quite good, and I would stick with it. Make the name the key of the dictionary and the list of that name's addresses the value. That way you know how many files you need to create: find the name with the most addresses, and that count is the number of files.

Then go through the list of names and write each name's addresses to separate files in sequence: the first address to the first file, the second to the second file, and so on. Once all names have been exhausted, you are done.

In your example above you will have a Dictionary like this:

John -> 5 West st., 22 Dees st., 33 King st.
David -> 42 Alan st., 77 Jack st.    
Smith -> 2 King st.

As @ric said, this will be a Dictionary<string, List<string>>.
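The steps described above could be sketched roughly as follows. This is only one possible reading of the answer: the `RoundRobinSketch` class, the `SplitIntoFiles` method, and the tuple-shaped input are hypothetical names, and the method returns the lines for each file instead of writing to disk.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class RoundRobinSketch
{
    // Returns one list of "name,address" lines per output file.
    // File i receives the i-th address of every name that has one.
    public static List<List<string>> SplitIntoFiles(
        IEnumerable<(string Name, string Address)> rows)
    {
        // Build name -> addresses, keeping each name's addresses in input order.
        var byName = new Dictionary<string, List<string>>();
        foreach (var (name, address) in rows)
        {
            if (!byName.TryGetValue(name, out var addresses))
            {
                addresses = new List<string>();
                byName[name] = addresses;
            }
            addresses.Add(address);
        }

        // The name with the most addresses decides how many files are needed.
        var fileCount = byName.Count == 0 ? 0 : byName.Values.Max(a => a.Count);

        var files = new List<List<string>>();
        for (var i = 0; i < fileCount; i++)
        {
            // ToList() runs the query immediately, so capturing i is safe here.
            files.Add(byName
                .Where(kv => kv.Value.Count > i)
                .Select(kv => kv.Key + "," + kv.Value[i])
                .ToList());
        }
        return files;
    }
}
```

Each inner list can then be passed to something like `File.WriteAllLines`. The order of names within a file follows the dictionary's enumeration order, which is not guaranteed, so sort the lines if the order matters.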

edited Sep 18, 2015 at 15:22

answered Sep 18, 2015 at 15:19

Sachin Kainth


1 Comment

Ric Over a year ago

You mean Dictionary<string, List<string>>? I suppose you would have to get the max count from the list to work out the number of files you would need

Edgar Hernandez · Accepted Answer · 2015-09-18 15:25:51Z

1

Assuming that you have a class like

public class Student
{
    public string Name { get; set; }
    public string Address { get; set; }
}

In LINQ you can group the students by name:

 var students = LoadStudentsFromFile();
 var studentsByName = students.GroupBy(st => st.Name).ToDictionary(g => g.Key, g => g.ToList());

At this point you will have a Dictionary with student names as keys and a list of students as values:

John ->  [{Name: John, Address: 5 West st.}, {Name: John, Address: 22 Dees st.}, {Name: John, Address: 33 King st.}]
David -> [{Name: David, Address: 42 Alan st.}, {Name: David, Address: 77 Jack st.}]
...

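Then you can iterate over the keys and take one student from the end of each list until the lists and the dictionary are empty. Taking from the end avoids shifting the remaining list elements on every removal. Note that this puts each name's last address into the first file, so the files come out in the reverse order of the example in the question.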

 while(studentsByName.Any())
 {
     var uniqueStudents = new List<Student>();
     // Iterate over a snapshot of the keys so that entries can be removed inside the loop.
     foreach(var name in studentsByName.Keys.ToList())
     {
         uniqueStudents.Add(studentsByName[name].Last());
         studentsByName[name].RemoveAt(studentsByName[name].Count - 1);
         if(studentsByName[name].Count == 0)
         {
             studentsByName.Remove(name);
         }
     }

     SaveListOfUniqueStudents(uniqueStudents);
 }

answered Sep 18, 2015 at 15:25

Edgar Hernandez


4 Comments

civic.sir Over a year ago

That's pretty clever, but I'm a little confused about what this method returns: LoadStudentsFromFile(). Is it a DataTable?

Robert McKee Over a year ago

@civic.sir It would be an IEnumerable<Student> (e.g. Student[], List<Student>, IQueryable<Student>, etc.)

Robert McKee Over a year ago

Here is a primitive example:

IEnumerable<Student> LoadStudentsFromFile(string path) { return File.ReadLines(path).Select(x => { var fields = x.Split(','); return new Student { Name = fields[0], Address = fields[1] }; }); }

civic.sir Over a year ago

Great, thanks a lot Limo and Robert. I implemented this algorithm with a few changes and it works now. Thanks again.

Robert McKee · Accepted Answer · 2015-09-18 17:26:39Z

Simple version, assuming the CSV files are simple and comma-separated. It does not handle fields enclosed in double quotes, but it can be extended if you need that:

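// Requires: using System.Collections.Generic; using System.IO; using System.Linq;
IEnumerable<Student> LoadStudentsFromFile(string path)
{
  // Each line is expected to look like "name,address".
  return File.ReadLines(path).Select(x => {
    var fields = x.Split(',');
    return new Student { Name = fields[0], Address = fields[1] };
  });
}
// Takes already formatted CSV lines, since File.WriteAllLines expects strings.
void SaveStudentsToFile(string path, IEnumerable<string> lines)
{
  File.WriteAllLines(path, lines);
}
var students = LoadStudentsFromFile("inputfile.csv");
var studentsByName = students.GroupBy(st => st.Name)
  .ToDictionary(g => g.Key, g => g.ToList());

// The name with the most entries determines the number of output files.
var max = studentsByName.Max(x => x.Value.Count);
for (var x = 0; x < max; x++)
  SaveStudentsToFile("outfile" + x + ".csv",
    studentsByName.Where(s => s.Value.Count > x)
      .Select(s => string.Format("{0},{1}", s.Key, s.Value[x].Address)));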

Mikey Mouse · Accepted Answer · 2015-09-18 15:13:21Z

0

I'd go with something like this: create a class (StudentFileWriter) that holds a writer for a CSV file and a list of the names in that file. Whenever you write to the file, add the name to the list.

Create a List of StudentFileWriters.

Then read your file one line at a time and check whether the first StudentFileWriter's ListOfNames.Contains(string newNameToInsert). If true, go to the next one; if there isn't a next one, create one and write to its new file. If false, just write to its file.

You could probably write it in a big complex bit of Linq too with Groupings/Rankings, etc but this way it should be easy to debug and see what's going on.
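A rough sketch of this idea is below. The answer names only `StudentFileWriter` and its list of names; everything else here (the `TryWrite` and `Split` methods, the `StudentFileSplitter` class, the `createWriter` factory, and the use of a `HashSet` instead of a list) is an assumption made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: one output file plus the names already written to it.
public sealed class StudentFileWriter : IDisposable
{
    private readonly TextWriter _writer;
    private readonly HashSet<string> _names = new HashSet<string>();

    public StudentFileWriter(TextWriter writer) { _writer = writer; }

    // Writes the row and returns true, unless this file already holds the name.
    public bool TryWrite(string name, string address)
    {
        if (!_names.Add(name)) return false;
        _writer.WriteLine(name + "," + address);
        return true;
    }

    public void Dispose() { _writer.Dispose(); }
}

public static class StudentFileSplitter
{
    // createWriter(i) supplies the writer for the i-th output file,
    // e.g. i => new StreamWriter("outfile" + i + ".csv").
    // Returns the number of files that were created.
    public static int Split(
        IEnumerable<(string Name, string Address)> rows,
        Func<int, TextWriter> createWriter)
    {
        var files = new List<StudentFileWriter>();
        try
        {
            foreach (var (name, address) in rows)
            {
                // Use the first file that does not contain this name yet.
                var written = false;
                foreach (var file in files)
                {
                    if (file.TryWrite(name, address)) { written = true; break; }
                }
                if (!written)
                {
                    // Every existing file already has this name: open a new one.
                    var newFile = new StudentFileWriter(createWriter(files.Count));
                    newFile.TryWrite(name, address);
                    files.Add(newFile);
                }
            }
        }
        finally
        {
            foreach (var file in files) file.Dispose();
        }
        return files.Count;
    }
}
```

Passing the writer in through a factory keeps the sketch independent of the file system, so the same code can write to a `StringWriter` when stepping through it in a debugger.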

answered Sep 18, 2015 at 15:13

Mikey Mouse


Alexander · Accepted Answer · 2015-09-18 16:17:50Z

My idea is to create a list of dictionaries, one per output file. We have the Student class (thx @LimoWanKenobi):

public class Student
{
    public string Name { get; set; }
    public string Address { get; set; }
}

Here is my method:

    IEnumerable<IEnumerable<Student>> Process(IEnumerable<Student> students)
    {
        var files = new List<Dictionary<string, Student>>();

        foreach (var student in students)
        {
            var isAdded = false;
            foreach (var file in files)
            {
                if (!file.ContainsKey(student.Name))
                {
                    file.Add(student.Name, student);
                    isAdded = true;
                    break;
                }
            }

            if (!isAdded)
            {
                files.Add(new Dictionary<string, Student>
                {
                    { student.Name, student }
                });
            }
        }

        return files.Select(kvp => kvp.Values);
    }
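Since the question asked about LINQ, the same split can probably also be written as a single query. This is an alternative sketch, not the method above: it numbers each student within its name group and then regroups by that number. The `LinqSplitSketch` class and `SplitIntoFiles` method are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;

public class Student
{
    public string Name { get; set; }
    public string Address { get; set; }
}

public static class LinqSplitSketch
{
    // Returns one list of students per output file, where no list
    // contains the same name twice.
    public static List<List<Student>> SplitIntoFiles(IEnumerable<Student> students)
    {
        return students
            .GroupBy(s => s.Name)
            // Number each student within its name group: 0, 1, 2, ...
            .SelectMany(g => g.Select((s, i) => new { FileIndex = i, Student = s }))
            // Students sharing an index have distinct names, so they can share a file.
            .GroupBy(x => x.FileIndex, x => x.Student)
            .OrderBy(g => g.Key)
            .Select(g => g.ToList())
            .ToList();
    }
}
```

Because `GroupBy` keeps groups in order of first appearance and elements in their original order, the first file receives the first occurrence of each name, matching the layout in the question.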
