
I have:

  • a DataTable (columns AccId and TerrName) containing more than 2,000 rows.
  • a large CSV file (columns AccId and External_ID) containing more than 6 million records.

Now, for each AccId, I need to find its corresponding External_ID in the CSV file.

Currently I am doing this with the code below:

DataTable tblATL = Util.GetTable("ATL", false);
tblATL.Columns.Add("External_ID");

DataTable tbl = Util.CsvToTable("TT.csv", true);

foreach (DataRow columnRow in tblATL.Rows)
{
    var query = tbl.Rows.Cast<DataRow>().FirstOrDefault(x => x.Field<string>("AccId") == columnRow["AccId"].ToString());
    if (query != null)
    {
        columnRow["External_ID"] = query.Field<string>("External_ID");
    }
    else
    {
        columnRow["External_ID"] = "New";
    }
}

This code works correctly; the only problem is performance — it takes a very long time to produce the result.

Please help. How can I improve its performance? Do you have any other approach?

  • can you give example headers of the csv file? eg fieldnames, their order/ type etc (holding 6M records in memory will always be slower) Commented Jun 15, 2016 at 13:52
  • If you are loading the entirety of the csv file in memory, PLinq is always an option. Commented Jun 15, 2016 at 13:54
  • @BugFinder: All columns are of string type without a specific order. AccId,External_ID 001P000000eHknBIAS,303363IN 001U000001bU0Q6IAK,303063IN Commented Jun 15, 2016 at 14:02

1 Answer


I suggest organizing the data into a dictionary, say Dictionary&lt;String, String[]&gt;, which gives O(1) lookups, e.g.

  Dictionary<String, String[]> Externals = File
    .ReadLines(@"C:\MyFile.csv")
    .Skip(1) // skip the header row, if the file has one
    .Select(line => line.Split(',')) // the simplest, just to show the idea
    .ToDictionary(
      items => items[0], // AccId is the 1st column (per the question's comments)
      items => items     // or whatever record representation
    );

  ....

  String accId = ...

  String[] items = Externals[accId]; // items[1] is the External_ID
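To connect this back to the question's loop, here is a sketch of the full replacement (it assumes the CSV column order AccId,External_ID from the question's comments, a header row, and the Util.GetTable / "New" fallback behavior shown in the question; GroupBy is used so duplicate AccIds don't break ToDictionary):

```csharp
// Build the lookup once: a single O(n) pass over the 6M-row CSV.
Dictionary<string, string> externals = File
    .ReadLines(@"TT.csv")
    .Skip(1)                          // skip the header row
    .Select(line => line.Split(','))
    .GroupBy(items => items[0])       // tolerate duplicate AccIds...
    .ToDictionary(
        g => g.Key,
        g => g.First()[1]);           // ...keeping the first External_ID seen

DataTable tblATL = Util.GetTable("ATL", false);
tblATL.Columns.Add("External_ID");

// Each row is now an O(1) dictionary probe instead of an O(n) table scan.
foreach (DataRow row in tblATL.Rows)
{
    row["External_ID"] = externals.TryGetValue(row["AccId"].ToString(), out string externalId)
        ? externalId
        : "New";
}
```

This turns the original O(rows × csv) nested scan into O(csv + rows) overall.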

EDIT: if the same AccId can appear more than once (see comments below) you have to deal with duplicates, e.g.

 var csv = File
   .ReadLines(@"C:\MyFile.csv")
   .Skip(1) // skip the header row, if the file has one
   .Select(line => line.Split(',')); // the simplest, just to show the idea

 Dictionary<String, String[]> Externals = new Dictionary<String, String[]>();

 foreach (var items in csv) {
   var key = items[0]; // AccId is the 1st column
   var value = items;  // or whatever record representation

   if (!Externals.ContainsKey(key))
     Externals.Add(key, value); // keep the first occurrence only
   // else {
   //   //TODO: implement, if you want to deal with duplicates in some other way
   // }
 }
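On newer runtimes (.NET Core 2.0 and later), the ContainsKey/Add pair in the loop can be collapsed with Dictionary&lt;TKey, TValue&gt;.TryAdd, which returns false and changes nothing when the key already exists — a sketch under the same column-order assumption:

```csharp
var externals = new Dictionary<string, string[]>();

foreach (var items in File.ReadLines(@"C:\MyFile.csv").Skip(1).Select(l => l.Split(',')))
{
    // TryAdd is a no-op for an existing key, so the
    // first occurrence of each AccId wins.
    externals.TryAdd(items[0], items);
}
```

This also avoids hashing each key twice (once in ContainsKey, once in Add).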

3 Comments

Let me implement it.
Currently I am facing an issue with the data: the file can contain duplicate AccId values with different External_ID values, and I need to take the first occurrence. Dictionary is throwing an exception on the duplicate key, as expected.
@Avijit: in that case you have to deal with duplicates (see my edit) and the simplest ToDictionary() will not do.
