
I am developing a system that processes sequential files generated by COBOL systems. Currently I do the data processing with several substrings to extract the fields, but I wonder whether there is a more efficient way to process the file than making all of those substring calls...

Right now I basically do:

using (var sr = new StreamReader("file.txt"))
{
    String line = "";
    while(!sr.EndOfStream)
    {
        line = sr.ReadLine();
        switch(line[0])
        {
            case '0':
                processType0(line);
                break;
            case '1':
                processType1(line);
                break;
            case '2':
                processType2(line);
                break;
            case '9':
                processType9(line);
                break;
        }
    }
}

private void processType0(string line)
{
    type = line.Substring(0, 15);
    name = line.Substring(15, 30);
    //... and 20 more substrings
}

private void processType1(string line)
{
    // 45 substrings...
}

The file size may vary between 50 MB and 150 MB... A small example of the file:

01ARQUIVO01CIVDSUQK       00000000000000999999NAME NAME NAME NAME           892DATAFILE       200616        KY0000853                                                                                                                                                                                                                                                                                     000001
1000000000000000000000000999904202589ESMSS59365        00000010000000000000026171900000000002            0  01000000000001071600000099740150000000001N020516000000000000000000000000000000000000000000000000000000000000009800000000000000909999-AAAAAAAAAAAAAAAAAAAAAAAAA                                                            00000000                                                            000002
1000000000000000000000000861504202589ENJNS63198        00000010000000000000036171300000000002            0  01000000000001071600000081362920000000001N020516000000000000000000000000000000000000000000000000000000000000009800000000000000909999-BBBBBBBBBBBBBBBBBBBBBBBBBB                                                           00000000                                                            000003
9                                                                                                                                                                                                                                                                                                                                                                                                         000004
  • Efficient? As in the code runs faster? Or the actual process of writing the code is more efficient? Commented Jun 20, 2016 at 14:31
  • Haven't tried this myself, but try this stackoverflow.com/a/20803/1105235 Commented Jun 20, 2016 at 14:33
  • @TazbirBhuiyan any string manipulation generates unnecessary temporary strings. Besides, in fixed-width formats whitespace is significant Commented Jun 20, 2016 at 14:41
  • @Alexandre are you looking for efficient code performance, or an efficient code-writing process here? Or both? Commented Jun 20, 2016 at 15:30
  • Your records look to be fixed-length. Presumably C# has some type of "structure" which maps data? The search engine seems to think so. Commented Jun 20, 2016 at 16:03

4 Answers


Frequent disk reads will slow down your code.

According to MSDN, the buffer size for the constructor you are using is 1024 bytes. Set a larger buffer size using a different constructor:

int bufferSize = 1024 * 128;

using (var reader = new StreamReader(path, encoding, autoDetectEncoding, bufferSize))
{
   ...
}

Strings in .NET are immutable, so every String method (Substring, Trim, and so on) allocates a new string.

Do you really need all of those substrings? If not, then just generate the ones you need:

private static string GetType(string line)
{
    return line.Substring(0, 15);
}

if (needed)
    type = GetType(line);
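
For what it's worth, on .NET versions that support ReadOnlySpan<char> you can avoid most of the temporary strings entirely by slicing the line as a span and only materializing the values you actually keep. A minimal sketch, assuming span support is available (the offsets and field names are invented for illustration):

// Sketch only: assumes ReadOnlySpan<char> is available; offsets and names are illustrative.
private static void ProcessType0(string line)
{
    ReadOnlySpan<char> span = line.AsSpan();

    // Slicing a span does not allocate a new string.
    ReadOnlySpan<char> typeField = span.Slice(0, 15);
    ReadOnlySpan<char> nameField = span.Slice(15, 30);

    // Convert to a string (one allocation) only when the value is actually needed.
    string name = nameField.Trim().ToString();
}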

1 Comment

In my experience this usually makes very little difference as the disk subsystem is normally fairly well buffered before the data even gets to the stream reader code. But it is certainly worth a try.

You could try writing a parser which processes the file one character at a time.

I read a good article titled 'Writing a parser for CSV data' on how to do this with CSV files the other day; it happens to use CSV, but the principles are the same for most file types. It can be found here: http://www.boyet.com/articles/csvparser.html
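
To give a rough idea of the shape such a parser could take, here is a minimal sketch that walks a fixed-width line one character at a time (the field widths are invented for illustration, and it needs using System.Collections.Generic and using System.Text):

// Sketch only: field widths are illustrative, not the real COBOL layout.
private static List<string> ParseFixedWidthLine(string line, int[] fieldWidths)
{
    var fields = new List<string>(fieldWidths.Length);
    int pos = 0;

    foreach (int width in fieldWidths)
    {
        var sb = new StringBuilder(width);

        // Consume exactly 'width' characters, stopping early if the line is short.
        for (int i = 0; i < width && pos < line.Length; i++, pos++)
        {
            sb.Append(line[pos]);
        }

        fields.Add(sb.ToString().TrimEnd());
    }

    return fields;
}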

4 Comments

The fields have fixed starting positions and fixed widths. Where does parsing of a CSV come into it?
CSV is not the important thing here; it is the use of a parser that is the important thing. The article I referenced just happens to use CSV as an example to demonstrate the principles of parsing.
You mean "parse sourcefield namea (length) nameb (length) namec (length) with some optional displacements/offsets"? In the end that is the same as what was started out with, just a one-(long)-liner. I'm really not sure where you are going with this, but you have adherents :-)
The alternative being presented is to copy a row from the stream to a new string, copy field 1 to a new string, copy field 2 to a new string, copy field 3 to a new string, etc., then truncate white space on field 1, field 2, field 3 and so on, copying to yet another new string each time. That is a lot of memory allocations and string copy operations. I admit, however, that the performance gain from using a parser-type structure is likely to be small.

First time with C#, but I think you want to look at something like:

unsafe struct typeOne {
    fixed byte recordType[1];
    fixed byte whatThisFieldIsCalled[10];
    fixed byte someOtherFieldName[5];
    ...
}

And then just pick which struct to fill based on the line[0] case. Or, knowing next to nada about C#, that could be in the completely wrong ballpark and end up being a poor performer internally.
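
A managed alternative in the same spirit (no unsafe fixed buffers) would be to map each record type to a small class whose factory knows the offsets. This is only a sketch; the field names, offsets and widths below are made up:

// Sketch only: field names, offsets and widths are invented for illustration.
class TypeOneRecord
{
    public string RecordType { get; private set; }
    public string Account    { get; private set; }
    public string Name       { get; private set; }

    public static TypeOneRecord FromLine(string line)
    {
        return new TypeOneRecord
        {
            RecordType = line.Substring(0, 1),
            Account    = line.Substring(1, 10).TrimEnd(),
            Name       = line.Substring(11, 30).TrimEnd(),
        };
    }
}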



I love LINQ

IEnumerable<string> ReadFile(string path)
{
    using (var reader = new StreamReader(path))
    {
        while (!reader.EndOfStream)
        {
            yield return reader.ReadLine();
        }
    }
}


void DoThing()
{
    var myMethods = new Action<string>[]
    {
        line =>
        {
            // Process 0
            var type = line.Substring(0, 15);
            var name = line.Substring(15, 30);
            //... and 20 more substrings
        },
        line =>
        {
            // Process 1
            var type = line.Substring(0, 15);
            var name = line.Substring(15, 30);
            //... and 20 more substrings
        },
        //...
    };

    var actions = ReadFile(@"c:\path\to\file.txt")
        .Select(line => new Action(() => myMethods[line[0] - '0'](line)))
        .ToArray();

    foreach (var action in actions)
        action.Invoke();
}

4 Comments

This won't improve performance at all. In any case, File.ReadLines does the same as ReadFile here (see the sketch after these comments).
@PanagiotisKanavos Is the asker looking for algorithmic optimization only?
When reading 150 MB of data, the question isn't about algorithms at all, it's about speed and memory. BTW an array/dictionary of Regex objects indexed by the first character would really help. In any case though, the OP is asking about COBOL file parsing. I suspect there is at least one library for this.
@PanagiotisKanavos I agree with you. If you think this answer adds any value or might be beneficial for future readers without confusing them, I will keep it. Otherwise I am willing to delete it. Your call.
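
For reference, the File.ReadLines replacement the first comment mentions would look roughly like this; it is only a sketch, reusing the placeholder path and the myMethods dispatch table from the answer above:

// Sketch only: File.ReadLines enumerates the file lazily, like the ReadFile iterator above.
foreach (var line in File.ReadLines(@"c:\path\to\file.txt"))
{
    myMethods[line[0] - '0'](line);   // dispatch on the record type in column 0
}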
