
I am developing a system that processes sequential files generated by COBOL systems. Currently I do the data processing with several substrings to extract the fields, but I wonder whether there is a more efficient way to process the file than making all of those substring calls...

Right now I basically do:

using (var sr = new StreamReader("file.txt"))
{
    String line = "";
    while(!sr.EndOfStream)
    {
        line = sr.ReadLine();
        switch(line[0])
        {
            case '0':
                processType0(line);
                break;
            case '1':
                processType1(line);
                break;
            case '2':
                processType2(line);
                break;
            case '9':
                processType9(line);
                break;
        }
    }
}

private void processType0(string line)
{
    type = line.Substring(0, 15);
    name = line.Substring(15, 30);
    //... and 20 more substrings
}

private void processType1(string line)
{
    // 45 substrings...
}

The file size may vary between 50 MB and 150 MB... A small example of the file:

01ARQUIVO01CIVDSUQK       00000000000000999999NAME NAME NAME NAME           892DATAFILE       200616        KY0000853                                                                                                                                                                                                                                                                                     000001
1000000000000000000000000999904202589ESMSS59365        00000010000000000000026171900000000002            0  01000000000001071600000099740150000000001N020516000000000000000000000000000000000000000000000000000000000000009800000000000000909999-AAAAAAAAAAAAAAAAAAAAAAAAA                                                            00000000                                                            000002
1000000000000000000000000861504202589ENJNS63198        00000010000000000000036171300000000002            0  01000000000001071600000081362920000000001N020516000000000000000000000000000000000000000000000000000000000000009800000000000000909999-BBBBBBBBBBBBBBBBBBBBBBBBBB                                                           00000000                                                            000003
9                                                                                                                                                                                                                                                                                                                                                                                                         000004
  • Efficient? As in the code runs faster? Or the actual process of writing the code is more efficient? Commented Jun 20, 2016 at 14:31
  • Haven't tried this myself, but try this stackoverflow.com/a/20803/1105235 Commented Jun 20, 2016 at 14:33
  • @TazbirBhuiyan any string manipulation generates unnecessary temporary strings. Besides, in fixed-width formats whitespace is significant Commented Jun 20, 2016 at 14:41
  • @Alexandre are you looking for efficient code performance, or an efficient code-writing process here? Or both? Commented Jun 20, 2016 at 15:30
  • Your records look to be fixed-length. Presumably C# has some type of "structure" which maps data? The search engine seems to think so. Commented Jun 20, 2016 at 16:03

4 Answers


Frequent disk reads will slow down your code.

According to MSDN, the buffer size for the constructor you are using is 1024 bytes. Set a larger buffer size using a different constructor:

int bufferSize = 1024 * 128;

using (var reader = new StreamReader(path, encoding, autoDetectEncoding, bufferSize))
{
   ...
}

Strings in .NET are immutable, so every String method (Substring, Trim, and so on) allocates a new string.

Do you really need all of those substrings? If not, then just generate the ones you need:

private static string GetType(string line)
{
    return line.Substring(0, 15);
}

if (needed)
    type = GetType(line);
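
For what it's worth, on .NET versions that support ReadOnlySpan<char> you can avoid most of the temporary strings entirely by slicing the line as a span and only materializing the values you actually keep. A minimal sketch, assuming span support is available (the offsets and field names are invented for illustration):

// Sketch only: assumes ReadOnlySpan<char> is available; offsets and names are illustrative.
private static void ProcessType0(string line)
{
    ReadOnlySpan<char> span = line.AsSpan();

    // Slicing a span does not allocate a new string.
    ReadOnlySpan<char> typeField = span.Slice(0, 15);
    ReadOnlySpan<char> nameField = span.Slice(15, 30);

    // Convert to a string (one allocation) only when the value is actually needed.
    string name = nameField.Trim().ToString();
}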

1 Comment

In my experience this usually makes very little difference as the disk subsystem is normally fairly well buffered before the data even gets to the stream reader code. But it is certainly worth a try.

You could try writing a parser which processes the file one character at a time.

I read a good article titled 'Writing a parser for CSV data' on how to do this with CSV files the other day; it happens to use CSV, but the principles are the same for most file types. It can be found here: http://www.boyet.com/articles/csvparser.html
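
To give a rough idea of the shape such a parser could take, here is a minimal sketch that walks a fixed-width line one character at a time (the field widths are invented for illustration, and it needs using System.Collections.Generic and using System.Text):

// Sketch only: field widths are illustrative, not the real COBOL layout.
private static List<string> ParseFixedWidthLine(string line, int[] fieldWidths)
{
    var fields = new List<string>(fieldWidths.Length);
    int pos = 0;

    foreach (int width in fieldWidths)
    {
        var sb = new StringBuilder(width);

        // Consume exactly 'width' characters, stopping early if the line is short.
        for (int i = 0; i < width && pos < line.Length; i++, pos++)
        {
            sb.Append(line[pos]);
        }

        fields.Add(sb.ToString().TrimEnd());
    }

    return fields;
}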

4 Comments

The fields have fixed starting positions and fixed widths. Where does parsing of a CSV come into it?
CSV is not the important thing here; it is the use of a parser that is the important thing. The article I referenced just happens to use CSV as an example to demonstrate the principles of parsing.
You mean "parse sourcefield namea (length) nameb (length) namec (length) with some optional displacements/offsets"? In the end that is the same as what was started out with, just a one-(long)-liner. I'm really not sure where you are going with this, but you have adherents :-)
The alternative being presented is to copy a row from the stream to a new string, copy field 1 to a new string, copy field 2 to a new string, copy field 3 to a new string, etc., then truncate white space on field 1, field 2, field 3 and so on, copying to yet another new string each time. That is a lot of memory allocations and string copy operations. I admit, however, that the performance gain from using a parser-type structure is likely to be small.

First time with C#, but I think you want to look at something like:

unsafe struct typeOne {
    fixed byte recordType[1];
    fixed byte whatThisFieldIsCalled[10];
    fixed byte someOtherFieldName[5];
    ...
}

And then just pick which struct to fill based on the line[0] case. Or, knowing next to nada about C#, that could be in the completely wrong ballpark and end up being a poor performer internally.
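
A managed alternative in the same spirit (no unsafe fixed buffers) would be to map each record type to a small class whose factory knows the offsets. This is only a sketch; the field names, offsets and widths below are made up:

// Sketch only: field names, offsets and widths are invented for illustration.
class TypeOneRecord
{
    public string RecordType { get; private set; }
    public string Account    { get; private set; }
    public string Name       { get; private set; }

    public static TypeOneRecord FromLine(string line)
    {
        return new TypeOneRecord
        {
            RecordType = line.Substring(0, 1),
            Account    = line.Substring(1, 10).TrimEnd(),
            Name       = line.Substring(11, 30).TrimEnd(),
        };
    }
}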



I love LINQ

IEnumerable<string> ReadFile(string path)
{
    using (var reader = new StreamReader(path))
    {
        while (!reader.EndOfStream)
        {
            yield return reader.ReadLine();
        }
    }
}


void DoThing()
{
    var myMethods = new Action<string>[]
    {
        line =>
        {
            // Process 0
            var type = line.Substring(0, 15);
            var name = line.Substring(15, 30);
            //... and 20 more substrings
        },
        line =>
        {
            // Process 1
            var type = line.Substring(0, 15);
            var name = line.Substring(15, 30);
            //... and 20 more substrings
        },
        //...
    };

    var actions = ReadFile(@"c:\path\to\file.txt")
        .Select(line => new Action(() => myMethods[line[0] - '0'](line)))
        .ToArray();

    foreach (var action in actions)
        action.Invoke();
}

4 Comments

This won't improve performance at all. In any case, File.ReadLines does the same as ReadFile here (see the sketch after these comments).
@PanagiotisKanavos Is the asker looking for algorithmic optimization only?
When reading 150 MB of data, the question isn't about algorithms at all, it's about speed and memory. BTW an array/dictionary of Regex objects indexed by the first character would really help. In any case though, the OP is asking about COBOL file parsing. I suspect there is at least one library for this.
@PanagiotisKanavos I agree with you. If you think this answer adds any value or might be beneficial for future readers without confusing them, I will keep it. Otherwise I am willing to delete it. Your call.
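
For reference, the File.ReadLines replacement the first comment mentions would look roughly like this; it is only a sketch, reusing the placeholder path and the myMethods dispatch table from the answer above:

// Sketch only: File.ReadLines enumerates the file lazily, like the ReadFile iterator above.
foreach (var line in File.ReadLines(@"c:\path\to\file.txt"))
{
    myMethods[line[0] - '0'](line);   // dispatch on the record type in column 0
}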
