Parsing a text file with a custom format in C#

Question

I have a bunch of text files that have a custom format, looking like this:

App Name    
Export Layout

Produced at 24/07/2011 09:53:21


Field Name                             Length                                                       

NAME                                   100                                                           
FULLNAME1                              150                                                           
ADDR1                                  80                                                           
ADDR2                                  80

Any whitespace may be tabs or spaces. The file may contain any number of field names and lengths.

I want to get all the field names and their corresponding field lengths and perhaps store them in a dictionary. This information will be used to process a corresponding fixed width data file having the mentioned field names and field lengths.

I know how to skip lines using ReadLine(). What I don't know is how to say: "When you reach the line that starts with 'Field Name', skip one more line, then starting from the next line, grab all the words on the left column and the numbers on the right column."

I have tried String.Trim(), but that doesn't remove the whitespace in between.

Thanks in advance.

Google "recursive descent parsing". You don't have a regular grammar, so grammar-driven parsing tools will not be likely to help.
– Pieter Geerkens, Commented Jul 24, 2014 at 8:20

Tim Schmelter · Accepted Answer · 2014-07-24 09:41:45Z

6

You can use SkipWhile(l => !l.TrimStart().StartsWith("Field Name")).Skip(1):

// Requires: using System; using System.Collections.Generic; using System.IO; using System.Linq;
Dictionary<string, string> allFieldLengths = File.ReadLines("path")
    .SkipWhile(l => !l.TrimStart().StartsWith("Field Name")) // skip lines until the one that starts with "Field Name"
    .Skip(1)                                       // skip the header line itself
    .SkipWhile(l => string.IsNullOrWhiteSpace(l))  // skip following empty line(s)
    .Select(l =>
    {                                              // statement lambda, so we can use "real code"
        var line = l.Trim();                       // remove spaces or tabs from start and end of line
        string[] token = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
        return new { line, token };                // return an anonymous type holding the line and its tokens
    })
    .Where(x => x.token.Length == 2)               // ignore all lines that don't have exactly two tokens (invalid data)
    .Select(x => new { FieldName = x.token[0], Length = x.token[1] })
    .GroupBy(x => x.FieldName)                     // group by FieldName; every group contains its Key plus all anonymous objects that belong to it
    .ToDictionary(xg => xg.Key, xg => string.Join(",", xg.Select(x => x.Length)));

line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries) splits on spaces and tabs and drops the empty entries produced by consecutive separators. GroupBy ensures that all keys in the dictionary are unique: if a field name occurs more than once, its lengths are joined with commas.
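To see what the split and the grouping do, here is a minimal sketch that runs the same steps on in-memory lines instead of a file. The class name SplitDemo, the helper ParseFields, and the sample lines are made up for illustration; the duplicate ADDR1 line is there only to show the comma join.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SplitDemo
{
    // Hypothetical helper: the same pipeline as above, but over any sequence of lines.
    public static Dictionary<string, string> ParseFields(IEnumerable<string> lines)
    {
        return lines
            .SkipWhile(l => !l.TrimStart().StartsWith("Field Name"))
            .Skip(1)
            .SkipWhile(l => string.IsNullOrWhiteSpace(l))
            .Select(l => l.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries))
            .Where(token => token.Length == 2)
            .GroupBy(token => token[0], token => token[1])   // key = name, elements = lengths
            .ToDictionary(g => g.Key, g => string.Join(",", g));
    }

    static void Main()
    {
        var sample = new[]
        {
            "App Name",
            "",
            "Field Name \t  Length  ",
            "",
            "NAME\t\t100   ",
            "  ADDR1      80",
            "ADDR1      40",   // made-up duplicate field name
        };

        foreach (var pair in ParseFields(sample))
            Console.WriteLine(pair.Key + " = " + pair.Value);
    }
}
```

With this sample, ParseFields(sample)["NAME"] is "100" and ParseFields(sample)["ADDR1"] is "80,40".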

Edit: since you have requested a non-LINQ version, here it is:

Dictionary<string, string> allFieldLengths = new Dictionary<string, string>();
bool headerFound = false;
bool dataFound = false;
foreach (string l in File.ReadLines("path"))
{
    string line = l.Trim();
    if (!headerFound && line.StartsWith("Field Name"))
    {
        headerFound = true;
        // skip this line:
        continue;
    }
    if (!headerFound)
        continue;
    if (!dataFound && line.Length > 0)
        dataFound = true;
    if (!dataFound)
        continue;
    string[] token = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries); // split on spaces and tabs, as in the LINQ version
    if (token.Length != 2)
        continue;
    string fieldName = token[0];
    string length = token[1];
    string lengthInDict;
    if (allFieldLengths.TryGetValue(fieldName, out lengthInDict))
        // append this length
        allFieldLengths[fieldName] = lengthInDict + "," + length;
    else
        allFieldLengths.Add(fieldName, length);
}

I like the LINQ version more because it's much more readable and maintainable (imo).

edited Jul 24, 2014 at 9:41

answered Jul 24, 2014 at 8:24

Tim Schmelter


10 Comments

Tim Schmelter Over a year ago

@Terribad: I've added some inline code comments; I hope they provide a sufficient explanation. Otherwise, say what you don't understand.

InvalidBrainException Over a year ago

I'm unfamiliar with LINQ, and that looks like a LOT of LINQ :P so I'm wondering if I can do this using line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries) but without the LINQ stuff.

Tim Schmelter Over a year ago

@Terribad: however, I've also provided a non-LINQ version :)

Tim Schmelter Over a year ago

@Terribad: note that the non-LINQ version originally used File.ReadAllLines instead of File.ReadLines to support a for-loop. The disadvantage is that it has to load the whole file into memory before it can start processing, unlike ReadLines. Edit: I've tested it; a foreach with File.ReadLines works too, so I've changed the code above.

Tim Schmelter Over a year ago

@Terribad: I've just tried to "translate" your requirement into code. If you can rely on fixed rules, that can of course simplify the logic. But do you really want to rely on the line number only? If so, use File.ReadLines("").Skip(8) to read all lines starting from the 9th.


shree.pat18 · Answer · 2014-07-24 10:38:57Z

1

Based on the assumption that the position of the header line is fixed, we may consider actual key-value pairs to start from the 9th line. Then, using the ReadAllLines method to return a String array from the file, we just start processing from index 8 onwards:

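// Requires: using System.Collections.Generic; using System.IO; using System.Text.RegularExpressions;
string[] lines = File.ReadAllLines(filepath);
Dictionary<string, int> pairs = new Dictionary<string, int>();

for (int i = 8; i < lines.Length; i++)
{
    // collapse each run of whitespace into a single ';', then split on it
    string[] pair = Regex.Replace(lines[i], "(\\s)+", ";").Split(';');
    pairs.Add(pair[0], int.Parse(pair[1]));
}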

This is a skeleton, not accounting for exception handling, but I guess it should get you started.
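One way to fill in the missing error handling is sketched below, assuming the same fixed header position. The class name LayoutReader, the method ParsePairs, and the choice to skip malformed lines silently (rather than throw) are assumptions for illustration, not part of the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LayoutReader
{
    // Hypothetical helper: parses "NAME  100"-style lines starting at firstDataIndex.
    public static Dictionary<string, int> ParsePairs(string[] lines, int firstDataIndex)
    {
        var pairs = new Dictionary<string, int>();
        for (int i = firstDataIndex; i < lines.Length; i++)
        {
            string line = lines[i].Trim();
            if (line.Length == 0)
                continue;                      // blank line

            string[] parts = Regex.Split(line, @"\s+");
            int length;
            if (parts.Length != 2 || !int.TryParse(parts[1], out length))
                continue;                      // not a "name length" line

            pairs[parts[0]] = length;          // a repeated name overwrites the earlier value
        }
        return pairs;
    }

    static void Main()
    {
        string[] lines = { "Field Name   Length", "", "NAME \t 100  ", "oops", "ADDR1  80" };
        foreach (var pair in ParsePairs(lines, 2))
            Console.WriteLine(pair.Key + ": " + pair.Value);
    }
}
```

For a real file the call would look like LayoutReader.ParsePairs(File.ReadAllLines(filepath), 8). Using the indexer instead of Add avoids an ArgumentException on duplicate field names, at the cost of keeping only the last length seen.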

answered Jul 24, 2014 at 10:38

shree.pat18


GazTheDestroyer · Answer · 2014-07-24 08:24:43Z

0

You can use String.StartsWith() to detect the "Field Name" header line. Then call String.Split() with a parameter of null to split on whitespace. This will get you your field name and length strings.
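A caveat worth illustrating: with a null separator, Split treats every whitespace character as a delimiter, so a run of spaces produces empty strings between the two columns unless StringSplitOptions.RemoveEmptyEntries is also passed. The sketch below shows both calls on a made-up line; SplitNullDemo and Columns are hypothetical names.

```csharp
using System;

static class SplitNullDemo
{
    // Hypothetical helper: whitespace split that drops the empty entries.
    public static string[] Columns(string line)
    {
        return line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    }

    static void Main()
    {
        string line = "ADDR1    80";                // four spaces between the columns

        string[] raw = line.Split((char[])null);    // keeps the empty entries
        string[] clean = Columns(line);

        Console.WriteLine(raw.Length);              // 5: "ADDR1", "", "", "", "80"
        Console.WriteLine(clean.Length);            // 2: "ADDR1", "80"
    }
}
```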

answered Jul 24, 2014 at 8:24

GazTheDestroyer


1 Comment

InvalidBrainException Over a year ago

I tried this and it also picks up all the whitespace in between the two columns.
