1

I have a bunch of text files that have a custom format, looking like this:

App Name    
Export Layout

Produced at 24/07/2011 09:53:21


Field Name                             Length                                                       

NAME                                   100                                                           
FULLNAME1                              150                                                           
ADDR1                                  80                                                           
ADDR2                                  80          

Any whitespace may be tabs or spaces. The file may contain any number of field names and lengths.

I want to get all the field names and their corresponding field lengths and perhaps store them in a dictionary. This information will be used to process a corresponding fixed width data file having the mentioned field names and field lengths.

I know how to skip lines using ReadLine(). What I don't know is how to say: "When you reach the line that starts with 'Field Name', skip one more line, then starting from the next line, grab all the words on the left column and the numbers on the right column."

I have tried String.Trim(), but that doesn't remove the whitespace in between.

Thanks in advance.

3
  • 1
    Google "recursive descent parsing". You don't have a regular grammar, so grammar-driven parsing tools will not be likely to help. Commented Jul 24, 2014 at 8:20
  • Is the position of the line with Field Name fixed? Commented Jul 24, 2014 at 8:21
  • @shree.pat18 I would assume so. Commented Jul 24, 2014 at 10:18

3 Answers

6

You can use SkipWhile(l => !l.TrimStart().StartsWith("Field Name")).Skip(1):

Dictionary<string, string> allFieldLengths = File.ReadLines("path")
    .SkipWhile(l => !l.TrimStart().StartsWith("Field Name")) // skips lines that don't start with "Field Name"
    .Skip(1)                                       // skip the "Field Name" header line itself
    .SkipWhile(l => string.IsNullOrWhiteSpace(l))  // skip following empty line(s)
    .Select(l =>
    {                                              // statement lambda to use "real code"
        var line = l.Trim();                       // remove spaces or tabs from start and end of line
        string[] token = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
        return new { line, token };                // return anonymous type from the lambda
    })
    .Where(x => x.token.Length == 2)               // ignore all lines that don't have exactly two tokens (invalid data)
    .Select(x => new { FieldName = x.token[0], Length = x.token[1] })
    .GroupBy(x => x.FieldName)                     // groups lines by FieldName, every group contains its Key + all anonymous types which belong to this group
    .ToDictionary(xg => xg.Key, xg => string.Join(",", xg.Select(x => x.Length)));

line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries) splits on spaces and tabs and discards the empty entries that consecutive separators would otherwise produce. GroupBy ensures that all keys in the dictionary are unique. If a field name occurs more than once, its lengths are joined with commas.


Edit: since you have requested a non-LINQ version, here it is:

Dictionary<string, string> allFieldLengths = new Dictionary<string, string>();
bool headerFound = false;
bool dataFound = false;
foreach (string l in File.ReadLines("path"))
{
    string line = l.Trim();
    if (!headerFound && line.StartsWith("Field Name"))
    {
        headerFound = true;
        // skip this line:
        continue;
    }
    if (!headerFound)
        continue;
    if (!dataFound && line.Length > 0)
        dataFound = true;
    if (!dataFound)
        continue;
    string[] token = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries); // split on spaces and tabs
    if (token.Length != 2)
        continue;
    string fieldName = token[0];
    string length = token[1];
    string lengthInDict;
    if (allFieldLengths.TryGetValue(fieldName, out lengthInDict))
        // append this length
        allFieldLengths[fieldName] = lengthInDict + "," + length;
    else
        allFieldLengths.Add(fieldName, length);
}

I like the LINQ version more because it's much more readable and maintainable (imo).


10 Comments

@Terribad: i've added some inline code comments, i hope that it provides a sufficient explanation. Otherwise say what you don't understand.
I'm unfamiliar with LINQ, and that looks like a LOT of LINQ :P so I'm wondering if I can do this using line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries) but without the LINQ stuff.
@Terribad: however, i've provided also a non-linq version :)
@Terribad: note that the non-LINQ version also uses File.ReadAllLines instead of File.ReadLines to support a for-loop. The disadvantage is that it needs to load all into memory before it can start processing as opposed to ReadLines. Maybe a foreach is sufficient, then you can also use ReadLines with the non-LINQ way. Edit i've tested it, you can use the foreach + File.ReadLines. Changed the code above.
@Terribad: I've just tried to "translate" your requirement into code. If you can rely on fixed rules, that can of course simplify the logic. But do you really want to rely on the line number only? Use File.ReadLines("").Skip(8) to read all lines starting from the 9th.
1

Based on the assumption that the position of the header line is fixed, we may consider actual key-value pairs to start from the 9th line. Then, using the ReadAllLines method to return a String array from the file, we just start processing from index 8 onwards:

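// Needs: using System.Collections.Generic; using System.IO; using System.Text.RegularExpressions;
string[] lines = File.ReadAllLines(filepath);
Dictionary<string, int> pairs = new Dictionary<string, int>();

for (int i = 8; i < lines.Length; i++)
{
    // Trim first so leading/trailing whitespace does not produce empty tokens
    string[] pair = Regex.Replace(lines[i].Trim(), "\\s+", ";").Split(';');
    if (pair.Length != 2)
        continue; // skip blank or malformed lines
    pairs.Add(pair[0], int.Parse(pair[1]));
}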

This is a skeleton, not accounting for exception handling, but I guess it should get you started.
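
To fill in the exception-handling gap, here is a minimal sketch that uses int.TryParse and skips rows it cannot read rather than throwing. The class name LayoutReader and the method ParseFields are made up for illustration, and it keeps the same assumption that the field rows start at index 8:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; the names are illustrative, not part of the original answer.
public static class LayoutReader
{
    // Assumes field rows start at index 8 (the 9th line) unless told otherwise.
    public static Dictionary<string, int> ParseFields(string[] lines, int firstRow = 8)
    {
        var pairs = new Dictionary<string, int>();
        for (int i = firstRow; i < lines.Length; i++)
        {
            // A null separator array splits on any whitespace (spaces and tabs).
            string[] tokens = lines[i].Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
            int length;
            // Skip blank lines and rows whose second column is not a number.
            if (tokens.Length != 2 || !int.TryParse(tokens[1], out length))
                continue;
            // The indexer overwrites a duplicate name instead of throwing like Add would.
            pairs[tokens[0]] = length;
        }
        return pairs;
    }
}
```

It could be called as LayoutReader.ParseFields(File.ReadAllLines(filepath)). Whether a duplicate field name should overwrite, be rejected, or be collected is a design choice; overwriting is just the simplest option here.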

Comments

0

You can use String.StartsWith() to detect "Field Name". Then use String.Split() with a parameter of null to split by whitespace. This will get you your field name and length strings.

1 Comment

I tried this and it also gets all the whitespaces in between the two columns.
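
The extra entries appear because Split(null) returns an empty string for every pair of adjacent whitespace characters. A small sketch, assuming a row shaped like those in the question's file, that passes StringSplitOptions.RemoveEmptyEntries so only the two column values remain (the class name SplitDemo is hypothetical):

```csharp
using System;

// Hypothetical demo class; not part of the original answer.
public static class SplitDemo
{
    // Splits a layout row on any whitespace and drops the empty entries
    // that runs of spaces or tabs would otherwise produce.
    public static string[] Tokens(string row)
    {
        // The cast picks the char[] overload; a bare null would be ambiguous.
        return row.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

For example, SplitDemo.Tokens("NAME \t   100  ") yields the two strings "NAME" and "100".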
