0

I have a text file in the format:

number tab word tab word tab junk
number tab word tab word tab junk
number tab word tab word tab junk
number tab word tab word tab junk
number tab word tab word tab junk

For each line I'd like to put the number in a uint32_t, then the two words into strings and then ignore the rest of the line. I could do this by loading the file into memory and then working through it a byte at a time, but I'm convinced that a lovely regex could do it for me. Any ideas?

I'm working in C++ using #include in Xcode. This is a command-line tool, so there's no real output; I'm just storing the data to compare with other data.

6
  • as you mention c++ and regex is this a c++11 question or do you use some third-party library? Commented Apr 28, 2014 at 14:43
  • 1
    Why not just file >> num >> word1 >> word2? Then you can either read the junk in and discard it, or skip it with .ignore()... Commented Apr 28, 2014 at 14:44
  • post an example of the desired output. Commented Apr 28, 2014 at 14:46
  • I've updated the question - I'll have a go at file >> num >> word1 >> word2 - clever stuff! Commented Apr 28, 2014 at 14:52
  • There are still outputs! Just because you're not sending anything to std::cout doesn't mean you aren't expecting the data in a certain format and storing it somewhere. All of that can be thought of as output. Your output in this case is the data that you're storing, and the format that you'd like it in... Commented Apr 28, 2014 at 15:08

2 Answers

1
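#include <fstream>
#include <limits>
#include <string>

// Defined elsewhere: consumes one parsed record and returns false to abort.
extern bool DoStuff(unsigned n, 
                    const std::string &s0, 
                    const std::string &s1);

bool ProcessFile(const std::string &sFileName)
{
    std::ifstream ifs(sFileName);
    if (!ifs)
        return false;

    unsigned n;
    std::string s0, s1;
    // Extract in the loop condition so DoStuff only ever sees a fully parsed record.
    while (ifs >> n >> s0 >> s1)
    {
        if (!DoStuff(n, s0, s1))
            return false;
        // Skip the junk after the second word, up to and including the newline.
        ifs.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
    }
    // The loop ends at end of file or at the first malformed line;
    // only a genuine I/O error is reported as failure.
    return !ifs.bad();
}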
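
If the parsed values need to be stored rather than handed to a callback, the same extraction idea can be applied one line at a time. The following is a minimal sketch, not part of the answer above: the `Record` struct and the `ReadRecords` function are hypothetical names introduced for illustration, and it assumes the two words contain no whitespace. Parsing each line through its own `std::istringstream` means a malformed line is skipped without affecting the lines after it.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type: the three fields kept from each line.
struct Record {
    std::uint32_t number;
    std::string first;
    std::string second;
};

// Reads "number<TAB>word<TAB>word<TAB>junk" lines from any input stream.
// Lines that do not start with a number followed by two words are skipped.
std::vector<Record> ReadRecords(std::istream &in)
{
    std::vector<Record> records;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        Record r;
        // operator>> splits on any whitespace, so tabs work as separators.
        if (fields >> r.number >> r.first >> r.second)
            records.push_back(r);
        // Anything left in `fields` is the junk; it is discarded with the line.
    }
    return records;
}
```

Because the function takes a `std::istream &`, it can be called with a `std::ifstream` opened on the file or with a `std::istringstream` when testing.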

1

Matt, you can use this simple regex:

(?im)^(\d+)\t([a-z]+)\t([a-z]+)

It captures the number in Group 1, the first word in Group 2, and the second word in Group 3.

To retrieve them from Groups 1, 2 and 3, I am not sure of the exact syntax in your C++ environment, but this code stub gives one idea of how to iterate over the matches and groups. Note that in this case we don't care about the overall matches, just the capturing groups.

try {
    TRegEx RegEx("(?im)^(\\d+)\t([a-z]+)\t([a-z]+)", TRegExOptions() << roIgnoreCase << roMultiLine);
    TMatch Match = RegEx.Match(SubjectString);
    while (Match.Success) {
        for (int i = 1; i < Match.Groups.Count; i++) {
            TGroup Group = Match.Groups[i];
            if (Group.Success) {
                // matched text: Group.Value
                // match start: Group.Index
                // match length: Group.Length
            } 
        }
        Match = Match.NextMatch();
    } 
} catch (ERegularExpressionError *ex) {
    // Syntax error in the regular expression
}
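
The stub above uses a `TRegEx` class rather than the standard library. For the C++11 `<regex>` header, a rough equivalent might look like the sketch below. It is an adaptation, not the answer's own code: `ParseLine` is a hypothetical helper name, the inline `(?im)` flags are dropped because the default ECMAScript grammar of `std::regex` does not accept them, the character class is written as `[A-Za-z]` to cover both cases explicitly, and the pattern is applied to one line at a time so that `^` only ever needs to match at the start of the string.

```cpp
#include <cstdint>
#include <regex>
#include <string>

// Hypothetical helper: extracts the three leading fields from one line.
// Returns false if the line does not start with number<TAB>word<TAB>word.
bool ParseLine(const std::string &line,
               std::uint32_t &number,
               std::string &first,
               std::string &second)
{
    // Raw string literal: \d and \t are passed through to the regex engine.
    static const std::regex pattern(R"(^(\d+)\t([A-Za-z]+)\t([A-Za-z]+))");

    std::smatch m;
    if (!std::regex_search(line, m, pattern))
        return false;

    // m[0] is the overall match; m[1]..m[3] are the capturing groups.
    // std::stoul throws std::out_of_range for values too large for unsigned long.
    number = static_cast<std::uint32_t>(std::stoul(m[1].str()));
    first = m[2].str();
    second = m[3].str();
    return true;
}
```

A caller would typically read the file with `std::getline` and pass each line to `ParseLine`. On platforms where `unsigned long` is wider than 32 bits, the cast silently truncates larger values, so a range check may be worth adding if the numbers are not known to fit.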
