1

I have a huge set of log lines and I need to parse each line (so efficiency is very important).

Each log line is of the form

cust_name time_start time_end (IP or URL )*

So: a customer name, a start time, an end time, and a possibly empty list of IP addresses or URLs. If the list contains only one IP or URL there is no separator; if it contains more than one, they are separated by semicolons.

I need a way to parse this line and read it into a data structure. time_start or time_end could be either system time or GMT. cust_name could also have multiple strings separated by spaces.

I can do this by reading character by character and essentially writing my own parser. Is there a better way to do this?

2
  • Hmmm... can you guarantee that semicolons don't appear in your URLs? Or at least that they don't appear at the ends? Commented Mar 5, 2009 at 18:44
  • What's your goal? what are you going to do with the data after you parse it? Commented Mar 5, 2009 at 19:00

10 Answers

7

Maybe the Boost.Regex library will help you: http://www.boost.org/doc/libs/1_38_0/libs/regex/doc/html/index.html


2 Comments

I up-modded, but remember, "Those who attempt to solve a problem using regular expressions now have two problems."
:) Nice quote. But anyway, regular expressions are a good solution for small or non-critical tasks.
5

I've had success with Boost Tokenizer for this sort of thing. It helps you break an input stream into tokens with custom separators between the tokens.


4

Using regular expressions (boost::regex is a nice implementation for C++), you can easily separate the different parts of your string (cust_name, time_start, and so on) and find all the URLs and IPs.

The second step is more detailed parsing of those groups, if needed. Dates, for example, can be parsed with the Boost.Date_Time library (writing a custom parser if the string format isn't standard).
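As a rough sketch of the first step, here is how the split into groups might look with std::regex, the standard-library descendant of boost::regex (swapped in here so the example has no external dependency). The question does not give the time format, so the pattern assumes both times look like HH:MM:SS; the `LogFields` struct and `match_line` function are hypothetical names, and `std::optional` requires C++17.

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical record type; the field names are illustrative only.
struct LogFields {
    std::string cust_name;   // may contain spaces
    std::string time_start;  // kept as text; parse further if needed
    std::string time_end;
    std::string targets;     // raw "a;b;c" list, empty if absent
};

// Assumption: both times look like HH:MM:SS. Adjust the pattern to the
// real time format of your logs.
std::optional<LogFields> match_line(const std::string& line) {
    // Compile the pattern once; constructing a std::regex is expensive.
    static const std::regex pattern(
        R"(^(.+?)\s+(\d{2}:\d{2}:\d{2})\s+(\d{2}:\d{2}:\d{2})(?:\s+(\S+))?\s*$)");
    std::smatch m;
    if (!std::regex_match(line, m, pattern)) {
        return std::nullopt;  // line does not fit the assumed format
    }
    // An unmatched optional group yields an empty string.
    return LogFields{m[1].str(), m[2].str(), m[3].str(), m[4].str()};
}
```

The lazy `(.+?)` group lets a multi-word customer name grow until the two time fields can match. Whether a regex is fast enough for a huge log is something to measure rather than assume.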


3

Why do you want to do this in C++? It sounds like an obvious job for something like Perl.

5 Comments

Sure. If he's just doing this job. But the context might be an existing code with some other primary task...
He's interested in performance, and a custom C++ parser will blow the doors off a Perl parser for speed of execution (but not speed of development).
David, that's not necessarily true. It can very easily backfire on him (in terms of performance) if he stores the resulting gigantic data structure in memory! C++ won't help there.
@david untrue - the regex engine in perl has had untold man years spent on it - you are very unlikely to do as good a job with hand-rolled C++ code
I am using C++ because this is part of a full application where the data structures I create are used by the rest of the app.
2

Consider using a Regular Expressions library...

1 Comment

And next thing you know, we have another how do I parse URLs question.
1

Custom input demands a custom parser. Otherwise you are praying for an ideal world where errors don't exist. This is especially true if you want efficiency. Posting some code may help.


1

For such a simple grammar you can use split. Take a look at http://www.boost.org/doc/libs/1_38_0/doc/html/string_algo/usage.html#id4002194
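To illustrate the idea without pulling in Boost, here is a minimal standard-library sketch (C++17) of splitting the trailing IP/URL list on semicolons. It is a hand-written stand-in for the linked split algorithm, not that library's API; `split_targets` is a hypothetical name, and the code assumes semicolons never occur inside a URL, which the comments on the question point out may not hold.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Split a separator-delimited list into its items.
// An empty input yields an empty vector; a single item needs no separator.
// Assumption: the separator never appears inside an item.
std::vector<std::string> split_targets(std::string_view list, char sep = ';') {
    std::vector<std::string> items;
    std::size_t start = 0;
    while (start <= list.size()) {
        std::size_t end = list.find(sep, start);
        if (end == std::string_view::npos) {
            end = list.size();  // last item runs to the end of the input
        }
        if (end > start) {  // skip empty items such as a trailing ';'
            items.emplace_back(list.substr(start, end - start));
        }
        start = end + 1;
    }
    return items;
}
```

Taking a `std::string_view` avoids copying the input; only the individual items are copied into the result.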


1

UPDATE changed answer drastically!

I have a huge set of log lines and I need to parse each line (so efficiency is very important).

Just be aware that C++ won't help much in terms of efficiency in this situation. Don't assume that fast parsing code in C++ means your program as a whole will perform well!

The efficiency you really need here is not the performance at the "machine code" level of the parsing code, but at the overall algorithm level.

Think about what you're trying to do.
You have a huge text file, and you want to convert each line to a data structure.

Storing a huge data structure in memory is very inefficient, no matter what language you're using!

What you need to do is fetch one line at a time, convert it to a data structure, and deal with it. Only after you're done with that data structure do you fetch the next line, convert it, deal with it, and repeat.

If you do that, you've already solved the major bottleneck.

For parsing the line of text, the format of your data seems quite simple. Check out a similar question that I asked a while ago: C++ string parsing (Python style)

In your case, I suppose you could use a string stream, and use the >> operator to read the next "thing" in the line.

See this answer for example code.
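For illustration, a string-stream version might look roughly like the following. It is a sketch under simplifying assumptions that are not guaranteed by the question: cust_name is a single word, each time is a single whitespace-free token, and the IP/URL list contains no spaces. `Entry` and `parse_with_stream` are hypothetical names.

```cpp
#include <sstream>
#include <string>

// Hypothetical record type; field names are illustrative only.
struct Entry {
    std::string cust_name;
    std::string time_start;
    std::string time_end;
    std::string targets;  // raw semicolon-separated list, may be empty
};

// Assumptions: cust_name is one word, and each time is one token.
// Returns false if any of the three mandatory fields is missing.
bool parse_with_stream(const std::string& line, Entry& out) {
    std::istringstream in(line);
    if (!(in >> out.cust_name >> out.time_start >> out.time_end)) {
        return false;
    }
    out.targets.clear();
    in >> out.targets;  // optional field; stays empty if nothing is left
    return true;
}
```

Because `>>` splits on whitespace, a multi-word customer name would need a different strategy, such as locating the time fields first.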

Alternatively (I didn't want to delete this part!), if you could write this in Python it would be much simpler. I don't know your situation (it seems you're stuck with C++), but still:

Look at this presentation on doing these kinds of tasks efficiently using Python generator expressions: http://www.dabeaz.com/generators/Generators.pdf

It's a worthwhile read. At slide 31 he deals with what seems to be something very similar to what you're trying to do.

It'll at least give you some inspiration.
It also demonstrates quite strongly that performance comes not from the particular string-parsing code, but from the overall algorithm.

2 Comments

I think you are conflating a good idea (process one line at a time) with one that depends on the context (don't use C++ for this). Moreover, the OP notes in the comments to another answer that he's doing this in an existing C++ codebase. Nonetheless, +1 for the one-at-a-time point.
Good point! I changed the answer. In my defense, though, he mentioned the existing C++ app quite a while after I posted my answer.
0

You could try using a simple lex/yacc (or flex/bison) grammar to parse this kind of input.


0

The parser you need sounds really simple. Take a look at this. Any compiled language should be able to parse it at very high speed. Then it's an issue of what data structure you build & save.
