Parsing a string in C++

Question

I have a huge set of log lines and I need to parse each line (so efficiency is very important).

Each log line is of the form

cust_name time_start time_end (IP or URL )*

So a customer name, a start time, an end time, and a possibly empty list of IP addresses or URLs. If there is only one IP or URL in the list, there is no separator. If there is more than one, they are separated by semicolons.

I need a way to parse this line and read it into a data structure. time_start or time_end could be either system time or GMT. cust_name could also have multiple strings separated by spaces.

I can do this by reading character by character and essentially writing my own parser. Is there a better way to do this?

Hmmm... can you guarantee that semicolons don't appear in your URLs? Or at least that they don't appear at the ends?
– dmckee --- ex-moderator kitten, Commented Mar 5, 2009 at 18:44
What's your goal? What are you going to do with the data after you parse it?
– hasen, Commented Mar 5, 2009 at 19:00

bayda · Accepted Answer · 2009-03-05 18:38:35Z

7

Maybe the Boost.Regex library will help you: http://www.boost.org/doc/libs/1_38_0/libs/regex/doc/html/index.html

2 Comments

Matt Cruikshank Over a year ago

I up-modded, but remember, "Those who attempt to solve a problem using regular expressions now have two problems."

bayda Over a year ago

:) Nice quote. But regular expressions are still a good solution for small or non-critical tasks.

Michael Kristofik · Accepted Answer · 2009-03-05 18:45:16Z

5

I've had success with Boost Tokenizer for this sort of thing. It helps you break an input stream into tokens with custom separators between the tokens.

begray · Accepted Answer · 2009-03-05 18:53:30Z

4

Using regular expressions (boost::regex is a nice implementation for C++) you can easily separate the different parts of your string (cust_name, time_start, ...) and find all the URLs and IPs.

The second step is more detailed parsing of those groups, if needed. Dates, for example, can be parsed with the Boost.Date_Time library (or with a custom parser if the string format isn't standard).
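As a rough illustration of the first step, here is a minimal sketch. It uses std::regex (C++11) instead of boost::regex so that it builds without Boost; that is a swap made for this example, not what the answer prescribes. The names LogParts and split_log_line are hypothetical, and the pattern assumes that cust_name and both times are single whitespace-free tokens, which the question does not guarantee.

```cpp
#include <regex>
#include <string>

// Hypothetical holder for the raw fields of one log line.
struct LogParts {
    std::string cust_name;
    std::string time_start;
    std::string time_end;
    std::string targets;  // raw "a;b;c" list, empty if the line has none
};

// Assumed layout: three whitespace-free tokens, optionally followed by
// one token holding the semicolon-separated list of IPs or URLs.
bool split_log_line(const std::string& line, LogParts& out) {
    // Built once and reused: constructing a std::regex is expensive.
    static const std::regex pattern(
        R"(^(\S+)\s+(\S+)\s+(\S+)(?:\s+(\S+))?\s*$)");
    std::smatch m;
    if (!std::regex_match(line, m, pattern)) {
        return false;  // line does not have the assumed shape
    }
    out.cust_name = m[1].str();
    out.time_start = m[2].str();
    out.time_end = m[3].str();
    out.targets = m[4].str();  // empty string when the group did not match
    return true;
}
```

The captured time strings and the targets list can then go through the more detailed second step. If cust_name can contain spaces, the pattern would have to anchor on the time format instead, which depends on details the question leaves open.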

anon · Accepted Answer · 2009-03-05 18:36:06Z

3

Why do you want to do this in C++? It sounds like an obvious job for something like Perl.

5 Comments

dmckee --- ex-moderator kitten Over a year ago

Sure. If he's just doing this job. But the context might be an existing code with some other primary task...

David Thornley Over a year ago

He's interested in performance, and a custom C++ parser will blow the doors off a Perl parser for speed of execution (but not speed of development).

hasen Over a year ago

David, that's not necessarily true. It can very easily backfire on him (in terms of performance) if he stores the resulting gigantic data structure in memory! C++ won't help there.

anon Over a year ago

@david untrue - the regex engine in perl has had untold man years spent on it - you are very unlikely to do as good a job with hand-rolled C++ code

duli Over a year ago

I am using C++ because this is part of a full application where the data structures I create are used by the rest of the app.

Andrew Flanagan · Accepted Answer · 2009-03-05 18:35:25Z

2

Consider using a Regular Expressions library...

1 Comment

dirkgently Over a year ago

And next thing you know, we have another how do I parse URLs question.

dirkgently · Accepted Answer · 2009-03-05 18:34:54Z

1

Custom input demands a custom parser. Or pray that the world is ideal and errors don't exist. Especially if you want efficiency. Posting some code may help.

Ylisar · Accepted Answer · 2009-03-05 18:47:31Z

1

For such a simple grammar you can use split; take a look at http://www.boost.org/doc/libs/1_38_0/doc/html/string_algo/usage.html#id4002194
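For readers without Boost at hand, the same idea can be sketched with the standard library alone. This is a stand-in for boost::split, not its implementation; split_list is a hypothetical helper name, and the semicolon separator comes from the question.

```cpp
#include <string>
#include <vector>

// Splits "a;b;c" on sep. An empty input yields an empty list, and a
// single entry with no separator yields one element. A trailing
// separator ("a;b;") produces a final empty element.
std::vector<std::string> split_list(const std::string& list, char sep = ';') {
    std::vector<std::string> items;
    if (list.empty()) {
        return items;
    }
    std::string::size_type start = 0;
    while (true) {
        const std::string::size_type end = list.find(sep, start);
        if (end == std::string::npos) {
            items.push_back(list.substr(start));  // last (or only) entry
            break;
        }
        items.push_back(list.substr(start, end - start));
        start = end + 1;  // continue after the separator
    }
    return items;
}
```

As the comment on the question points out, this only works if semicolons never appear inside the URLs themselves.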

Community · Accepted Answer · 2017-05-23 12:10:51Z

1

UPDATE changed answer drastically!

I have a huge set of log lines and I need to parse each line (so efficiency is very important).

Just be aware that C++ alone won't help much with efficiency in this situation. Don't be fooled into thinking that your program will perform well just because the parsing code itself is fast C++!

The efficiency you really need here is not the performance at the "machine code" level of the parsing code, but at the overall algorithm level.

Think about what you're trying to do.
You have a huge text file, and you want to convert each line to a data structure.

Storing a huge data structure in memory is very inefficient, no matter what language you're using!

What you need to do is fetch one line at a time, convert it to a data structure, and deal with it. Only after you're done with that data structure do you fetch the next line, convert it, deal with it, and repeat.

If you do that, you've already solved the major bottleneck.

For parsing the line of text, the format of your data seems quite simple; check out a similar question that I asked a while ago: C++ string parsing (python style)

In your case, I suppose you could use a string stream, and use the >> operator to read the next "thing" in the line.

See this answer for example code.
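A minimal sketch of the string-stream idea, under a loud assumption: every field is a single whitespace-free token. The question says cust_name may contain spaces, and in that case plain >> extraction is not enough on its own. LogRecord and parse_with_stream are hypothetical names, and the times are left as strings.

```cpp
#include <sstream>
#include <string>

// Hypothetical record type; the times stay as strings in this sketch.
struct LogRecord {
    std::string cust_name;
    std::string time_start;
    std::string time_end;
    std::string targets;  // raw semicolon-separated list, may be empty
};

// Assumes each field is one whitespace-free token. A multi-word
// cust_name, which the question allows, needs a different strategy.
bool parse_with_stream(const std::string& line, LogRecord& out) {
    std::istringstream in(line);
    if (!(in >> out.cust_name >> out.time_start >> out.time_end)) {
        return false;  // fewer than three fields
    }
    out.targets.clear();
    in >> out.targets;  // optional fourth field; stays empty if absent
    return true;
}
```

Constructing a std::istringstream per line has a cost, so if this turns out to be a hot spot it is worth measuring against a hand-written scan.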

Alternatively (I didn't want to delete this part!), if you could write this in Python it would be much simpler. I don't know your situation (it seems you're stuck with C++), but still.

Look at this presentation on doing these kinds of tasks efficiently using Python generator expressions: http://www.dabeaz.com/generators/Generators.pdf

It's a worthwhile read. At slide 31 he deals with something that seems very similar to what you're trying to do.

It'll at least give you some inspiration.
It also demonstrates quite strongly that performance is gained not from the particular string-parsing code, but from the overall algorithm.

answered Mar 5, 2009 at 18:59

hasen

2 Comments

dmckee --- ex-moderator kitten Over a year ago

I think you are conflating a good idea (process one line at a time) with one that depends on the context (don't use C++ for this). Moreover, the OP notes in the comments to another answer that he's doing this in existing C++ code. Nonetheless, +1 for the one-at-a-time point.

hasen Over a year ago

Good point! I changed the answer. In my defense, though, he mentioned the existing C++ app quite a while after I posted my answer.

Pierre · Accepted Answer · 2009-03-05 18:35:08Z

0

You could try a simple lex/yacc or flex/bison grammar to parse this kind of input.

Community · Accepted Answer · 2017-05-23 10:33:16Z

0

The parser you need sounds really simple. Take a look at this. Any compiled language should be able to parse it at very high speed. Then it's an issue of what data structure you build & save.

answered Mar 6, 2009 at 19:31

Mike Dunlavey
