Regex - matching all text between two strings

Question

I'm currently parsing a log file that has the following structure:

1) timestamp, preceded by # character and followed by \n

2) an arbitrary number of events that happened after that timestamp, each followed by \n

3) repeat..

Here is an example:

#100
04!
03!
02!
#1299
0L
0K
0J
0E
#1335
06!
0X#
0[#
b1010 Z$
b1x [$
...

Please forgive the seemingly cryptic values, they are encodings representing certain "events".

Note: Event encodings may also use the # character.

What I am trying to do is to count the number of events that happen at a certain time.

In other words, at time 100, 3 events happened.

I am trying to match all text between two timestamps - and count the number of events by simply counting the number of newlines enclosed in the matched text.

I'm using Python's regex engine, and I'm using the following expression:

pattern = re.compile('(#[0-9]{2,}.*)(?!#[0-9]+)')

Note: The {2,} is because I want timestamps with at least two digits.

The idea is to match a timestamp, then keep matching any characters until the next timestamp is reached, which ends the match.

What this returns is:

#100
#1299
#1335

So, I get the timestamps - but none of the events data - what I really care about!

I'm thinking the reason for this is that the negative lookahead is "greedy" - but I'm not completely sure.

There may be an entirely different regex that makes this much simpler - open to any suggestions!

Any help is much appreciated!

-k

wim · Accepted Answer · 2012-09-17 01:48:17Z

2

I think a regex is not a good tool for the job here. You can just use a loop..

>>> import collections
>>> d = collections.defaultdict(list)
>>> with open('/tmp/spam.txt') as f:
...   t = 'initial'
...   for line in f:
...     if line.startswith('#'):
...       t = line.strip()
...     else:
...       d[t].append(line.strip())
... 
>>> for k, v in d.items():
...   print(k, len(v))
... 
#100 3
#1299 4
#1335 6

edited Sep 17, 2012 at 1:48

answered Sep 17, 2012 at 1:43


5 Comments

Matthew Adams Over a year ago

Your interpreter uses 2 spaces for indents...?

wim Over a year ago

in python you can use 1 space, 3 space, whatever you prefer .. as long as you are consistent. i like 2 spaces personally ..

Matthew Adams Over a year ago

Huh, TIL. I thought the interpreter enforced PEP8.

kbarber Over a year ago

Although these are good alternatives, I'm still curious as to what the problem is with the original regular expression. I'm on the same page as @BrenBarn that I need to include the DOTALL flag, but even after that there are still issues. The way I'll be using the result - I think that the regex will be the easier and most convenient if I'm able to get it working. Any insights?

wim Over a year ago

regex are very handy and powerful when used appropriately. this problem is surely clearer and cleaner with a code snippet though. imagine someone trying to read/modify your code and having to mentally decipher what the regex is doing .. not fun and not pythonic!

Janne Karila · Accepted Answer · 2012-09-21 06:52:12Z

1

If you insist on a regex-based solution, I propose this:

>>> import re
>>> # s is assumed to hold the full text of the log file
>>> pat = re.compile(r'(^#[0-9]{2,})\s*\n((?:[^#].*\n)*)', re.MULTILINE)
>>> for t, e in pat.findall(s):
...     print(t, e.count('\n'))
...
#100 3
#1299 4
#1335 6

Explanation:

(              
  ^            anchor to start of line in multiline mode
  #[0-9]{2,}   # followed by two or more digits
)
\s*            skip whitespace just in case (eg. Windows line separator)
\n             new line
(
  (?:          repeat non-capturing group inside capturing group to capture 
               all repetitions
    [^#].*\n   line not starting with #
  )*
)
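As a side note, the breakdown above can be kept next to the pattern itself by using re.VERBOSE. This is only a sketch: the sample string is the data from the question with the trailing "..." line left out, and the literal # is written as \# because a bare # starts a comment in verbose mode (inside a character class it needs no escaping).

```python
import re

# Same pattern as above, written with re.VERBOSE so the explanation
# lives inside the pattern.
pat = re.compile(r"""
    (^\#[0-9]{2,})      # timestamp: '#' at start of line, then 2+ digits
    \s*\n               # optional trailing whitespace, then the newline
    ((?:[^#].*\n)*)     # every following line that does not start with '#'
    """, re.MULTILINE | re.VERBOSE)

# Sample data from the question (without the trailing "..." line).
s = ("#100\n04!\n03!\n02!\n"
     "#1299\n0L\n0K\n0J\n0E\n"
     "#1335\n06!\n0X#\n0[#\nb1010 Z$\nb1x [$\n")

# Each match yields (timestamp, block of event lines); count the newlines.
counts = [(t, e.count("\n")) for t, e in pat.findall(s)]
print(counts)
```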

You seem to have misunderstood what negative lookahead does. When it follows .*, the regex engine first lets .* consume as many characters as possible and only then checks the lookahead. A negative lookahead succeeds wherever its pattern does not match at the current position. If it fails (that is, a timestamp does follow), the engine backtracks one character at a time until it reaches a position where the assertion holds.
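A small experiment may make this concrete. It is a sketch on a shortened version of the question's data, using the original pattern from the question with and without re.DOTALL:

```python
import re

s = "#100\n04!\n03!\n#1299\n0L\n"
original = r'(#[0-9]{2,}.*)(?!#[0-9]+)'

# Without DOTALL, '.' stops at the end of the line. What follows there is a
# newline, not '#<digits>', so the negative lookahead succeeds at once and
# only the timestamp line itself is matched.
no_dotall = re.findall(original, s)
print(no_dotall)

# With DOTALL, the greedy '.*' runs to the end of the string. The lookahead
# succeeds there too, so one match swallows all of the text.
with_dotall = re.findall(original, s, re.DOTALL)
print(with_dotall)
```

In neither case does the lookahead ever stop the match at the next timestamp, which is why the original approach cannot work as intended.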

You could, however, use positive lookahead together with the non-greedy .*?. Here the .*? will consume characters until the lookahead sees either a # at start of a line, or the end of the whole string:

re.compile(r'(^#[0-9]{2,})\s*\n(.*?)(?=^#|\Z)', re.DOTALL | re.MULTILINE)
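For illustration, here is how that pattern might be used to count events. The sample string is the question's data without the trailing "..." line. One caveat, stated as an assumption about the data: the lookahead ^# stops at any line beginning with #, so if an event encoding could start with #, something stricter such as (?=^#[0-9]{2,}|\Z) would probably be needed.

```python
import re

pat = re.compile(r'(^#[0-9]{2,})\s*\n(.*?)(?=^#|\Z)', re.DOTALL | re.MULTILINE)

# Sample data from the question (without the trailing "..." line).
s = ("#100\n04!\n03!\n02!\n"
     "#1299\n0L\n0K\n0J\n0E\n"
     "#1335\n06!\n0X#\n0[#\nb1010 Z$\nb1x [$\n")

# The lazy group holds the event lines; each event ends with one newline.
counts = {t: body.count("\n") for t, body in pat.findall(s)}
print(counts)
```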

edited Sep 21, 2012 at 6:52

answered Sep 20, 2012 at 12:11


2 Comments

kbarber Over a year ago

Could you possibly explain further why this version works as opposed to others? Thanks!

Janne Karila Over a year ago

@kbarber Added some explanation.

BrenBarn · Accepted Answer · 2012-09-17 01:23:32Z

1

The reason is that the dot doesn't match newlines, so your expression will only match the lines containing the timestamp; the match won't go across multiple lines. You could pass the "dotall" flag to re.compile so that your expression will match across multiple lines. Since you say the "event encodings" might also contain a # character, you might also want to use the multiline flag and anchor your match with ^ at the beginning so it only matches the # at the beginning of a line.
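A quick way to see the effect of both flags is sketched below. The data is hypothetical: the event 0X#77 is made up to show a # followed by digits inside an event line.

```python
import re

# Hypothetical data: the event "0X#77" contains '#' followed by digits.
s = "#100\n0X#77\n04!\n#1299\n0L\n"

# By default '.' does not match '\n', so '.*' stops at the end of line one.
first_line = re.match(r'.*', s).group()

# With re.DOTALL, '.' also matches '\n' and '.*' covers the whole string.
everything = re.match(r'.*', s, re.DOTALL).group()

# Unanchored, the timestamp pattern also hits the '#77' inside an event.
unanchored = re.findall(r'#[0-9]{2,}', s)

# With re.MULTILINE, '^' matches at the start of every line, so only
# real timestamps are found.
anchored = re.findall(r'^#[0-9]{2,}', s, re.MULTILINE)

print(repr(first_line), everything == s)
print(unanchored)
print(anchored)
```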

answered Sep 17, 2012 at 1:23


3 Comments

kbarber Over a year ago

Is this what you had in mind? pattern = re.compile('^(#[0-9]{2,}.*)(?!#[0-9]+)', re.DOTALL | re.MULTILINE) -- this version matches ALL of the text. Let me know if I am interpreting something wrong.

Janne Karila Over a year ago

@kbarber The negative lookahead matches anywhere except just before #[0-9]+. It also matches at the end of string. Therefore the greedy .* is able to match all the text.

kbarber Over a year ago

In other words - you're saying that .* gobbles up the next timestamp before the negative-lookahead will recognize it?

Matthew Adams · Accepted Answer · 2012-09-17 01:36:49Z

1

You could just loop through the data line by line and have a dictionary that just stores the number of events associated with each timestamp; no regex required. For example:

with open('exampleData') as example:
    eventCountsDict = {}
    currEvent = None
    for line in example:
        if line[0] == '#': # replace this line with more specific timestamp details if event encodings can start with a '#'
            eventCountsDict[line] = 0
            currEvent = line
        else:
            eventCountsDict[currEvent] += 1

print(eventCountsDict)

That code prints {'#100\n': 3, '#1299\n': 4, '#1335\n': 5} for your example data (not counting the ...).

edited Sep 17, 2012 at 1:36

answered Sep 17, 2012 at 1:24


4 Comments

wim Over a year ago

nice, but collections.defaultdict(int) or even collections.Counter is a more elegant choice here

Matthew Adams Over a year ago

@wim ah collections and itertools - I always forget to use you until after I've gotten some code working already...

kbarber Over a year ago

Thanks, this is a nice alternative. I forgot to mention that the file also has header data at the beginning - that this won't account for. What would be the most elegant solution to not begin adding to the dict until a valid timestamp has been found? Would you use a flag? - that's what comes to mind first, but there may be a cleaner way.

Matthew Adams Over a year ago

@kbarber I'd just ignore lines until I hit the first valid timestamp as I loop through the lines. Not sure there's a more elegant way (unless the header is the same number of bytes every time).
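One possible way to do that without a separate flag is sketched below, also using collections.Counter as suggested in the comments above. The function name count_events, the timestamp regex, and the header lines in the sample are all assumptions made for illustration; the real header format is not shown in the question.

```python
import re
from collections import Counter

# Assumed timestamp format: '#' followed by two or more digits, nothing else.
TIMESTAMP = re.compile(r'#[0-9]{2,}$')

def count_events(lines):
    """Count event lines per timestamp, ignoring anything before the first timestamp."""
    counts = Counter()
    current = None                     # stays None until a timestamp is seen
    for line in lines:
        line = line.rstrip('\r\n')
        if TIMESTAMP.match(line):
            current = line
            counts[current] += 0       # register timestamps that have no events
        elif current is not None:      # header lines are skipped here
            counts[current] += 1
    return counts

# Hypothetical input: two header lines in front of part of the question's data.
sample = ["header line\n", "# another header\n",
          "#100\n", "04!\n", "03!\n", "02!\n",
          "#1299\n", "0L\n"]
result = count_events(sample)
print(result)
```

Here current being None doubles as the "no timestamp yet" marker. Since the function only iterates over lines, an open file object could be passed in place of the list.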
