
I'm currently parsing a log file that has the following structure:

1) timestamp, preceded by # character and followed by \n

2) an arbitrary number of events that happened after that timestamp, each followed by \n

3) repeat..

Here is an example:

#100
04!
03!
02!
#1299
0L
0K
0J
0E
#1335
06!
0X#
0[#
b1010 Z$
b1x [$
...

Please forgive the seemingly cryptic values, they are encodings representing certain "events".

Note: Event encodings may also use the # character.

What I am trying to do is to count the number of events that happen at a certain time.

In other words, at time 100, 3 events happened.

I am trying to match all text between two timestamps - and count the number of events by simply counting the number of newlines enclosed in the matched text.

I'm using Python's regex engine, and I'm using the following expression:

pattern = re.compile('(#[0-9]{2,}.*)(?!#[0-9]+)')

Note: The {2,} is because I want timestamps with at least two digits.

I match a timestamp, continue matching any other characters until hitting another timestamp - ending the matching.

What this returns is:

#100
#1299
#1335

So, I get the timestamps - but none of the events data - what I really care about!

I'm thinking the reason for this is that the negative lookahead is "greedy" - but I'm not completely sure.

There may be an entirely different regex that makes this much simpler - open to any suggestions!

Any help is much appreciated!

-k

4 Answers


I think a regex is not a good tool for the job here. You can just use a loop:

>>> import collections
>>> d = collections.defaultdict(list)
>>> with open('/tmp/spam.txt') as f:
...   t = 'initial'
...   for line in f:
...     if line.startswith('#'):
...       t = line.strip()
...     else:
...       d[t].append(line.strip())
... 
>>> for k,v in d.iteritems():
...   print k, len(v)
... 
#1299 4
#100 3
#1335 6

5 Comments

Your interpreter uses 2 spaces for indents...?
in python you can use 1 space, 3 space, whatever you prefer .. as long as you are consistent. i like 2 spaces personally ..
Huh, TIL. I thought the interpreter enforced PEP8.
Although these are good alternatives, I'm still curious as to what the problem is with the original regular expression. I'm on the same page as @BrenBarn that I need to include the DOTALL flag, but even after that there are still issues. The way I'll be using the result - I think that the regex will be the easier and most convenient if I'm able to get it working. Any insights?
regex are very handy and powerful when used appropriately. this problem is surely clearer and cleaner with a code snippet though. imagine someone trying to read/modify your code and having to mentally decipher what the regex is doing .. not fun and not pythonic!

If you insist on a regex-based solution, I propose this:

>>> pat = re.compile(r'(^#[0-9]{2,})\s*\n((?:[^#].*\n)*)', re.MULTILINE)
>>> for t, e in pat.findall(s):
...     print t, e.count('\n')
...
#100 3
#1299 4
#1335 6

Explanation:

(              
  ^            anchor to start of line in multiline mode
  #[0-9]{2,}   line starting with # followed by two or more digits
)
\s*            skip whitespace just in case (eg. Windows line separator)
\n             new line
(
  (?:          repeat non-capturing group inside capturing group to capture 
               all repetitions
    [^#].*\n   line not starting with #
  )*
)

You seem to have misunderstood what negative lookahead does. When it follows .*, the regex engine first lets .* consume as many characters as possible and only then checks the lookahead. If the assertion fails at that position, the engine backtracks one character at a time until it succeeds.
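A minimal sketch of that behaviour, in Python 3 syntax. The single-line string and the second toy pattern are made up for illustration and are not from the question's log:

```python
import re

# Hypothetical one-line input with two timestamp-like tokens.
line = "#100 x #200 y"

# Greedy .* runs to the end of the line first. Only then is the negative
# lookahead checked, and at the end of the string nothing follows, so it
# succeeds and the match covers everything.
greedy = re.search(r'#[0-9]{2,}.*(?!#[0-9]+)', line).group()
print(greedy)

# Toy pattern where the lookahead fails at the end of the string:
# .* takes 'abc', (?!$) fails there, so the engine gives back one character.
backtracked = re.match(r'.*(?!$)', 'abc').group()
print(backtracked)
```

In neither case does the lookahead stop .* early. It is only consulted after .* has already taken what it can.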

You could, however, use positive lookahead together with the non-greedy .*?. Here the .*? will consume characters until the lookahead sees either a # at start of a line, or the end of the whole string:

re.compile(r'(^#[0-9]{2,})\s*\n(.*?)(?=^#|\Z)', re.DOTALL | re.MULTILINE)
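As a rough check, here is that pattern applied to the example data from the question (without the trailing "..."), written in Python 3 syntax. The variable names are only for illustration:

```python
import re

# Example data from the question, minus the "..." line.
sample = (
    "#100\n04!\n03!\n02!\n"
    "#1299\n0L\n0K\n0J\n0E\n"
    "#1335\n06!\n0X#\n0[#\nb1010 Z$\nb1x [$\n"
)

pattern = re.compile(r'(^#[0-9]{2,})\s*\n(.*?)(?=^#|\Z)',
                     re.DOTALL | re.MULTILINE)

# Each match yields (timestamp, block of event lines); every event line
# ends with a newline, so counting newlines counts events.
counts = [(ts, events.count('\n')) for ts, events in pattern.findall(sample)]
print(counts)  # [('#100', 3), ('#1299', 4), ('#1335', 5)]
```

Note that the lookahead stops at any line beginning with #, so an event encoding that itself starts with # would end the block early. Tightening the lookahead to ^#[0-9]{2,} is one possible way around that, assuming no event looks like a timestamp.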

2 Comments

Could you possibly explain further why this version works as opposed to others? Thanks!
@kbarber Added some explanation.

The reason is that the dot doesn't match newlines, so your expression will only match the lines containing the timestamp; the match won't go across multiple lines. You could pass the "dotall" flag to re.compile so that your expression will match across multiple lines. Since you say the "event encodings" might also contain a # character, you might also want to use the multiline flag and anchor your match with ^ at the beginning so it only matches the # at the beginning of a line.
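A small sketch of the difference, using the pattern from the question on its example data (Python 3 syntax, variable names chosen for illustration):

```python
import re

# Example data from the question, minus the "..." line.
sample = (
    "#100\n04!\n03!\n02!\n"
    "#1299\n0L\n0K\n0J\n0E\n"
    "#1335\n06!\n0X#\n0[#\nb1010 Z$\nb1x [$\n"
)

original = r'(#[0-9]{2,}.*)(?!#[0-9]+)'

# Without DOTALL the dot stops at each newline: only the timestamp lines match.
without_flag = re.findall(original, sample)
print(without_flag)  # ['#100', '#1299', '#1335']

# With DOTALL the dot also matches newlines, so the greedy .* runs to the
# end of the input and a single match swallows the whole text.
with_dotall = re.findall(original, sample, re.DOTALL)
print(len(with_dotall))  # 1
```

So the flag alone fixes the "only timestamps" symptom but not the pattern: the greedy .* still has to be stopped at the next timestamp somehow.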

3 Comments

Is this what you had in mind? pattern = re.compile('^(#[0-9]{2,}.*)(?!#[0-9]+)', re.DOTALL | re.MULTILINE) -- this version matches ALL of the text. Let me know if I am interpreting something wrong.
@kbarber The negative lookahead matches anywhere except just before #[0-9]+. It also matches at the end of string. Therefore the greedy .* is able to match all the text.
In other words - you're saying that .* gobbles up the next timestamp before the negative-lookahead will recognize it?

You could just loop through the data line by line and have a dictionary that just stores the number of events associated with each timestamp; no regex required. For example:

with open('exampleData') as example:
    eventCountsDict = {}
    currEvent = None
    for line in example:
        if line[0] == '#': # replace this line with more specific timestamp details if event encodings can start with a '#'
            eventCountsDict[line] = 0
            currEvent = line
        else:
            eventCountsDict[currEvent] += 1

print eventCountsDict

That code prints {'#1299\n': 4, '#1335\n': 5, '#100\n': 3} for your example data (not counting the ...).

4 Comments

nice, but collections.defaultdict(int) or even collections.Counter is a more elegant choice here
@wim ah collections and itertools - I always forget to use you until after I've gotten some code working already...
Thanks, this is a nice alternative. I forgot to mention that the file also has header data at the beginning - that this won't account for. What would be the most elegant solution to not begin adding to the dict until a valid timestamp has been found? Would you use a flag? - that's what comes to mind first, but there may be a cleaner way.
@kbarber I'd just ignore lines until I hit the first valid timestamp as I loop through the lines. Not sure there's a more elegant way (unless the header is the same number of bytes every time).
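A possible sketch combining the suggestions in these comments: a collections.Counter for the counts, and skipping everything until the first valid timestamp. It is written in Python 3 syntax. The function name count_events, the TIMESTAMP pattern, and the two header lines in the sample are assumptions for illustration, since the real header format isn't shown in the question. Skipping blank lines is also an assumption.

```python
import re
from collections import Counter

# Assumed timestamp format: '#' followed by two or more digits, nothing else.
TIMESTAMP = re.compile(r'#[0-9]{2,}')

def count_events(lines):
    """Count event lines per timestamp, ignoring anything before the first one."""
    counts = Counter()
    current = None  # stays None while we are still in the header
    for raw in lines:
        line = raw.rstrip('\r\n')
        if TIMESTAMP.fullmatch(line):
            current = line
            counts[current] += 0  # register timestamps that have no events
        elif current is not None and line:
            counts[current] += 1
    return counts

# Hypothetical header lines followed by the example data from the question.
sample_lines = [
    "some header line\n", "#not a timestamp\n",
    "#100\n", "04!\n", "03!\n", "02!\n",
    "#1299\n", "0L\n", "0K\n", "0J\n", "0E\n",
    "#1335\n", "06!\n", "0X#\n", "0[#\n", "b1010 Z$\n", "b1x [$\n",
]

counts = count_events(sample_lines)
print(dict(counts))  # {'#100': 3, '#1299': 4, '#1335': 5}
```

Using None as the "no timestamp seen yet" marker avoids a separate flag. With a real file, passing the open file object to count_events should work the same way, since it only iterates over lines.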
