Using regular expressions in Python

Question

I have a couple of huge log files that contain a list of activity names and sub-activities, with a numerical value associated with each sub-activity. I need to write a script to automate the data analysis. I used a regex to match my main activity by doing a word-by-word search. Now I have to find the sub-activity and get the numerical value associated with it.

For example, given "Out: Packet Sizes Histogram Bucket 5=10", I need to check for the sub-activity Out: Packet Sizes and get the Histogram Bucket value 5=10. There is a list of sub-activities like this. With my word-search technique I find it hard to match the sub-activity. What regex pattern should I use to get the 5=10 value when the pattern matches the entire text before it?

PS: All the sub-activities have the text "Histogram Bucket" repeated. I would greatly appreciate your suggestions on this issue. I have just started learning regex and Python.

Have you looked at capture groups in regexp?

– Barmar, Commented Dec 9, 2014 at 20:20
No, I haven't. Will take a look at it now

– SRS, Commented Dec 9, 2014 at 20:37

PeterE · Accepted Answer · 2014-12-09 22:06:43Z

(1) If you want to use one regular expression you could use:

import re

known_activities = ['Out: Packet Sizes']
# You might have to use r'\s' or r'\ ' to protect the whitespace.
# A raw string keeps '\s' from being read as an invalid string escape.
activity_exprs = [a.replace(' ', r'\s') for a in known_activities]

regexpr = r'(' + '|'.join(activity_exprs) + r')\s*Histogram\sBucket\s(\d+=\d+)'
pattern = re.compile(regexpr)

# input is assumed to hold one line of the log file.
match = pattern.match(input)
if match:
  print('Activity: ' + match.group(1))
  print('Bucket:   ' + match.group(2))

(2) If you don't want (or need) to match the activities, you could also simply go with:

regexpr = r'(.*?)\s*Histogram\sBucket\s(\d+=\d+)'
pattern = re.compile(regexpr)

match = pattern.match(input)
if match:
  print('Activity: '+match.group(1))
  print('Bucket:   '+match.group(2))

(3) If you do want to match activities you can always do so in a separate step:

if match:
  activity = match.group(1)
  if activity in known_activities:
    print('Activity: ' + activity)
    print('Bucket:   ' + match.group(2))

EDIT Some more details and explanations:

items = ['a','b','c']
'|'.join(items)

produces a|b|c. In regular expressions, | denotes alternatives, e.g. r'a(b|c)a' will match either 'aba' or 'aca'. So in (1) I basically chained all known activities together as alternatives. Each activity has to be a valid regular expression in itself (that is why any 'special' characters, e.g. whitespace, should be properly escaped). One could simply mash all the alternatives together by hand into one large regular expression, but that gets unwieldy and error-prone fast if there are more than a couple of activities.
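A minimal sketch of the escaping point, assuming you let the standard library's re.escape() do the escaping for each name before joining. The helper name build_activity_pattern and the second activity name are made up for illustration; they do not come from the question.

```python
import re

def build_activity_pattern(activities):
    """Join activity names into one alternation, escaping each name first.

    Hypothetical helper, shown only to illustrate the join-plus-escape idea.
    """
    alternatives = '|'.join(re.escape(a) for a in activities)
    return re.compile(r'(' + alternatives + r')\s*Histogram\sBucket\s(\d+=\d+)')

# 'In: Packet Sizes (bytes)' is an invented name containing regex metacharacters.
known_activities = ['Out: Packet Sizes', 'In: Packet Sizes (bytes)']
pattern = build_activity_pattern(known_activities)

for text in ['Out: Packet Sizes Histogram Bucket 5=10',
             'In: Packet Sizes (bytes) Histogram Bucket 2=7']:
    match = pattern.match(text)
    if match:
        print(match.group(1), '->', match.group(2))
```

Without the escaping, the parentheses in the second name would be read as a capture group and that name would never match literally.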

All in all you are probably better off using (2) and, if necessary, (3) or a separate regular expression as a secondary stage.
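Put together, (2) followed by (3) might look roughly like the sketch below, which also groups the values by sub-activity. The function name collect_buckets, the split of the bucket into index and value, and the third sample line are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Generic pattern from (2), with the bucket split into index and value.
bucket_pattern = re.compile(r'(.*?)\s*Histogram\sBucket\s(\d+)=(\d+)')

known_activities = ['Out: Packet Sizes']  # extend with your own sub-activities

def collect_buckets(lines):
    """Return {sub_activity: {bucket_index: value}} for known sub-activities."""
    results = defaultdict(dict)
    for line in lines:
        match = bucket_pattern.match(line)
        if not match:
            continue
        prefix = match.group(1)
        # Second stage, as in (3): keep the line only if the text before
        # 'Histogram Bucket' ends with a known sub-activity.
        for activity in known_activities:
            if prefix.endswith(activity):
                results[activity][int(match.group(2))] = int(match.group(3))
                break
    return dict(results)

sample_lines = [
    'Out: Packet Sizes Histogram Bucket 5=10',
    'Out: Packet Sizes Histogram Bucket 6=4',
    'Some Other Counter Histogram Bucket 1=2',  # invented, not a known activity
]
print(collect_buckets(sample_lines))
```

Using endswith() rather than an exact comparison means the check still works when the captured prefix also contains a timestamp or process name.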

EDIT2 Regarding your sample line, you could also use:

regexpr = r'([^\s]*?)\s([^\s]*?)\s([^\s]*?)\s(.*?)\s*Histogram\sBucket\s(\d+=\d+)'
pattern = re.compile(regexpr)

match = pattern.match(input)
if match:
  print('Date:     '+match.group(1))
  print('Time:     '+match.group(2))
  print('Activity: '+match.group(3))
  print('Sub:      '+match.group(4))
  print('Bucket:   '+match.group(5))
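The same idea can be written with named groups, which avoids counting group indices. This is only a sketch: the group names are chosen for illustration, \S is used as shorthand for [^\s], and it assumes (like the pattern above) that the date, time and main activity contain no whitespace.

```python
import re

# Same structure as the EDIT2 pattern, with named groups and the bucket
# split into index and value.
line_pattern = re.compile(
    r'(?P<date>\S+)\s(?P<time>\S+)\s(?P<activity>\S+)\s'
    r'(?P<sub>.*?)\s*Histogram\sBucket\s(?P<bucket>\d+)=(?P<value>\d+)'
)

# Sample line as posted in the comments on this answer.
line = '09/12/14 17:13:29 Process_Name Out: Packet Sizes Histogram Bucket 6=4'

match = line_pattern.match(line)
if match:
    fields = match.groupdict()
    print(fields['sub'], fields['bucket'], fields['value'])
```

With the timestamp and main activity in their own groups, printing only the sub-activity and bucket is a matter of picking the fields you need.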

EDIT3 pattern.match(input) expects to find the pattern directly at the beginning of the input string. That means 'a' will match 'a' or 'abc' but not 'ba'. If your pattern does not start at the beginning, you have to prepend '.*?' to your regular expression to consume as many arbitrary characters as necessary.

'\s' matches any whitespace character, '[^\s]' matches any character that is NOT whitespace.
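A short sketch of the difference, using the sample line from the comments. It also shows pattern.search(), a standard method of compiled patterns that looks for the pattern anywhere in the string, as an alternative to prepending '.*?'.

```python
import re

pattern = re.compile(r'Histogram\sBucket\s(\d+=\d+)')
line = '09/12/14 17:13:29 Process_Name Out: Packet Sizes Histogram Bucket 6=4'

# match() only succeeds if the pattern starts at position 0 of the string.
print(pattern.match(line))            # None

# search() scans forward and returns the first place where the pattern fits.
found = pattern.search(line)
print(found.group(1))                 # 6=4

# Prepending '.*?' lets match() skip the leading text instead.
prefixed = re.compile(r'.*?Histogram\sBucket\s(\d+=\d+)')
print(prefixed.match(line).group(1))  # 6=4
```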

If you want to learn more about regular expressions, the Python HOWTO on the subject is quite good.

edited Dec 9, 2014 at 22:06

answered Dec 9, 2014 at 20:20


6 Comments

SRS Over a year ago

I tried this but I am not getting any result. I fed each line from my file as the input for pattern.match(input). Is there something else I should be doing before this?

PeterE Over a year ago

Could you add a sample line to your post? As it stands, I have nothing to test my code against.

SRS Over a year ago

My bad. I rectified the issue. Now I am getting the output. It is printing the entire line if I use your second script. I just have to tweak it a bit. Thanks a lot @Peter :)

PeterE Over a year ago

The indices for the groups were off by one. I corrected my post.

SRS Over a year ago

09/12/14 17:13:29 Process_Name Out: Packet Sizes Histogram Bucket 6=4 This is what the lines in my file look like. When I use the (.*?) pattern it prints the whole line, while the ['Out:..'] pattern doesn't get me a match. Could you please tell me what the ['+'|'.] before join in the first script does? I would like to display the results by sub-activity. The script is working well; I just need to strip off the timestamp and main activity. Refer to the example line above.
