I have a couple of huge log files that contain a list of activity names and sub-activities, with a numerical value associated with each sub-activity. I need to write a script to automate the data analysis. I used a regex to match my main activity by doing a word-by-word search. Now I have to find the sub-activity and get the numerical value associated with it.

For example, given "Out: Packet Sizes Histogram Bucket 5=10", I need to check for the sub-activity Out: Packet Sizes and get the Histogram Bucket value 5=10. There is a list of sub-activities like this. With my word-search technique I find it hard to match the sub-activity. What regex pattern should I use to get the 5=10 value when the pattern matches the entire text before it?

PS: All the sub-activities have the text "Histogram Bucket" repeated. I would greatly appreciate your suggestions. I have just started learning regex and Python.

  • Have you looked at capture groups in regexp? Commented Dec 9, 2014 at 20:20
  • No, I haven't. Will take a look at it now Commented Dec 9, 2014 at 20:37

1 Answer

(1) If you want to use one regular expression you could use:

import re

known_activities = ['Out: Packet Sizes']
# re.escape() escapes any regex metacharacters in the activity names.
activity_exprs = [re.escape(a) for a in known_activities]

regexpr = r'(' + '|'.join(activity_exprs) + r')\s*Histogram\sBucket\s(\d+=\d+)'
pattern = re.compile(regexpr)

line = 'Out: Packet Sizes Histogram Bucket 5=10'  # one line from the log file
match = pattern.match(line)
if match:
    print('Activity: ' + match.group(1))
    print('Bucket:   ' + match.group(2))

(2) If you don't want (or need) to match the activities, you could also simply go with:

regexpr = r'(.*?)\s*Histogram\sBucket\s(\d+=\d+)'
pattern = re.compile(regexpr)

match = pattern.match(line)  # line: one line read from the log file
if match:
    print('Activity: ' + match.group(1))
    print('Bucket:   ' + match.group(2))

(3) If you do want to match activities you can always do so in a separate step:

if match:
    activity = match.group(1)
    if activity in known_activities:
        print('Activity: ' + activity)
        print('Bucket:   ' + match.group(2))

EDIT Some more details and explanations:

items = ['a','b','c']
'|'.join(items)

produces a|b|c. In a regular expression, | denotes alternatives, e.g. r'a(b|c)a' will match either 'aba' or 'aca'. So in (1) I basically chained all known activities together as alternatives. Each activity has to be a valid regular expression in itself, which is why any special characters should be properly escaped (re.escape() does that for you). One could simply mash all the alternatives together by hand into one large regular expression, but that gets unwieldy and error-prone fast if there are more than a couple of activities.
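As a rough sketch of how the joined alternation behaves with more than one activity: the second name, 'In: Packet Sizes', and the helper parse() are made up for illustration and do not come from the log in the question.

```python
import re

# 'In: Packet Sizes' is a hypothetical second activity, added only for illustration.
known_activities = ['Out: Packet Sizes', 'In: Packet Sizes']

# re.escape() makes each name safe to embed in a larger expression.
alternatives = '|'.join(re.escape(a) for a in known_activities)
pattern = re.compile(r'(' + alternatives + r')\s*Histogram\sBucket\s(\d+=\d+)')


def parse(line):
    """Return (activity, bucket), or None if the line does not match."""
    m = pattern.match(line)
    return (m.group(1), m.group(2)) if m else None


print(parse('Out: Packet Sizes Histogram Bucket 5=10'))
print(parse('In: Packet Sizes Histogram Bucket 2=7'))
print(parse('Unknown Thing Histogram Bucket 1=1'))
```

Only lines that start with one of the listed names produce a result; anything else yields None.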

All in all, you are probably better off using (2) and, if necessary, (3) or a separate regular expression as a secondary stage.
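The two-stage approach, (2) followed by (3), could be sketched roughly as follows. The sample lines are made up, and splitting the bucket into two groups, (\d+)=(\d+), is my own variation on the pattern above, on the assumption that 5=10 means "bucket 5 has value 10".

```python
import re
from collections import defaultdict

# Stage 1: generic pattern as in (2); stage 2: membership test as in (3).
# Assumption: 'N=V' means bucket number N with value V, so it is split into two groups.
pattern = re.compile(r'(.*?)\s*Histogram\sBucket\s(\d+)=(\d+)')
known_activities = {'Out: Packet Sizes'}

# Made-up sample lines; real input would come from iterating over the log file.
lines = [
    'Out: Packet Sizes Histogram Bucket 5=10',
    'Out: Packet Sizes Histogram Bucket 6=4',
    'Something Else Histogram Bucket 1=2',
    'a line without any bucket',
]

buckets = defaultdict(dict)  # activity -> {bucket number: value}
for line in lines:
    match = pattern.match(line)
    if match and match.group(1) in known_activities:
        buckets[match.group(1)][int(match.group(2))] = int(match.group(3))

print(dict(buckets))
```

Collecting the values into a dictionary keyed by activity makes it easy to report the results per sub-activity afterwards.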

EDIT2 Regarding your sample line, you could also use:

regexpr = r'([^\s]*?)\s([^\s]*?)\s([^\s]*?)\s(.*?)\s*Histogram\sBucket\s(\d+=\d+)'
pattern = re.compile(regexpr)

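# Sample line taken from the comments below.
line = '09/12/14 17:13:29 Process_Name Out: Packet Sizes Histogram Bucket 6=4'
match = pattern.match(line)
if match:
    print('Date:     ' + match.group(1))
    print('Time:     ' + match.group(2))
    print('Activity: ' + match.group(3))
    print('Sub:      ' + match.group(4))
    print('Bucket:   ' + match.group(5))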

EDIT3 pattern.match() expects to find the pattern directly at the beginning of the input string. That means 'a' will match 'a' or 'abc' but not 'ba'. If your pattern does not start at the beginning, you have to prepend '.*?' to your regular expression to consume as many arbitrary characters as necessary, or use pattern.search(), which scans the whole string for the first match.
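A minimal sketch of the difference, using the sample line from the comments; the variable names are only illustrative.

```python
import re

pattern = re.compile(r'Histogram\sBucket\s(\d+=\d+)')
line = '09/12/14 17:13:29 Process_Name Out: Packet Sizes Histogram Bucket 6=4'

# match() only tries the pattern at position 0, so it fails on this line.
anchored = pattern.match(line)

# search() tries every position and finds the first occurrence.
scanned = pattern.search(line)

# Prepending '.*?' lets match() skip the leading timestamp and process name.
prefixed = re.compile(r'.*?' + pattern.pattern).match(line)

print(anchored)
print(scanned.group(1))
print(prefixed.group(1))
```

The '.*?' prefix contains no capture group, so the bucket is still group 1 in the prefixed pattern.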

'\s' matches any whitespace character, '[^\s]' matches any character that is NOT whitespace.

If you want to learn more about regular expressions, the Python HOWTO on the subject is quite good.


6 Comments

I tried this but I am not getting any result. I fed each line from my file as the input for pattern.match(input). Is there something else I should be doing before this?
Could you add a sample (line) to your post? As it stands I have nothing to test my code against.
My bad. I rectified the issue. Now I am getting the output. It is printing the entire line if I use your second script. I just have to tweak it a bit. Thanks a lot @Peter :)
The indices for the groups were off by one. I corrected my post.
09/12/14 17:13:29 Process_Name Out: Packet Sizes Histogram Bucket 6=4 This is what the lines in my file look like. When I use the (.*?) pattern it prints the whole line, while the ['Out:..'] pattern doesn't get me a match. Could you please tell me what the '+'|'. before join in the first script does? I would like to display the results by sub-activity. The script is working well; I just need to strip off the timestamp and main activity. Refer to the example line above.