
I am trying to clean up a log, and I want to remove some special strings from it.

Example:

%/h >  %/h Current value over threshold value
Pg/S >  Pg/S Current value over threshold value
Pg/S >  Pg/S  No. of pages paged in exceeds threshold
MB <  MB   min. avg. value over threshold value

I have tried a few patterns, but none of them seem to work:

re.sub(r'\w\w\/\s>\s\w','',text)

Is there a good way to remove this pattern?

I want to remove the leading .../... > .../... part of each line.

I expect the output to contain only the meaningful text:

   Current value over threshold value
   No. of pages paged in exceeds threshold
   min. avg. value over threshold value

Thank you for any idea!

  • Is the content before and after the > always the same? Matching ^([^\s>]*)\s+>\s+\1 would be my idea then. Commented Nov 10, 2016 at 1:41
  • Is it always going to be spaced that way? In other words, is the string of interest always going to be after the third space? Commented Nov 10, 2016 at 1:41
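
The backreference idea from the first comment could be sketched roughly as follows. This is only an illustration, not the commenter's exact pattern: it assumes the token before and after the sign is always identical, it uses [<>] so that the MB < MB line is covered too, and it uses [ \t] instead of \s so that a match cannot run onto the next line. The last line of the sample log is made up to show what happens when the two tokens differ.

```python
import re

# Hypothetical variant of the pattern suggested in the comment:
# \1 requires the token after the sign to repeat the token before it.
PREFIX = re.compile(r'^([^\s<>]+)[ \t]+[<>][ \t]+\1[ \t]+', re.M)

log = """%/h >  %/h Current value over threshold value
Pg/S >  Pg/S  No. of pages paged in exceeds threshold
MB <  MB   min. avg. value over threshold value
kB >  MB made-up line where the two units differ"""

# Lines whose two tokens do not match are left untouched.
cleaned = PREFIX.sub('', log)
print(cleaned)
```

Because of the backreference, a line is only stripped when both tokens agree, which makes this stricter than a pattern that simply skips the first three fields.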

3 Answers

3

Assuming the structure of the file is:

[special-string] [< or >] [special-string] [message]

then this should work:

>>> import re
>>> rgx = re.compile(r'^[^<>]+[<>] +\S+ +', re.M)
>>>
>>> s = """
... %/h >  %/h Current value over threshold value
... Pg/S >  Pg/S Current value over threshold value
... Pg/S >  Pg/S  No. of pages paged in exceeds threshold
... MB <  MB   min. avg. value over threshold value
... """
>>>
>>> print(rgx.sub('', s))
Current value over threshold value
Current value over threshold value
No. of pages paged in exceeds threshold
min. avg. value over threshold value

3 Comments

May I ask why you use ^ at the beginning? Does it indicate the position where the pattern starts?
@zihanmeng. Yes - it means "match the beginning of a line". That is also why the re.M flag is needed (i.e. multi-line matching).
I got it! Thank you!
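
To make the role of ^ and re.M concrete, here is a small sketch that applies the pattern from the answer to two of the sample lines, once without the flag and once with it. The variable names are just for illustration.

```python
import re

pattern = r'^[^<>]+[<>] +\S+ +'
text = ("MB <  MB   min. avg. value over threshold value\n"
        "Pg/S >  Pg/S Current value over threshold value")

# Without re.M, ^ matches only at the very start of the string,
# so only the first line loses its prefix.
without_flag = re.sub(pattern, '', text)

# With re.M, ^ also matches right after every newline,
# so every line loses its prefix.
with_flag = re.sub(pattern, '', text, flags=re.M)

print(without_flag)
print('---')
print(with_flag)
```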
3

Based on the pattern you are trying to match, it seems like you always know where the string of interest is positioned. You can actually do this without regex by using split and slicing to get the section of interest, and then join to turn the result back into a string.

The code below does the following:

s.split() - splits on runs of whitespace, creating a list where each word is an entry

[3:] - slices the list, taking everything from the fourth element onward (index 3, since indexing starts at 0)

' '.join() - converts the list back to a string, placing a single space between the elements

Demo:

s = "%/h >  %/h Current value over threshold value"
res = ' '.join(s.split()[3:])
print(res)

Output:

Current value over threshold value
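
For a whole log rather than a single line, the same idea could be applied line by line. The sketch below is a slight variation, not the exact code above: it uses split(None, 3), which stops after three splits and so leaves the spacing inside the message untouched, whereas split() followed by join collapses every run of whitespace to one space. The helper name strip_prefix is made up, and the code assumes every line has the form unit, sign, unit, message.

```python
def strip_prefix(line):
    """Drop the first three whitespace-separated fields of a line."""
    # maxsplit=3 yields at most four parts; the fourth is the rest of the line.
    parts = line.strip().split(None, 3)
    # Lines with fewer than four fields are returned unchanged.
    return parts[3] if len(parts) == 4 else line

log = """%/h >  %/h Current value over threshold value
Pg/S >  Pg/S  No. of pages paged in exceeds threshold
MB <  MB   min. avg. value over threshold value"""

cleaned = [strip_prefix(line) for line in log.splitlines()]
print('\n'.join(cleaned))
```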


1

This is a relatively long regex, but it gets the job done.

[%\w][\/\w]\/?[\/\s\w]\s?\<?\>?\s\s[\w%]\/?[a-zA-Z%]\/?[\w]?\s\s?\s?

Demo: https://regex101.com/r/ayh19b/4

Or you can do something like:

^[\s\S]*?(?=\w\w(?:\w|\.))

Demo: https://regex101.com/r/ayh19b/6
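
As a quick check in Python, the second pattern can be run against the sample lines one at a time. This is only a sketch: it relies on the message starting at the first spot where two word characters are followed by a third word character or a dot, which holds for these four lines but may not hold for other log entries (a three-letter unit, for example, would stop the match too early).

```python
import re

# Lazily consume everything from the start of the line up to the first
# place where the lookahead sees two word characters plus a third or a dot.
lead = re.compile(r'^[\s\S]*?(?=\w\w(?:\w|\.))')

lines = [
    "%/h >  %/h Current value over threshold value",
    "Pg/S >  Pg/S Current value over threshold value",
    "Pg/S >  Pg/S  No. of pages paged in exceeds threshold",
    "MB <  MB   min. avg. value over threshold value",
]

cleaned = [lead.sub('', line) for line in lines]
for line in cleaned:
    print(line)
```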
