How to remove characters with special strings using regular expression in Python?

Question

I am trying to clean up a log, and I want to remove some special strings at the start of each line.

Example:

%/h >  %/h Current value over threshold value
Pg/S >  Pg/S Current value over threshold value
Pg/S >  Pg/S  No. of pages paged in exceeds threshold
MB <  MB   min. avg. value over threshold value

I have tried a few patterns, but they do not seem to work.

re.sub(r'\w\w\/\s>\s\w','',text)

Is there a good way to remove this pattern?

I want to remove the leading .../... > .../... part of each line.

I expect my output to only contain useful words.

   Current value over threshold value
   No. of pages paged in exceeds threshold
   min. avg. value over threshold value

Thank you for any ideas!

Is the content before and after the > always the same? Matching ^([^\s>]*)\s+>\s+\1 would be my idea then.
– Sebastian Proske, Commented Nov 10, 2016 at 1:41
Is it always going to be spaced that way? In other words, is the string of interest always going to be after the third space?
– idjaw, Commented Nov 10, 2016 at 1:41
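
The backreference idea from the first comment can be tried with a short sketch. It assumes, as that comment asks, that the token before and after the operator is identical. The [<>] class, the trailing \s* and the helper name strip_prefix are additions made here for illustration and are not part of the comment.

```python
import re

# Variant of the commented pattern: capture the first token, then require
# the same token again after the operator (\1 is the backreference).
# Assumptions: "<" is accepted as well as ">", and trailing spaces are eaten.
PREFIX = re.compile(r'^(\S+)\s+[<>]\s+\1\s*')


def strip_prefix(line):
    """Remove a leading 'X > X ' or 'X < X ' prefix (hypothetical helper)."""
    return PREFIX.sub('', line)


lines = [
    "%/h >  %/h Current value over threshold value",
    "Pg/S >  Pg/S Current value over threshold value",
    "Pg/S >  Pg/S  No. of pages paged in exceeds threshold",
    "MB <  MB   min. avg. value over threshold value",
]

cleaned = [strip_prefix(line) for line in lines]
for line in cleaned:
    print(line)
```

A line whose two tokens differ does not match and is returned unchanged, which may or may not be what you want for your log.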

ekhumoro · Accepted Answer · 2016-11-10 02:47:58Z

3

Assuming the structure of the file is:

[special-string] [< or >] [special-string] [message]

then this should work:

>>> import re
>>> rgx = re.compile(r'^[^<>]+[<>] +\S+ +', re.M)
>>>
>>> s = """
... %/h >  %/h Current value over threshold value
... Pg/S >  Pg/S Current value over threshold value
... Pg/S >  Pg/S  No. of pages paged in exceeds threshold
... MB <  MB   min. avg. value over threshold value
... """
>>>
>>> print(rgx.sub('', s))
Current value over threshold value
Current value over threshold value
No. of pages paged in exceeds threshold
min. avg. value over threshold value
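
One thing to watch if the log also contains lines without < or >: the class [^<>] matches newlines too, so a match can start on one line and finish on a later one. The sketch below is a possible variant that excludes \n from the class; the "collector started" line is made-up sample data, not taken from the question.

```python
import re

# Same idea as the pattern above, but \n is excluded from the first class
# so that a match cannot start on one line and end on a later one.
rgx = re.compile(r'^[^<>\n]+[<>] +\S+ +', re.M)

# The first line is a hypothetical log line with no "<" or ">" in it.
log = (
    "collector started\n"
    "%/h >  %/h Current value over threshold value\n"
    "MB <  MB   min. avg. value over threshold value\n"
)

cleaned = rgx.sub('', log)
print(cleaned)
```

With the original pattern, the "collector started" line would be removed together with the prefix of the line that follows it.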

answered Nov 10, 2016 at 2:47


3 Comments

zihan meng Over a year ago

May I ask why you use ^ at the beginning? Does it indicate the position where the pattern must start?

ekhumoro Over a year ago

@zihanmeng. Yes - it means "match the beginning of a line". That is also why the re.M flag is needed (i.e. multi-line matching).

zihan meng Over a year ago

I got it! Thank you!
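
The effect of re.M described in the comments can be seen by running the same substitution with and without the flag. This is a small sketch using two of the sample lines from the question; the variable names are chosen here for illustration.

```python
import re

text = (
    "Pg/S >  Pg/S Current value over threshold value\n"
    "MB <  MB   min. avg. value over threshold value"
)
pattern = r'^[^<>]+[<>] +\S+ +'

# Without re.M, ^ matches only at the very start of the string,
# so only the first line loses its prefix.
without_flag = re.sub(pattern, '', text)

# With re.M, ^ also matches right after each newline,
# so every line is processed.
with_flag = re.sub(pattern, '', text, flags=re.M)

print(without_flag)
print('---')
print(with_flag)
```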

idjaw · Accepted Answer · 2016-11-10 01:54:15Z

3

Based on the pattern you are trying to match, it seems you always know where the string of interest is positioned. You can do this without a regex: use split and slicing to get the section of interest, then join to turn the result back into a string.

The code below does the following:

s.split() - splits on whitespace, creating a list in which each word is an entry

[3:] - slices the list, taking everything from the fourth element onward (indexing starts at 0)

' '.join() - converts the list back to a string, placing a single space between elements

Demo:

s = "%/h >  %/h Current value over threshold value"
res = ' '.join(s.split()[3:])
print(res)

Output:

Current value over threshold value
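
To apply the same idea to the whole log rather than a single string, something like the following should work. It is a sketch that assumes every non-empty line starts with exactly three whitespace-separated tokens before the message; drop_first_three is a name made up for this example.

```python
log = (
    "%/h >  %/h Current value over threshold value\n"
    "Pg/S >  Pg/S Current value over threshold value\n"
    "Pg/S >  Pg/S  No. of pages paged in exceeds threshold\n"
    "MB <  MB   min. avg. value over threshold value\n"
)


def drop_first_three(line):
    # Assumes the prefix is exactly three whitespace-separated tokens.
    return ' '.join(line.split()[3:])


# Skip blank lines, then clean each remaining line.
cleaned = [drop_first_three(line) for line in log.splitlines() if line.strip()]
print('\n'.join(cleaned))
```

Note that split() followed by ' '.join() also collapses any run of spaces inside the message into a single space.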

edited Nov 10, 2016 at 1:54

answered Nov 10, 2016 at 1:46


Ibrahim · Accepted Answer · 2016-11-10 17:33:16Z

1

This is a relatively long regex, but it gets the job done.

[%\w][\/\w]\/?[\/\s\w]\s?\<?\>?\s\s[\w%]\/?[a-zA-Z%]\/?[\w]?\s\s?\s?

Demo: https://regex101.com/r/ayh19b/4

Or you can do something like:

^[\s\S]*?(?=\w\w(?:\w|\.))

Demo: https://regex101.com/r/ayh19b/6

edited Nov 10, 2016 at 17:33

answered Nov 10, 2016 at 1:56
