Python regular expression to replace everything but specific words

Question

I am trying to do the following with a regular expression:

import re
x = re.compile('[^(going)|^(you)]')    # words to keep
s = 'I am going home now, thank you.' # string to modify
print(re.sub(x, '_', s))

The result I get is:

'_____going__o___no______n__you_'

The result I want is:

'_____going_________________you_'

Since ^ only acts as a negation inside brackets [], this result makes sense, but I'm not sure how else to go about it.

I even tried '([^g][^o][^i][^n][^g])|([^y][^o][^u])' but it yields '_g_h___y_'.

FYI only: Your [^(going)|^(you)] fails because the [..] syntax matches exactly one character. The ^ at the very start is special, indeed meaning 'not', but everything after it is treated as a plain set of characters: ()^ginouy|.
– Jongware, Commented Jul 6, 2016 at 10:26
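
A quick way to check the point made in this comment, as a minimal sketch assuming Python 3 (the variable names are only illustrative): every match of the bracket expression is one character long, and the only characters that survive the substitution are the letters listed inside the brackets.

```python
import re

pattern = re.compile(r'[^(going)|^(you)]')
s = 'I am going home now, thank you.'

# Every match of a bracket expression is exactly one character long.
match_lengths = {len(m.group(0)) for m in pattern.finditer(s)}

# Deleting all matches leaves only characters that are listed in the set.
survivors = sorted(set(pattern.sub('', s)))

print(match_lengths)  # {1}
print(survivors)      # ['g', 'i', 'n', 'o', 'u', 'y']
```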

cdarke · Accepted Answer · 2016-07-06 10:18:40Z

Not quite as easy as it first appears, since the only simple "not" in REs is ^ inside [ ], which matches just one character (as you found). Here is my solution:

import re

def subit(m):
    # group 1 is the text to mask, group 2 the word to keep ('' at end of string)
    stuff, word = m.groups()
    return ("_" * len(stuff)) + word

s = 'I am going home now, thank you.' # string to modify

print(re.sub(r'(.+?)(going|you|$)', subit, s))

Gives:

_____going_________________you_

To explain: the RE itself (I always use raw strings) matches one or more of any character except a newline (.+), but non-greedily (the trailing ?), so it consumes as little as possible. That text is captured in the first group (the first pair of parentheses). It must be followed by either "going" or "you" or the end of the line ($), which is captured in the second group.
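
To see what the two groups capture on the example string, here is a small sketch using re.finditer with the same pattern (assuming Python 3; the name pairs is only illustrative):

```python
import re

s = 'I am going home now, thank you.'

# Each tuple is (text captured lazily by group 1, word or '' captured by group 2).
pairs = [m.groups() for m in re.finditer(r'(.+?)(going|you|$)', s)]

print(pairs)
# [('I am ', 'going'), (' home now, thank ', 'you'), ('.', '')]
```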

subit is a function (you can call it anything within reason) which is called for each match. It is passed a match object, from which we can retrieve the captured groups. We only need the length of the first group, since we are replacing each of its characters with an underscore; the second group is appended unchanged. The returned string replaces the text that matched the pattern.
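
The same idea can be wrapped in a reusable function. This is only a sketch under some assumptions: mask_except and its parameters are hypothetical names, it targets Python 3, and it deliberately changes the pattern in two ways. It uses (.*?) instead of (.+?) so that a kept word at the very start of the string (or directly after another kept word) is not masked, and it uses re.DOTALL with \Z so that newlines are masked too.

```python
import re

def mask_except(text, keep, fill='_'):
    """Replace each character of text with fill, leaving the keep words intact."""
    # Ignore empty words and try longer words first, so 'yours' beats 'you'.
    words = sorted((w for w in keep if w), key=len, reverse=True)
    if not words:
        return fill * len(text)
    # Escape the words so regex metacharacters in them are taken literally.
    alternatives = '|'.join(re.escape(w) for w in words)
    # (.*?) may be empty, so a kept word can start the string or follow another.
    # re.DOTALL lets '.' match newlines; \Z anchors at the very end of the text.
    pattern = re.compile(r'(.*?)(' + alternatives + r'|\Z)', re.DOTALL)

    def subit(m):
        stuff, word = m.groups()
        return fill * len(stuff) + word

    return pattern.sub(subit, text)

print(mask_except('I am going home now, thank you.', ['going', 'you']))
# _____going_________________you_
print(mask_except('going you', ['going', 'you']))
# going_you
```

Note that this keeps the words wherever they occur, even inside longer words such as "young"; wrapping the alternatives in \b word boundaries would be one way to restrict it to whole words.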

Kasravnd · Accepted Answer · 2016-07-06 10:38:08Z

Here is a single-regex approach:

>>> re.sub(r'(?!going|you)\b([\S\s]+?)(\b|$)', lambda x: (x.end() - x.start())*'_', s)
'_____going_________________you_'

The idea is that when you are dealing with words and want to exclude some of them, you need to remember that most regex engines (most of them use a traditional NFA) process the string character by character. Since you want to exclude two words with a negative lookahead, you need to define the allowed matches as whole words (using word boundaries). And since sub replaces each match with the replacement string, you can't just pass '_': a part like I am would then become only 3 underscores (one each for 'I', ' ' and 'am'). Instead, pass a function as the second argument of sub and multiply '_' by the length of the matched string.
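
The difference between the two kinds of replacement can be seen side by side. A minimal sketch, assuming Python 3 and the same pattern as above (collapsed and masked are illustrative names):

```python
import re

s = 'I am going home now, thank you.'
pattern = re.compile(r'(?!going|you)\b([\S\s]+?)(\b|$)')

# A plain replacement string turns each match into a single underscore,
# so the result is shorter than the input.
collapsed = pattern.sub('_', s)

# A callable can return one underscore per matched character instead.
masked = pattern.sub(lambda m: '_' * len(m.group(0)), s)

print(collapsed)  # ____going_______you_
print(masked)     # _____going_________________you_
```

Here len(m.group(0)) is equivalent to m.end() - m.start() in the original one-liner.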

edited Jul 6, 2016 at 10:38

answered Jul 6, 2016 at 10:22


2 Comments

cdarke Over a year ago

The final . before end-of-text? Also, not enough underscores between going and you.

Kasravnd Over a year ago

@cdarke Yes, seems so, let me check!
