
I want to use a regex to replace a substring of a matched string in a df series. I have looked through the documentation and found a pattern that captures the specific type of string I want to match. However, the replacement does not replace only the substring.

I have cases such as

data
initthe problem
nationthe airline
radicthe groups
professionthe experience
the cat in the hat

In this particular case, I am interested in substituting "the" with "al" in those cases where "the" is not a standalone string (i.e. one preceded and followed by whitespace).

I have tried the following solution:

patt = re.compile(r'(?:[a-z])(the)')
df['data'].str.replace(patt, r'al')

However, it also replaces the non-whitespace character preceding the "the".

Any suggestions on what I can do to replace just those specific cases of the substring?

  • But inithe will turn into inial, I guess you need initial? Even if you fix it to df['data'].str.replace(r'(?<=[a-z])the', r'al') Commented Oct 8, 2018 at 10:34

1 Answer

Try using a lookbehind, which checks (asserts) that a character precedes "the" but does not actually consume anything:

import re

text = "data\ninitthe problem\nnationthe airline\nradicthe groups\nprofessionthe experience\nthe cat in the hat"

# (?<=[a-z]) asserts a preceding lowercase letter without consuming it
output = re.sub(r'(?<=[a-z])the', 'al', text)
print(output)

data
inital problem
national airline
radical groups
professional experience
the cat in the hat
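To make the difference with the pattern from the question concrete, here is a minimal sketch using only the standard `re` module. The list `rows` and the variable names are made up for illustration; the capturing-group variant with `\1` is an alternative I am adding, not something from the question.

```python
import re

# Sample rows from the question (hypothetical list standing in for df['data']).
rows = [
    "initthe problem",
    "nationthe airline",
    "radicthe groups",
    "professionthe experience",
    "the cat in the hat",
]

# Pattern from the question: the non-capturing group still belongs to the
# overall match, so the preceding letter is replaced along with "the".
consuming = re.compile(r'(?:[a-z])(the)')

# Lookbehind: asserts a preceding lowercase letter without matching it.
lookbehind = re.compile(r'(?<=[a-z])the')

# Alternative: capture the preceding letter and write it back with \1.
backref = re.compile(r'([a-z])the')

consumed = [consuming.sub('al', s) for s in rows]
fixed = [lookbehind.sub('al', s) for s in rows]
restored = [backref.sub(r'\1al', s) for s in rows]

for before, bad, good in zip(rows, consumed, fixed):
    print(f"{before!r}: {bad!r} vs {good!r}")
```

The same lookbehind pattern should carry over to `df['data'].str.replace(...)`. As far as I know, recent pandas versions no longer treat the pattern as a regex by default, so passing `regex=True` explicitly is the safer choice there.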



4 Comments

Though it is what OP tries to use, the result will probably not be "final" since inithe will turn into inial.
@WiktorStribiżew I interpret this as bad sample data, not a bad regex solution.
Well, another dupe anyway.
Yes, sorry. There was an error in the sample data; I have updated it.
