Pandas replace characters within string based on regex match?

Question

I want to replace some characters within a string in pandas (based on a match to the entirety of the string), while leaving the rest of the string unchanged.

For instance, replace a dash with a decimal point in a number string, but only if the dash isn't at the start of the string:

'26.15971' -> '26.15971'

'1030899' -> '1030899'

'26-404700' -> '26.404700'

'-26-403268' -> '-26.403268'

Code:

import pandas as pd

# --- simple dataframe
df = pd.DataFrame({'col1':['26.15971','1030899','26-404700']})

# --- regex that only matches items of interest (raw string so \d is not treated as a string escape)
regex_match = r'^\d{1,2}-\d{1,8}'
df.col1.str.match(regex_match)

# --- not sure how to only replace the middle hyphens?
# something like  df.col1.str.replace(r'^\d{1,2}(-)\d{1,8}', r'^\d{1,2}\.\d{1,8}') ??
# unclear how to get a repl that only alters a capture group and leaves the rest 
# of the string unchanged

Tim Biegeleisen · Accepted Answer · 2020-10-08 03:01:54Z

1

You could try using a regex replacement with lookarounds:

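# regex=True is required on pandas >= 2.0, where str.replace defaults to a literal match
df["col1"] = df["col1"].str.replace(r"(?<=\d)-(?=\d)", ".", regex=True)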

The regex pattern (?<=\d)-(?=\d) targets every dash that sits between two digits and replaces it with a dot. A dash at the start of the string has no digit before it, so it is left alone.
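As a quick check, here is a minimal sketch that runs the lookaround pattern over the four sample values from the question. The frame below is rebuilt for illustration, and the fourth value (the leading-dash case) is added from the question's expected output rather than from its original dataframe.

```python
import pandas as pd

# Sample values from the question, including the leading-dash case.
df = pd.DataFrame({"col1": ["26.15971", "1030899", "26-404700", "-26-403268"]})

# Replace only a dash that has a digit on both sides.
# regex=True is passed explicitly because pandas 2.0 made literal matching the default.
df["col1"] = df["col1"].str.replace(r"(?<=\d)-(?=\d)", ".", regex=True)

print(df["col1"].tolist())
# ['26.15971', '1030899', '26.404700', '-26.403268']
```

The leading dash in the last value survives because the lookbehind finds no digit in front of it.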

We could also approach this using capture groups:

# \1 and \2 re-insert the digits captured on either side of the dash
df["col1"] = df["col1"].str.replace(r"(\d{2,3})-(\d{4,8})", r"\1.\2", regex=True)
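If the replacement should only happen when the whole string has the expected shape, as the question asks, one option is to anchor a capture-group pattern with ^ and $. This is only a sketch: the digit counts, the optional leading sign, and the extra date-like value are my assumptions for illustration, not part of the answer above.

```python
import pandas as pd

s = pd.Series(["26.15971", "1030899", "26-404700", "-26-403268", "2020-10-08"])

# Anchored pattern: optional leading sign, 1-3 digits, a single dash,
# 1-8 digits, then end of string. The widths are illustrative assumptions.
pattern = r"^(-?\d{1,3})-(\d{1,8})$"

# Strings that do not match the pattern in full are returned unchanged.
result = s.str.replace(pattern, r"\1.\2", regex=True)

print(result.tolist())
# ['26.15971', '1030899', '26.404700', '-26.403268', '2020-10-08']
```

Unlike the lookaround version, which would rewrite every digit-dash-digit occurrence, the anchored pattern leaves the date-like string alone because it does not match as a whole.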

edited Oct 8, 2020 at 3:01

answered Oct 8, 2020 at 2:48


5 Comments

Mark_Anderson Over a year ago

Very nice! So I think a positive lookbehind can't be variable width like (?<=\d{2,3}) which reduces flexibility in the match. Any thoughts?

Tim Biegeleisen Over a year ago

@Mark_Anderson Actually, you can use a variable width positive lookahead, so this would be legitimate also: (?<=\d)-(?=\d{2,3})

Mark_Anderson Over a year ago

Agreed, but is there a way to get flexibility in the lookbehind? I still really like the solution, but I'm wondering if there is a way to get full flexibility. (If the lookbehind can't be flexible, maybe use 3 capture groups, (\d{2,3})(?P<hyphen>-)(\d{4,8}), and only swap out the middle capture group that has the hyphen?)

Tim Biegeleisen Over a year ago

@Mark_Anderson I don't know why you think you need this, but I updated my answer anyway. I think just asserting that even one digit be on either side of the dash should be OK logic here.

Mark_Anderson Over a year ago

Just a general principle. More flexibility is more good. Mostly in case someone with a similar problem ends up on this post.
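For anyone following the lookbehind discussion in the comments, here is a small sketch using Python's standard re module. It shows that a variable-width lookbehind is rejected at compile time, and two possible workarounds: the three-group idea from the comments, and a fixed-width lookbehind. The test string and the \d{4} lookahead width are assumptions chosen for illustration.

```python
import re

# Python's built-in re module rejects variable-width lookbehind at compile time.
try:
    re.compile(r"(?<=\d{2,3})-(?=\d)")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Workaround from the comment thread: capture both sides of the dash
# and rebuild the match, changing only the middle group.
three_groups = re.compile(r"(\d{2,3})(?P<hyphen>-)(\d{4,8})")
via_groups = three_groups.sub(r"\1.\3", "-26-403268")

# "At least two digits before the dash" can also be written with a fixed width:
# if three digits precede the dash, two digits precede it as well.
via_fixed = re.sub(r"(?<=\d{2})-(?=\d{4})", ".", "-26-403268")

print(lookbehind_ok, via_groups, via_fixed)
# False -26.403268 -26.403268
```

As far as I know, the third-party regex package does accept variable-width lookbehind, but the examples here stick to the standard library.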
