
I have read some pricing data into a pandas DataFrame, and the values appear as:

$40,000*
$40000 conditions attached

I want to strip it down to just the numeric values. I know I can loop through and apply the regex

[0-9]+

to each field and then join the resulting list back together, but is there a way to do this without an explicit loop?

6 Answers


You could use Series.str.replace:

import pandas as pd

df = pd.DataFrame(['$40,000*','$40000 conditions attached'], columns=['P'])
print(df)
#                             P
# 0                    $40,000*
# 1  $40000 conditions attached

df['P'] = df['P'].str.replace(r'\D+', '', regex=True).astype('int')
print(df)

yields

       P
0  40000
1  40000

since \D matches any character that is not a decimal digit.
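One caveat with stripping every non-digit: a decimal point is also removed, so a value like $40,000.50 would become 4000050. If your data may contain decimals, a sketch using Series.str.extract (assuming the first number in each cell is the price) keeps them:

```python
import pandas as pd

df = pd.DataFrame(['$40,000.50*', '$40000 conditions attached'], columns=['P'])

# Extract the first number (optionally with thousands separators and a
# decimal part), drop the commas, then convert to float.
num = df['P'].str.extract(r'(\d[\d,]*(?:\.\d+)?)', expand=False)
df['P'] = num.str.replace(',', '', regex=False).astype(float)
print(df['P'].tolist())  # [40000.5, 40000.0]
```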


1 Comment

Any way to do this in place?

You could use pandas' replace method; also, you may want to keep the thousands separator ',' and the decimal separator '.':

import pandas as pd

df = pd.DataFrame(['$40,000.32*','$40000 conditions attached'], columns=['pricing'])
df['pricing'] = df['pricing'].replace(to_replace=r"\$([0-9,.]+).*", value=r"\1", regex=True)
print(df)
     pricing
0  40,000.32
1      40000
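Note the extracted values are still strings and the first one still contains a comma. If you want to do arithmetic on the column, a small follow-up sketch (same sample data as above) removes the separator and casts to float:

```python
import pandas as pd

df = pd.DataFrame(['$40,000.32*', '$40000 conditions attached'], columns=['pricing'])
df['pricing'] = df['pricing'].replace(to_replace=r"\$([0-9,.]+).*", value=r"\1", regex=True)

# The column still holds strings ('40,000.32', '40000'); drop the
# thousands separator and cast to float before doing math on it.
df['pricing'] = df['pricing'].str.replace(',', '', regex=False).astype(float)
print(df['pricing'].tolist())  # [40000.32, 40000.0]
```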



You could remove all the non-digits using re.sub():

import re

value = re.sub(r"[^0-9]+", "", value)

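To apply re.sub across a whole column without writing an explicit loop in your own code, one option is to map a compiled pattern over the Series (a sketch with made-up sample data; pandas still iterates internally, as it does for str.replace):

```python
import re

import pandas as pd

pattern = re.compile(r"[^0-9]+")
s = pd.Series(['$40,000*', '$40000 conditions attached'])

# .map applies the function element-wise, so there is no explicit
# loop in your code.
cleaned = s.map(lambda v: pattern.sub("", v))
print(cleaned.tolist())  # ['40000', '40000']
```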

8 Comments

\D+ will be the smallest :-P
whats the best way to apply it to the column in the dataframe? so I have df['pricing'] do I just loop row by row?
ok I think I got it for pandas use: df['Pricing'].replace(to_replace='[^0-9]+', value='',inplace==True,regex=True) the .replace method uses re.sub
caution - stripping all non-digit symbols would remove the negative sign and decimal point, and join together unrelated numbers, e.g. "$8.99 but $2 off with coupon" becomes "8992", "$5.99" becomes "599", "$5" becomes "5".
@KillerSnail Your solution needs one correction: The double equals (==) after inplace should be replaced by single equals (=) df['Pricing'].replace(to_replace='[^0-9]+', value='',inplace=True,regex=True)

You don't need regex for this. Note that convert_objects has been removed from modern pandas; the current equivalent is pd.to_numeric:

df['col'] = pd.to_numeric(df['col'], errors='coerce')



In case anyone is still reading this: I'm working on a similar problem and needed to replace an entire column of pandas data using a regex I figured out with re.sub.

To apply this to my entire column, here's the code.

import re

#add_map holds the replacement rules for the strings in the pd df.
add_map = dict([
    ("AV", "Avenue"),
    ("BV", "Boulevard"),
    ("BP", "Bypass"),
    ("BY", "Bypass"),
    ("CL", "Circle"),
    ("DR", "Drive"),
    ("LA", "Lane"),
    ("PY", "Parkway"),
    ("RD", "Road"),
    ("ST", "Street"),
    ("WY", "Way"),
    ("TR", "Trail"),
])

obj = data_909['Address'].copy() #data_909['Address'] contains the original addresses
for k, v in add_map.items(): #based on the rules in the dict
    rule1 = r"(\b)(%s)(\b)" % k #replace k only when it stands alone (see \b)
    rule2 = lambda m: add_map.get(m.group(), m.group()) #look up the matched abbreviation in add_map
    obj = obj.str.replace(rule1, rule2, regex=True, flags=re.IGNORECASE) #use flags here to avoid the dictionary iteration problem
data_909['Address_n'] = obj #store it!

Hope this helps anyone searching for the problem I had. Cheers
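For what it's worth, the per-key loop above can usually be collapsed into a single Series.replace call with a dict of word-boundary patterns. A sketch with a shortened add_map and made-up addresses (the (?i) inline flag stands in for re.IGNORECASE):

```python
import pandas as pd

add_map = {"AV": "Avenue", "RD": "Road", "ST": "Street"}  # abbreviated for the sketch

addresses = pd.Series(["123 MAIN ST", "45 OAK AV", "9 RIVER RD"])

# Build one word-boundary regex per abbreviation, then let
# Series.replace apply them all in a single call.
rules = {r"(?i)\b%s\b" % k: v for k, v in add_map.items()}
cleaned = addresses.replace(rules, regex=True)
print(cleaned.tolist())  # ['123 MAIN Street', '45 OAK Avenue', '9 RIVER Road']
```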

1 Comment

The rule2 = (lambda... is used as a callable, so in your obj.str.replace the match object is passed to it; m.group() returns the matched text, which is used as the dictionary key to look up the replacement value. Read pandas.Series.str.replace and dict.get() for more information.

You can also use .replace() directly, passing the pattern as the regex= argument and the replacement as the value= argument.

df = pd.DataFrame({'col': ["$40,000*", "$40000 conditions attached"]})
df['col'] = df['col'].replace(regex=r'\D+', value='')

This method performs just as fast as the str.replace method (both are syntactic sugar for a Python loop). However, the advantage of this method over str.replace is that it can replace values in multiple columns in one call. For a DataFrame of string values, one can use:

df = df.replace(regex=r'\D+', value='')

the equivalent syntax using str.replace would be:

df = df.apply(lambda col: col.str.replace(r'\D+', '', regex=True))

