
I have read some pricing data into a pandas DataFrame, and the values appear as:

$40,000*
$40000 conditions attached

I want to strip it down to just the numeric values. I know I can loop through and apply the regex

[0-9]+

to each field and then join the resulting list back together, but is there a way to do this without an explicit loop?

6 Answers


You could use Series.str.replace:

import pandas as pd

df = pd.DataFrame(['$40,000*','$40000 conditions attached'], columns=['P'])
print(df)
#                             P
# 0                    $40,000*
# 1  $40000 conditions attached

df['P'] = df['P'].str.replace(r'\D+', '', regex=True).astype('int')
print(df)

yields

       P
0  40000
1  40000

since \D matches any character that is not a decimal digit.
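One caveat with stripping every non-digit: a decimal point is also removed, so a value like $40,000.50 would become 4000050. If your data may contain decimals, a sketch using Series.str.extract (assuming the first number in each cell is the price) keeps them:

```python
import pandas as pd

df = pd.DataFrame(['$40,000.50*', '$40000 conditions attached'], columns=['P'])

# Extract the first number (optionally with thousands separators and a
# decimal part), drop the commas, then convert to float.
num = df['P'].str.extract(r'(\d[\d,]*(?:\.\d+)?)', expand=False)
df['P'] = num.str.replace(',', '', regex=False).astype(float)
print(df['P'].tolist())  # [40000.5, 40000.0]
```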


1 Comment

Any way to do this in place?

You could use pandas' replace method; also, you may want to keep the thousands separator ',' and the decimal separator '.':

import pandas as pd

df = pd.DataFrame(['$40,000.32*','$40000 conditions attached'], columns=['pricing'])
df['pricing'] = df['pricing'].replace(to_replace=r"\$([0-9,.]+).*", value=r"\1", regex=True)
print(df)
     pricing
0  40,000.32
1      40000
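Note the extracted values are still strings and the first one still contains a comma. If you want to do arithmetic on the column, a small follow-up sketch (same sample data as above) removes the separator and casts to float:

```python
import pandas as pd

df = pd.DataFrame(['$40,000.32*', '$40000 conditions attached'], columns=['pricing'])
df['pricing'] = df['pricing'].replace(to_replace=r"\$([0-9,.]+).*", value=r"\1", regex=True)

# The column still holds strings ('40,000.32', '40000'); drop the
# thousands separator and cast to float before doing math on it.
df['pricing'] = df['pricing'].str.replace(',', '', regex=False).astype(float)
print(df['pricing'].tolist())  # [40000.32, 40000.0]
```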



You could remove all the non-digits using re.sub():

import re

value = re.sub(r"[^0-9]+", "", value)

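To apply re.sub across a whole column without writing an explicit loop in your own code, one option is to map a compiled pattern over the Series (a sketch with made-up sample data; pandas still iterates internally, as it does for str.replace):

```python
import re

import pandas as pd

pattern = re.compile(r"[^0-9]+")
s = pd.Series(['$40,000*', '$40000 conditions attached'])

# .map applies the function element-wise, so there is no explicit
# loop in your code.
cleaned = s.map(lambda v: pattern.sub("", v))
print(cleaned.tolist())  # ['40000', '40000']
```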

8 Comments

\D+ will be the smallest :-P
whats the best way to apply it to the column in the dataframe? so I have df['pricing'] do I just loop row by row?
ok I think I got it for pandas use: df['Pricing'].replace(to_replace='[^0-9]+', value='',inplace==True,regex=True) the .replace method uses re.sub
caution - stripping all non-digit symbols would remove the negative sign and decimal point, and join together unrelated numbers, e.g. "$8.99 but $2 off with coupon" becomes "8992", "$5.99" becomes "599", "$5" becomes "5".
@KillerSnail Your solution needs one correction: The double equals (==) after inplace should be replaced by single equals (=) df['Pricing'].replace(to_replace='[^0-9]+', value='',inplace=True,regex=True)

You don't need regex for this. Note that convert_objects has been removed from modern pandas; the current equivalent is pd.to_numeric:

df['col'] = pd.to_numeric(df['col'], errors='coerce')



In case anyone is still reading this: I'm working on a similar problem and needed to replace an entire column of pandas data using a regex I figured out with re.sub.

To apply this to my entire column, here's the code.

import re

#add_map holds the replacement rules for the strings in the pd df.
add_map = dict([
    ("AV", "Avenue"),
    ("BV", "Boulevard"),
    ("BP", "Bypass"),
    ("BY", "Bypass"),
    ("CL", "Circle"),
    ("DR", "Drive"),
    ("LA", "Lane"),
    ("PY", "Parkway"),
    ("RD", "Road"),
    ("ST", "Street"),
    ("WY", "Way"),
    ("TR", "Trail"),
])

obj = data_909['Address'].copy() #data_909['Address'] contains the original addresses
for k, v in add_map.items(): #based on the rules in the dict
    rule1 = r"(\b)(%s)(\b)" % k #replace k only when it stands alone (see \b)
    rule2 = lambda m: add_map.get(m.group(), m.group()) #look up the matched abbreviation in add_map
    obj = obj.str.replace(rule1, rule2, regex=True, flags=re.IGNORECASE) #use flags here to avoid the dictionary iteration problem
data_909['Address_n'] = obj #store it!

Hope this helps anyone searching for the problem I had. Cheers
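For what it's worth, the per-key loop above can usually be collapsed into a single Series.replace call with a dict of word-boundary patterns. A sketch with a shortened add_map and made-up addresses (the (?i) inline flag stands in for re.IGNORECASE):

```python
import pandas as pd

add_map = {"AV": "Avenue", "RD": "Road", "ST": "Street"}  # abbreviated for the sketch

addresses = pd.Series(["123 MAIN ST", "45 OAK AV", "9 RIVER RD"])

# Build one word-boundary regex per abbreviation, then let
# Series.replace apply them all in a single call.
rules = {r"(?i)\b%s\b" % k: v for k, v in add_map.items()}
cleaned = addresses.replace(rules, regex=True)
print(cleaned.tolist())  # ['123 MAIN Street', '45 OAK Avenue', '9 RIVER Road']
```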

1 Comment

The rule2 = (lambda... is used as a callable, so in your obj.str.replace the match object is passed to it; m.group() returns the matched text, which is used as the dictionary key to look up the replacement value. Read pandas.Series.str.replace and dict.get() for more information.

You can also use .replace() directly, passing the pattern as the regex= argument and the replacement as the value= argument.

df = pd.DataFrame({'col': ["$40,000*", "$40000 conditions attached"]})
df['col'] = df['col'].replace(regex=r'\D+', value='')

This method performs just as fast as the str.replace method (both are syntactic sugar for a Python loop). However, the advantage of this method over str.replace is that it can replace values in multiple columns in one call. For a DataFrame of string values, one can use:

df = df.replace(regex=r'\D+', value='')

the equivalent syntax using str.replace would be:

df = df.apply(lambda col: col.str.replace(r'\D+', '', regex=True))

