
I am trying to extract only the numbers from a pandas column.

I used .str.extract('([?:\s\d+]+)')

It seemed to work well, but when I checked the data, I found a row where the result does not contain every number.

Row contains: 86531 86530 86529PIP 91897PIP

Result: 86531 86530 86529

As you can see, the last value, 91897PIP, has not been processed. I do not understand why.

What I need is to extract every run of digits, regardless of whether it has letters, spaces, or special characters before or after it.

2 Answers


Your regex doesn't do what you think it does. What you have is a character class, which matches any single character in the set ?: \t\r\n\f\v0-9+, repeated one or more times by the + that follows it. So when the regex reaches the first character outside that set (P in your sample data), the match ends, and because extract returns only the first match, everything after that point is dropped. It's probably simpler to use replace to remove every character that is neither whitespace nor a digit:

import pandas as pd
df = pd.DataFrame({'data': ['86531 86530 86529PIP 91897PIP']})
# Remove every character that is neither whitespace nor a digit
df['data'].str.replace(r'[^\s\d]', '', regex=True)

Which for your data will give:

86531 86530 86529 91897
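
To see why the original pattern stops early, it may help to run it with Python's re module directly. This is only a small sketch using the sample row from the question; the variable names are made up for illustration.

```python
import re

row = "86531 86530 86529PIP 91897PIP"

# The original pattern: one capture group wrapping a single character class.
# Inside [...], the characters "?", ":" and "+" are literals, not operators.
original = re.compile(r"([?:\s\d+]+)")

# search() finds the first run of allowed characters; the run ends at the first "P".
first_run = original.search(row).group(1)
print(repr(first_run))

# The same class also accepts a literal "?", ":" or "+".
literal_match = original.search("?:+ 12").group(1)
print(repr(literal_match))
```

The first print shows that the match covers only the text before the first P, which is the same truncated result the question reports.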


You can pass in a regular expression to .extract() that matches one or more digits:

import pandas as pd
df = pd.DataFrame({"data": ["86531", "86530", "86529PIP", "91897PIP"]})
df["data"] = df["data"].str.extract(r"(\d+)")
print(df)

This outputs:

    data
0  86531
1  86530
2  86529
3  91897
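
Note that .str.extract() returns only the first match in each cell. If all the numbers sit in one cell, as in the question's sample row, one possible alternative (not part of the answer above) is .str.findall(), which collects every run of digits. A minimal sketch, with the column names numbers and numbers_joined chosen only for illustration:

```python
import pandas as pd

df = pd.DataFrame({"data": ["86531 86530 86529PIP 91897PIP"]})

# findall returns, for each row, a list of every non-overlapping match of \d+.
df["numbers"] = df["data"].str.findall(r"\d+")

# Join the list back into one space-separated string if a list is not wanted.
df["numbers_joined"] = df["numbers"].str.join(" ")

print(df["numbers"].iloc[0])
print(df["numbers_joined"].iloc[0])
```

Whether a list or a joined string is more useful depends on what happens to the numbers next; .str.extractall(r"(\d+)") is another option if one row per match is preferred.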
