I have an Excel worksheet that I am reading into pandas for parsing and later analysis. It has the following format. All values are strings; they will be converted to floats/ints later, but keeping them as strings helps with parsing.

column1  |  column2 | column3 |
-----------------------------
12345   |10         |20       |
txt     |25         |65       |
35615   |15         |20       |
txt     |35         |20       |

I need to get the index of every row whose column1 value is a 5-digit number. These values will always be exactly 5 digits. I am using the following regex.

\b\d{5}\b

I am having trouble getting pandas to match the 5 digits properly when using any of the built-in string methods.

I have tried the following.

df.column1.str.contains('\b\d{5}\b', regex=True).index.list()
df.column1.str.match('\b\d{5}\b').index.list()

I am expecting it to return

[0,2]

Both of these return an empty list. What am I doing wrong?

1 Answer

Add r before the string to make it a raw string, filter by boolean indexing, and convert the index values to a list:

i = df[df.column1.str.contains(r'\b\d{5}\b')].index.tolist()
print(i)
[0, 2]
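
To illustrate why the r prefix matters, here is a minimal sketch using only the standard re module. The variable names (backspace, boundary, broken, fixed) are made up for this illustration. In a normal string literal, Python turns \b into the backspace character before the regex engine ever sees it, so the pattern looks for literal backspaces rather than word boundaries:

```python
import re

backspace = '\b'   # normal literal: a single character, ASCII backspace (0x08)
boundary = r'\b'   # raw literal: backslash followed by 'b', a regex word boundary

print(backspace == '\x08', len(backspace))  # True 1
print(len(boundary))                        # 2

# What the non-raw pattern '\b\d{5}\b' amounts to at runtime:
# backspace, then \d{5}, then backspace.
broken = backspace + r'\d{5}' + backspace
fixed = r'\b\d{5}\b'

print(re.search(broken, '12345'))  # None, because '12345' contains no backspace
print(re.search(fixed, '12345'))   # a match object spanning '12345'
```

A raw string is not strictly required: doubling the backslashes ('\\b\\d{5}\\b') produces the same pattern. Raw strings are simply the less error-prone habit whenever a pattern contains backslashes.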

Or, if you want to match only values that consist entirely of 5 digits, anchor the regex with ^ and $ for the start and end of the string:

i = df[df.column1.str.contains(r'^\d{5}$')].index.tolist()
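
The difference between the two patterns only shows up when a cell contains more than the number itself. The sketch below uses a made-up DataFrame (the rows 'id 54321 x' and '123456' are assumptions added for illustration and are not in the question's data). It also shows Series.str.fullmatch as a possible alternative, assuming pandas 1.1 or newer:

```python
import pandas as pd

# Hypothetical sample data: the last two rows are added for illustration only.
df = pd.DataFrame({'column1': ['12345', 'txt', '35615', 'id 54321 x', '123456']})

# Word boundaries: a 5-digit run anywhere in the cell counts as a match.
word_bounded = df[df.column1.str.contains(r'\b\d{5}\b')].index.tolist()

# Anchors: the whole cell must be exactly 5 digits.
whole_string = df[df.column1.str.contains(r'^\d{5}$')].index.tolist()

# fullmatch requires the entire string to match, so no anchors are needed
# (assumes pandas >= 1.1, where Series.str.fullmatch is available).
full = df[df.column1.str.fullmatch(r'\d{5}')].index.tolist()

print(word_bounded)  # [0, 2, 3]
print(whole_string)  # [0, 2]
print(full)          # [0, 2]
```

For the data in the question all three give [0, 2]; which one to use depends on whether cells with surrounding text should count.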
1 Comment

I missed the raw string. Does Python always need a regex to be passed as a raw string for it to function correctly?