Get index of regex match in pandas dataframe not working

Question

I have an Excel worksheet that I am reading into pandas for parsing and later analysis. It has the following format. All values are strings; they will be converted to floats/ints later, but keeping them as strings helps with parsing.

column1 | column2 | column3
--------|---------|--------
12345   | 10      | 20
txt     | 25      | 65
35615   | 15      | 20
txt     | 35      | 20

I need to get the index of every 5-digit numeric value in column1. These values are always exactly 5 digits long. I am using the following regex.

\b\d{5}\b

I am having problems getting pandas to properly match the 5 digits when using any of the built-in string methods.

I have tried the following.

df.column1.str.contains('\b\d{5}\b', regex=True).index.list()
df.column1.str.match('\b\d{5}\b').index.list()

I am expecting it to return

[0,2]

Both of these return an empty list. What am I doing wrong?

jezrael · Accepted Answer · 2020-01-26 07:36:38Z


Add r before the pattern string so it is a raw string, filter the rows by boolean indexing, and convert the resulting index values to a list:

i = df[df.column1.str.contains(r'\b\d{5}\b')].index.tolist()
print(i)
[0, 2]
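Why the r prefix matters can be checked with the standard re module alone. The sketch below is only an illustration: the values list is a stand-in for column1, and the non_raw pattern is assembled by hand to mirror what Python produces from '\b\d{5}\b' without the prefix. A raw string is not strictly required, since doubling the backslashes gives the same pattern, but it is the usual way to avoid this problem.

```python
import re

# In a normal string literal, '\b' is the backspace character (one character).
assert '\b' == '\x08' and len('\b') == 1
# In a raw string literal, r'\b' is a backslash followed by 'b' (two characters),
# which the regex engine reads as a word boundary.
assert len(r'\b') == 2

# What the regex engine receives for '\b\d{5}\b' without the r prefix:
# backspace, then \d{5}, then backspace. ('\d' is not a recognised string
# escape, so Python leaves it alone, with a warning in recent versions.)
non_raw = '\b' + r'\d{5}' + '\b'
raw = r'\b\d{5}\b'

# Escaping each backslash by hand is equivalent to using a raw string.
assert '\\b\\d{5}\\b' == raw

# Stand-in for the values in column1 (illustrative only).
values = ['12345', 'txt', '35615', 'txt']
non_raw_hits = [i for i, v in enumerate(values) if re.search(non_raw, v)]
raw_hits = [i for i, v in enumerate(values) if re.search(raw, v)]
print(non_raw_hits)  # []
print(raw_hits)      # [0, 2]
```
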

Or, if you want to match only values that consist of exactly 5 digits and nothing else, change the regex to use ^ and $ for the start and end of the string:

i = df[df.column1.str.contains(r'^\d{5}$')].index.tolist()
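The two patterns give the same result on the data in the question, but they can differ on other input. A small sketch of the difference, assuming a made-up Series s that adds two hypothetical rows ('id 12345' and '123456') to the question's values:

```python
import pandas as pd

# Hypothetical sample data: values from the question plus two extra rows
# chosen to show where the two patterns disagree.
s = pd.Series(['12345', 'txt', '35615', 'id 12345', '123456'])

# Word boundaries: a 5-digit run anywhere in the string counts as a match.
word_boundary = s[s.str.contains(r'\b\d{5}\b')].index.tolist()

# Anchors: the whole string must be exactly 5 digits.
anchored = s[s.str.contains(r'^\d{5}$')].index.tolist()

print(word_boundary)  # [0, 2, 3]
print(anchored)       # [0, 2]
```

Neither pattern matches '123456', because there is no word boundary inside a run of six digits.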



1 Comment

Rusty Over a year ago

I missed the raw string. Does Python always need a regex to be passed as a raw string for it to work correctly?
