regular expression using pandas string match

Question

Input data:

                        name  Age Zodiac Grade            City  pahun
0                   /extract   30  Aries     A            Aura  a_b_c
1  /abc/236466/touchbar.html   20    Leo    AB      Somerville  c_d_e
2                    Brenda4   25  Virgo     B  Hendersonville    f_g
3     /abc/256476/mouse.html   18  Libra    AA          Gannon  h_i_j

I am trying to filter rows with a regex on the name column. The regex should match numbers that are 6 digits long.

For example:
/abc/236466/touchbar.html  - 236466

Here is the code I have used

df=df[df['name'].str.match(r'\d{6}') == True]

The line above does not match any rows.

Expected:

                         name  Age Zodiac Grade            City  pahun
0  /abc/236466/touchbar.html   20    Leo    AB      Somerville  c_d_e
1     /abc/256476/mouse.html   18  Libra    AA          Gannon  h_i_j

Can anyone tell me what I am doing wrong?

.match only searches for a match at the start of the string. Use str.contains(r'/\d{6}/') to find entries containing / + 6 digits + /.
– Wiktor Stribiżew, Commented Jul 14, 2020 at 15:48

Wiktor Stribiżew · Accepted Answer · 2024-08-15 09:41:58Z

5

str.match only searches for a match at the start of the string. So, if you want to match / + 6 digits + / somewhere inside the string using str.match, you would need to use one of

df=df[df['name'].str.match(r'.*/\d{6}/')]      # greedy: assuming the match is closer to the end of the string
df=df[df['name'].str.match(r'(?s).*/\d{6}/')]  # same, but . also matches line breaks (DOTALL)
df=df[df['name'].str.match(r'.*?/\d{6}/')]     # lazy: assuming the match is closer to the start of the string
df=df[df['name'].str.match(r'(?s).*?/\d{6}/')] # same, but . also matches line breaks (DOTALL)

However, it is more reasonable and efficient here to use str.contains with a regex like

df=df[df['name'].str.contains(r'/\d{6}/')]

to find entries containing / + 6 digits + /.
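
To make the difference concrete, here is a minimal sketch that rebuilds part of the question's frame (only the name and Age columns, which is an assumption made for brevity) and compares the two methods:

```python
import pandas as pd

# Reduced copy of the question's data; the other columns are omitted for brevity.
df = pd.DataFrame({
    "name": ["/extract", "/abc/236466/touchbar.html", "Brenda4", "/abc/256476/mouse.html"],
    "Age": [30, 20, 25, 18],
})

# str.match anchors at the start of each string (like re.match),
# and no name starts with six digits.
anchored = df["name"].str.match(r"\d{6}")

# str.contains searches anywhere in the string (like re.search).
anywhere = df["name"].str.contains(r"/\d{6}/")

result = df[anywhere].reset_index(drop=True)
print(anchored.tolist())
print(result)
```

The reset_index(drop=True) call is only there to renumber the rows from 0, as in the expected output of the question.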

Or, to make sure you match only 6-digit chunks and not chunks of 7 or more digits:

df=df[df['name'].str.contains(r'(?<!\d)\d{6}(?!\d)')]

where

(?<!\d) - makes sure there is no digit on the left
\d{6} - any six digits
(?!\d) - no digit on the right is allowed.
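
A small sketch of how the lookarounds change the result. The sample strings other than the first are made up for illustration and do not come from the question:

```python
import pandas as pd

# Hypothetical sample values: a 6-digit chunk, a 7-digit chunk,
# a 6-digit chunk without slashes, and a string with a single digit.
names = pd.Series([
    "/abc/236466/touchbar.html",
    "/abc/1234567/long.html",
    "id236466",
    "Brenda4",
])

# With lookarounds: the six digits must not be preceded or followed by a digit.
bounded = names.str.contains(r"(?<!\d)\d{6}(?!\d)")

# Without lookarounds: any six consecutive digits qualify,
# including six digits inside a longer run.
unbounded = names.str.contains(r"\d{6}")

print(bounded.tolist())
print(unbounded.tolist())
```

Unlike r'/\d{6}/', this pattern does not require slashes, so it also accepts a value such as id236466.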

edited Aug 15, 2024 at 9:41

answered Jul 14, 2020 at 15:53

Wiktor Stribiżew


1 Comment

Wiktor Stribiżew Over a year ago

Just FYI, especially for regex novices: r'/\d{6}/' != r'\d{6}'! The forward slashes are part of the regex pattern here, and there will only be a match if there are literal / chars on both sides of the 6-digit substring. Do not confuse a regex declared with a string literal (as here) with a regex literal (as in Perl, JavaScript, Ruby and some other languages), where / characters serve as regex delimiters.
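
The point about literal slashes can be checked with the standard re module. This is only an illustrative sketch; the second sample string is hypothetical:

```python
import re

# In Python the pattern is an ordinary string, so the slashes are literal characters.
with_slashes = re.compile(r"/\d{6}/")
digits_only = re.compile(r"\d{6}")

samples = ["/abc/236466/touchbar.html", "item236466.html"]  # second value is made up

# Record, for each sample, whether each pattern finds a match anywhere in it.
found = {
    s: (with_slashes.search(s) is not None, digits_only.search(s) is not None)
    for s in samples
}
print(found)
```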

YOLO · Accepted Answer · 2020-07-14 15:51:40Z

0

You are almost there; use str.contains instead:

df[df['name'].str.contains(r'\d{6,}')]
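
Note that for filtering with str.contains, \d{6,} and \d{6} select the same rows, because any run of six or more digits also contains six digits. The quantifier only makes a difference when the digits are captured, for example with str.extract. A rough sketch, using one hypothetical 7-digit value that is not in the question's data:

```python
import pandas as pd

names = pd.Series([
    "/abc/236466/touchbar.html",
    "/abc/1234567/long.html",  # hypothetical 7-digit example
    "Brenda4",
])

# Both patterns flag the same rows when used only as a filter.
same_rows = names.str.contains(r"\d{6,}").equals(names.str.contains(r"\d{6}"))

# When capturing, {6,} takes the whole run of digits, {6} only the first six.
greedy = names.str.extract(r"(\d{6,})", expand=False)
exact = names.str.extract(r"(\d{6})", expand=False)

print(same_rows)
print(greedy.tolist())
print(exact.tolist())
```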

answered Jul 14, 2020 at 15:51

YOLO
