Input data:

                        name  Age Zodiac Grade            City  pahun
0                   /extract   30  Aries     A            Aura  a_b_c
1  /abc/236466/touchbar.html   20    Leo    AB      Somerville  c_d_e
2                    Brenda4   25  Virgo     B  Hendersonville    f_g
3     /abc/256476/mouse.html   18  Libra    AA          Gannon  h_i_j

I am trying to extract rows based on a regex applied to the name column. The regex should match numbers that are 6 digits long.

For example:
/abc/236466/touchbar.html  - 236466

Here is the code I have used

df=df[df['name'].str.match(r'\d{6}') == True]

The above line does not match any rows.

Expected:

                         name  Age Zodiac Grade            City  pahun
0  /abc/236466/touchbar.html   20    Leo    AB      Somerville  c_d_e
1     /abc/256476/mouse.html   18  Libra    AA          Gannon  h_i_j

Can anyone tell me what I am doing wrong?

  • .match only searches for a match at the start of the string. Use str.contains(r'/\d{6}/') to find entries containing / + 6 digits + /. Commented Jul 14, 2020 at 15:48
  • check with .find or contains? Commented Jul 14, 2020 at 15:49
  • @WiktorStribiżew It is working with contains. Thanks Commented Jul 14, 2020 at 15:51

2 Answers

str.match only searches for a match at the start of the string. So, if you want to match / + 6 digits + / somewhere inside the string using str.match, you would need to use one of

df=df[df['name'].str.match(r'.*/\d{6}/')]      # assuming the match is closer to the end of the string
df=df[df['name'].str.match(r'(?s).*/\d{6}/')]  # same, but . also matches line breaks
df=df[df['name'].str.match(r'.*?/\d{6}/')]     # assuming the match is closer to the start of the string
df=df[df['name'].str.match(r'(?s).*?/\d{6}/')] # same, but . also matches line breaks

However, it is more reasonable and efficient here to use str.contains with a regex like

df=df[df['name'].str.contains(r'/\d{6}/')]

to find entries containing / + 6 digits + /.
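
To make the difference concrete, here is a minimal sketch that rebuilds only the name and Age columns from the question (the remaining columns are left out for brevity). The variable names start_only, anywhere and filtered are illustrative, not part of the original code.

```python
import pandas as pd

# Subset of the question's data: only the columns needed for the demo
df = pd.DataFrame({
    "name": ["/extract", "/abc/236466/touchbar.html", "Brenda4", "/abc/256476/mouse.html"],
    "Age": [30, 20, 25, 18],
})

# str.match anchors the pattern at the start of each string,
# and no name here starts with six digits
start_only = df["name"].str.match(r"\d{6}")

# str.contains searches anywhere inside each string
anywhere = df["name"].str.contains(r"/\d{6}/")

# Keep only the rows where the pattern was found
filtered = df[anywhere]
print(filtered)
```

Note that the filtered frame keeps the original index labels; chaining .reset_index(drop=True) would renumber the rows from 0 as in the expected output.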

Or, to make sure you match only 6-digit chunks and not chunks of 7 or more digits:

df=df[df['name'].str.contains(r'(?<!\d)\d{6}(?!\d)')]

where

  • (?<!\d) - makes sure there is no digit on the left
  • \d{6} - any six digits
  • (?!\d) - no digit on the right is allowed.
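
A small sketch of what the lookarounds change. The sample values below are hypothetical and not taken from the question's data; they only serve to contrast a plain \d{6} with the guarded version.

```python
import pandas as pd

# Hypothetical sample values (not from the question)
names = pd.Series([
    "/abc/236466/a.html",   # exactly six digits
    "/abc/1234567/b.html",  # seven digits
    "id236466",             # six digits, no slashes
    "Brenda4",              # one digit
])

# Plain \d{6} also fires inside a longer digit run
loose = names.str.contains(r"\d{6}")

# The lookarounds reject a digit directly before or after the six digits
exact = names.str.contains(r"(?<!\d)\d{6}(?!\d)")

print(pd.DataFrame({"name": names, "loose": loose, "exact": exact}))
```

The seven-digit entry is the only one where the two patterns disagree: it satisfies the plain pattern but not the guarded one.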

1 Comment

Just FYI, especially for regex novices: r'/\d{6}/' != r'\d{6}'. The forward slashes are part of the regex pattern here, and there will only be a match if there are literal / chars on both sides of the 6-digit substring. Do not confuse a regex declared with a string literal (as here) with a regex literal (as in Perl, JavaScript, Ruby and some other languages), where / is used as the regex delimiter.
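
A quick way to see this, as a sketch: the second value below is a hypothetical name with six digits but no surrounding slashes.

```python
import pandas as pd

# First value is from the question; the second is a hypothetical example
names = pd.Series(["/abc/236466/touchbar.html", "id236466"])

# The slashes are literal characters that must appear in the text
with_slashes = names.str.contains(r"/\d{6}/")

# Without them, any run of six digits is enough
digits_only = names.str.contains(r"\d{6}")

print(with_slashes.tolist(), digits_only.tolist())
```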
You are almost there; use str.contains instead (note that \d{6,} matches six or more digits):

df[df['name'].str.contains(r'\d{6,}')]
