In the following code I want to get just the digits between '-' and 'u'. I thought I could use a regular-expression non-capturing group, (?: … ), to ignore everything from '-' to the first digit, but the output always includes it. How can I use a non-capturing group to produce the correct output?

df = pd.DataFrame(
    {'a' : [1,2,3,4], 
     'b' : ['41u -428u', '31u - 68u', '11u - 58u', '21u - 318u']
    })

df['b'].str.extract('((?:-[ ]*)[0-9]*)', expand=True)

    This is explained very well in this SO question Commented May 18, 2018 at 18:42

2 Answers

5

It isn't included in the inner group, but it is still included as part of the outer group. A non-capturing group doesn't mean its text is left out of the match; it only means that the group itself is not saved as a separate group in the output. Its text is still captured as part of any enclosing capturing group.
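A minimal sketch of that behaviour, using the standard `re` module directly on the strings from the question (`PATTERN`, `samples`, `group_count` and `captured` are illustrative names, not part of the original code):

```python
import re

# Pattern from the question: the outer (...) is the only capturing group.
PATTERN = r'((?:-[ ]*)[0-9]*)'
samples = ['41u -428u', '31u - 68u', '11u - 58u', '21u - 318u']

# (?:...) does not create a group of its own, so exactly one group exists.
group_count = re.compile(PATTERN).groups

# The hyphen and spaces are still matched inside the outer group,
# so they show up in group 1.
captured = [re.search(PATTERN, s).group(1) for s in samples]

print(group_count)
print(captured)
```

The hyphen and any spaces appear in every captured value because the outer parentheses enclose the non-capturing group.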

Just keep those characters outside the () that define the capturing group:

import pandas as pd

df = pd.DataFrame(
    {'a' : [1,2,3,4], 
     'b' : ['41u -428u', '31u - 68u', '11u - 58u', '21u - 318u']
    })

df['b'].str.extract(r'- ?(\d+)u', expand=True)

     0
0  428
1   68
2   58
3  318

That way you match anything that has a '-' in front (maybe followed by a space), a 'u' behind, and digits between the two.

Where (written with \s? in place of the literal " ?" used in the code),

-      # literal hyphen
\s?    # optional whitespace character; use \s* if you expect more than one
(\d+)  # capture one or more digits 
u      # literal "u"
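The breakdown above can also be kept inside the pattern itself with `re.VERBOSE`, which ignores whitespace and `#` comments in the pattern. This is only a sketch using the standard `re` module; it uses the `\s*` variant, and `PATTERN`, `samples` and `digits` are illustrative names:

```python
import re

# Same pattern as in the breakdown, written as a commented verbose regex.
PATTERN = re.compile(r"""
    -        # literal hyphen
    \s*      # optional whitespace
    (\d+)    # capture one or more digits
    u        # literal u
""", re.VERBOSE)

samples = ['41u -428u', '31u - 68u', '11u - 58u', '21u - 318u']

# Only the digits end up in group 1; the hyphen and spaces are matched
# but sit outside the capturing parentheses.
digits = [PATTERN.search(s).group(1) for s in samples]
print(digits)
```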
4 Comments

This returns a <input>:1: DeprecationWarning: invalid escape sequence \d. with compiler warnings turned on. I suggest you use raw strings.
@coldspeed very good suggestion - I just tested in pyfiddle and they do not show warnings. thx
Hmm, our patterns are ditto. I'll delete my answer ;-)
@coldspeed not quite - I used a space, forgetting about \s all the time
3

I think you're trying too complicated a regex. What about:

df['b'].str.extract(r'-(.*)u', expand=True)

      0
0   428
1    68
2    58
3   318
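One caveat worth checking before relying on this shorter pattern: `.*` matches anything, so the captured value keeps the spaces after the hyphen, and it runs on to the last 'u' in the string. A small sketch with the standard `re` module (the second input string is a made-up example, and `greedy` and `strict` are illustrative names):

```python
import re

greedy = r'-(.*)u'        # pattern from this answer
strict = r'-\s*(\d+)u'    # digits-only variant, for comparison

# The space after the hyphen is part of what .* captures.
with_space = re.search(greedy, '31u - 68u').group(1)

# Hypothetical input with a later 'u': .* is greedy and runs to the last one.
text = '31u - 68u extra u'
greedy_result = re.search(greedy, text).group(1)
strict_result = re.search(strict, text).group(1)

print(repr(with_space))
print(repr(greedy_result))
print(repr(strict_result))
```

If the column only ever holds values shaped like the ones in the question, stripping the result (for example with `.str.strip()`) may be enough.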

3 Comments

This also returns a DeprecationWarning with compiler warnings enabled, because your string isn't a raw-string.
Fair enough, am I right in saying that r'-(.*)u' would solve that? I'm not all that familiar with it TBH
Indeed, it would. ;-)
