Using pandas extract regex with multiple groups

Question

I am trying to extract a number from a pandas series of strings. For example, consider this series:

import pandas as pd

s = pd.Series(['a-b-1', 'a-b-2', 'c1-d-5', 'c1-d-9', 'e-10-f-1-3.xl', 'e-10-f-2-7.s'])

0            a-b-1
1            a-b-2
2           c1-d-5
3           c1-d-9
4    e-10-f-1-3.xl
5     e-10-f-2-7.s
dtype: object

There are 6 rows and three known string formats (templates). The goal is to extract one number from each row, depending on which format the string follows. Here is what I came up with:

s.str.extract('a-b-([0-9])|c1-d-([0-9])|e-10-f-[0-9]-([0-9])')

and this correctly extracts the numbers that I want from each row:

    0   1   2
0   1   NaN NaN
1   2   NaN NaN
2   NaN 5   NaN
3   NaN 9   NaN
4   NaN NaN 3
5   NaN NaN 7

However, since I have three groups in the regex, I have 3 columns, and here comes the question:

Can I write a regex that has one group or that can generate a single column, or do I need to coalesce the columns into one, and how can I do that without a loop if necessary?

Desired outcome would be a series like:

0    1
1    2
2    5
3    9
4    3
5    7

Do you mean switching the alternatives, like this: (?:a-b|c1-d|e-10-f-[0-9])-([0-9]) regex101.com/r/GPFI94/1
– The fourth bird, Commented May 15, 2020 at 15:30
Yes, that's the answer I'm looking for. Post it as an answer if you'd like. Thanks!
– Gerges, Commented May 15, 2020 at 15:35

Quang Hoang · Accepted Answer · 2020-05-15 15:37:23Z

3

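The simplest thing to do is to back-fill (or forward-fill) across the columns with bfill/ffill and then take the first (or last) column: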

(s.str.extract('a-b-([0-9])|c1-d-([0-9])|e-10-f-[0-9]-([0-9])')
  .bfill(axis=1)
  [0]
)

Output:

0    1
1    2
2    5
3    9
4    3
5    7
Name: 0, dtype: object
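For reference, here is the same coalescing step as a self-contained sketch. The variable names (extracted, coalesced, numbers) and the final integer conversion are my own additions for illustration, not part of the answer above.

```python
import pandas as pd

s = pd.Series(['a-b-1', 'a-b-2', 'c1-d-5', 'c1-d-9', 'e-10-f-1-3.xl', 'e-10-f-2-7.s'])

# One capture group per template, so extract() returns three columns,
# with exactly one non-null value per row.
extracted = s.str.extract('a-b-([0-9])|c1-d-([0-9])|e-10-f-[0-9]-([0-9])')

# Back-fill along each row so the non-null value ends up in column 0.
coalesced = extracted.bfill(axis=1)[0]

# The captured values are strings; convert only if integers are needed
# (an assumption about the intended use, not stated in the question).
numbers = coalesced.astype(int)
print(numbers.tolist())  # [1, 2, 5, 9, 3, 7]
```

No explicit loop is needed: bfill(axis=1) does the coalescing row by row.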

Another way is to use optional non-capturing groups:

s.str.extract('(?:a-b-)?(?:c1-d-)?(?:e-10-f-[0-9]-)?([0-9])')

Output:

   0
0  1
1  2
2  5
3  9
4  3
5  7
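One thing to watch with the optional-group version: since every prefix is optional, the pattern reduces to "the first digit found" for any string that does not follow one of the three templates. The sketch below compares it with the alternation pattern suggested in the comments; the input 'x-y-4' is a made-up string outside the known templates, used only to show the difference.

```python
import pandas as pd

# Hypothetical input: the last string matches none of the three known templates.
s = pd.Series(['a-b-1', 'c1-d-5', 'e-10-f-1-3.xl', 'x-y-4'])

optional_prefixes = '(?:a-b-)?(?:c1-d-)?(?:e-10-f-[0-9]-)?([0-9])'
required_prefix = '(?:a-b|c1-d|e-10-f-[0-9])-([0-9])'

# expand=False returns a Series when the pattern has a single group.
loose = s.str.extract(optional_prefixes, expand=False)
strict = s.str.extract(required_prefix, expand=False)

print(loose.tolist())           # the unknown string still yields '4'
print(pd.isna(strict.iloc[3]))  # True: no template matched, so the result is missing
```

If every string is guaranteed to follow one of the templates, as in the question, both patterns give the same result.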

edited May 15, 2020 at 15:37

answered May 15, 2020 at 15:31

Quang Hoang


The fourth bird · Accepted Answer · 2020-05-15 15:46:15Z

2

You could use a single capturing group at the end, and put the 3 prefixes as alternatives in a non-capturing group (?:...)

As they all end with a hyphen, you could move that hyphen to after the non-capturing group to shorten it a bit.

(?:a-b|c1-d|e-10-f-[0-9])-([0-9])


s.str.extract('(?:a-b|c1-d|e-10-f-[0-9])-([0-9])')

Output:

   0
0  1
1  2
2  5
3  9
4  3
5  7
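Since the question asks for a series, a small variation may help: with a single capture group, passing expand=False to str.extract returns a Series instead of a one-column DataFrame. This is a minimal sketch; the names pattern and result are mine.

```python
import pandas as pd

s = pd.Series(['a-b-1', 'a-b-2', 'c1-d-5', 'c1-d-9', 'e-10-f-1-3.xl', 'e-10-f-2-7.s'])

# The three prefixes as alternatives in one non-capturing group,
# followed by the shared hyphen and a single capturing group.
pattern = '(?:a-b|c1-d|e-10-f-[0-9])-([0-9])'

# With one group, expand=False yields a Series rather than a DataFrame.
result = s.str.extract(pattern, expand=False)
print(result.tolist())  # ['1', '2', '5', '9', '3', '7']
```

The values are still strings, so a conversion such as result.astype(int) would be needed for numeric work.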

edited May 15, 2020 at 15:46

answered May 15, 2020 at 15:37

The fourth bird
