I am trying to extract a number from a pandas series of strings. For example consider this series:

import pandas as pd
s = pd.Series(['a-b-1', 'a-b-2', 'c1-d-5', 'c1-d-9', 'e-10-f-1-3.xl', 'e-10-f-2-7.s'])

0            a-b-1
1            a-b-2
2           c1-d-5
3           c1-d-9
4    e-10-f-1-3.xl
5     e-10-f-2-7.s
dtype: object

There are 6 rows, and three string formats/templates (known). The goal is to extract a number for each of the rows depending on the string. Here is what I came up with:

s.str.extract('a-b-([0-9])|c1-d-([0-9])|e-10-f-[0-9]-([0-9])')

and this correctly extracts the numbers that I want from each row:

    0   1   2
0   1   NaN NaN
1   2   NaN NaN
2   NaN 5   NaN
3   NaN 9   NaN
4   NaN NaN 3
5   NaN NaN 7

However, since I have three groups in the regex, I have 3 columns, and here comes the question:

Can I write a regex that has one group or that can generate a single column, or do I need to coalesce the columns into one, and how can I do that without a loop if necessary?

Desired outcome would be a series like:

0   1
1   2
2   5
3   9
4   3
5   7
  • Do you mean something like grouping the alternatives, (?:a-b|c1-d|e-10-f-[0-9])-([0-9])? regex101.com/r/GPFI94/1 Commented May 15, 2020 at 15:30
  • Yes, that's the answer I'm looking for, put it as an answer if you'd like. Thanks! Commented May 15, 2020 at 15:35

2 Answers


The simplest thing to do is to bfill (or ffill) across the columns and then take the first (or last) column:

(s.str.extract('a-b-([0-9])|c1-d-([0-9])|e-10-f-[0-9]-([0-9])')
  .bfill(axis=1)
  [0]
)

Output:

0    1
1    2
2    5
3    9
4    3
5    7
Name: 0, dtype: object
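The back-fill works here because each row has exactly one non-null group, so filling along axis=1 pulls that value into column 0. As a rough sketch of the same coalescing done step by step, the following also folds the columns with combine_first as an alternative; the variable names (groups, via_bfill, via_combine) are illustrative and not from the original post.

```python
from functools import reduce

import pandas as pd

s = pd.Series(['a-b-1', 'a-b-2', 'c1-d-5', 'c1-d-9', 'e-10-f-1-3.xl', 'e-10-f-2-7.s'])

# One capturing group per template, so the result has three columns.
groups = s.str.extract(r'a-b-([0-9])|c1-d-([0-9])|e-10-f-[0-9]-([0-9])')

# Back-fill along each row: the single non-null value ends up in column 0.
via_bfill = groups.bfill(axis=1)[0]

# Alternative without bfill: fold the columns with combine_first, which keeps
# the left value where it is present and falls back to the right otherwise.
via_combine = reduce(
    lambda left, right: left.combine_first(right),
    [groups[col] for col in groups.columns],
)

print(via_bfill.tolist())
print(via_combine.tolist())
```

Both variants avoid an explicit loop over rows; the extracted values are still strings at this point.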

Another way is to use optional non-capturing groups:

s.str.extract('(?:a-b-)?(?:c1-d-)?(?:e-10-f-[0-9]-)?([0-9])')

Output:

   0
0  1
1  2
2  5
3  9
4  3
5  7
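One caveat with the optional groups: since every prefix may be skipped, the pattern can match any digit in the string, so it relies on all rows following one of the known templates. A small sketch of the difference, using a made-up row 'x-4' that fits none of the templates and comparing against the grouped alternation suggested in the comments:

```python
import pandas as pd

# 'x-4' is a hypothetical row that follows none of the three templates.
t = pd.Series(['x-4', 'a-b-1'])

# Every prefix is optional, so the first digit found anywhere is captured.
loose = t.str.extract(r'(?:a-b-)?(?:c1-d-)?(?:e-10-f-[0-9]-)?([0-9])', expand=False)

# One of the prefixes is required, so the unknown row yields NaN instead.
strict = t.str.extract(r'(?:a-b|c1-d|e-10-f-[0-9])-([0-9])', expand=False)

print(loose.tolist())
print(strict.tolist())
```

If unexpected rows should show up as missing values instead of being silently parsed, the stricter pattern is probably the safer choice.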

You could use a single capturing group at the end and put the three prefixes as alternatives inside a non-capturing group, (?:...).

Since they all end with a hyphen, you can move the hyphen to after the non-capturing group to shorten the pattern a bit.

(?:a-b|c1-d|e-10-f-[0-9])-([0-9])

s.str.extract('(?:a-b|c1-d|e-10-f-[0-9])-([0-9])')

Output

   0
0  1
1  2
2  5
3  9
4  3
5  7
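The result above is a one-column DataFrame, while the question asks for a Series. A minimal sketch of one way to get there, assuming the expand=False option of str.extract and a numeric conversion with pd.to_numeric (the names digits and numbers are illustrative):

```python
import pandas as pd

s = pd.Series(['a-b-1', 'a-b-2', 'c1-d-5', 'c1-d-9', 'e-10-f-1-3.xl', 'e-10-f-2-7.s'])

# With exactly one capturing group, expand=False returns a Series
# instead of a one-column DataFrame.
digits = s.str.extract(r'(?:a-b|c1-d|e-10-f-[0-9])-([0-9])', expand=False)

# The extracted values are strings; convert them to numbers if needed.
# Rows that did not match would stay NaN after the conversion.
numbers = pd.to_numeric(digits)

print(numbers.tolist())
```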
