Pandas regex - conditional matching

Question

I have a dataframe like this:

import pandas as pd
df = pd.DataFrame({'course_code': ['BUS225 - DC - 02-21-17',
                                   'N320L - EM8 - 01-21-20 - Sect1', 'N495 - LA8 - 05-14-19 - Sect3']})

I am trying to write a regular expression (with pandas) that returns me the following output:

pd.DataFrame({'course_code': ['BUS225', 'N320L', 'N495']})

At the moment here is my code:

df.course_code.str.extract(r'(\A\D\D\D\d\d\d)')

I know I'm missing something here. I'm having a hard time capturing the "L", as well as dealing with course codes that have 3 alphas at the beginning of the string vs. 1 alpha.

Just a reminder, the proposed answer uses pandas' string methods since a regex is not required here; this is quite a straightforward pandas manipulation problem. I'd appreciate an explanation of why my answer is wrong, since I'm struggling to see it myself. Let's please use the voting feature correctly and with common sense.
– yatu, Commented Oct 13, 2020 at 15:43
Every single regex-tagged answer I make gets downvoted, what a coincidence. What a poor example of correct behaviour on SO.
– yatu, Commented Oct 13, 2020 at 15:44

yatu · Accepted Answer · 2020-10-13 15:49:59Z

1

Splitting on the first occurrence of the delimiter ' - ' and keeping the first element should be enough:

df['course_code'] = df.course_code.str.split(' - ', n=1, expand=True)[0]

print(df)
  course_code
0      BUS225
1       N320L
2        N495

edited Oct 13, 2020 at 15:49

answered Oct 13, 2020 at 15:18

yatu


5 Comments

Michael Mathews Jr. Over a year ago

Could you help me with solving this in regex? I'm trying hard to learn this and need help. Your answer is correct and I will credit you.

yatu Over a year ago

Can you specify the structure? Those substrings don't match (\A\D\D\D\d\d\d) @MichaelMathewsJr.

Michael Mathews Jr. Over a year ago

Yea, you can forget my code it doesn't work. I want that exact output but using regex instead.

yatu Over a year ago

The problem is that I need to know how to match that substring, so you'd have to explain what logic to use @MichaelMathewsJr.

yatu Over a year ago

Right now, seems like df.course_code.str.extract(r'^(\w+) ')[0] should be enough @MichaelMathewsJr.
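
For reference, here is a minimal sketch of that regex suggestion run against the sample data from the question. It assumes every course code is a single run of word characters followed by a space, which holds for the three sample rows but may not hold for other data; the variable name codes is only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({'course_code': ['BUS225 - DC - 02-21-17',
                                   'N320L - EM8 - 01-21-20 - Sect1',
                                   'N495 - LA8 - 05-14-19 - Sect3']})

# ^      anchor at the start of the string
# (\w+)  capture the leading run of letters, digits, or underscores
# ' '    require a space right after the captured run
# str.extract returns a DataFrame by default; column 0 holds the first group.
codes = df['course_code'].str.extract(r'^(\w+) ')[0]

print(codes.tolist())  # ['BUS225', 'N320L', 'N495']
```

Rows that do not start with word characters followed by a space would come back as NaN rather than raising an error.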

Mehdi Golzadeh · Accepted Answer · 2020-10-13 15:27:33Z

0

You can use a lambda expression with the split function on the Series. For your problem, splitting on " - " works fine, and you don't need a regex for it:

df = df.assign(course_code = lambda x: x['course_code'].apply(lambda s: s.split(' - ')[0]))

If you want a regex, you should explain the structure of the first part of the string that you want to extract.
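
As a rough sketch of what such a regex could look like, suppose the structure is: one to three letters, then exactly three digits, then an optional trailing letter. That structure is only a guess based on the three sample rows, not something the question confirms, and the names pattern, original and codes are illustrative. The sketch also shows why the pattern from the question misses two rows: \D\D\D demands three non-digit characters, so any code starting with a single letter fails on its second character.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({'course_code': ['BUS225 - DC - 02-21-17',
                                   'N320L - EM8 - 01-21-20 - Sect1',
                                   'N495 - LA8 - 05-14-19 - Sect3']})

# The pattern from the question: three non-digits, then three digits.
# 'N320L' and 'N495' have a digit in position 2, so they do not match.
original = df['course_code'].str.extract(r'(\A\D\D\D\d\d\d)', expand=False)
print(original.isna().tolist())  # [False, True, True]

# Assumed structure (hypothetical): 1-3 letters, 3 digits, optional letter.
pattern = r'^([A-Za-z]{1,3}\d{3}[A-Za-z]?)'

# expand=False returns a Series when the pattern has a single group.
codes = df['course_code'].str.extract(pattern, expand=False)
print(codes.tolist())  # ['BUS225', 'N320L', 'N495']
```

If real course codes can have more letters or a different number of digits, the quantifiers would need adjusting; splitting on ' - ' avoids having to pin the structure down at all.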

answered Oct 13, 2020 at 15:27

Mehdi Golzadeh
