I have a dataframe like this:

df = pd.DataFrame({'course_code': ['BUS225 - DC - 02-21-17',
                                   'N320L - EM8 - 01-21-20 - Sect1',
                                   'N495 - LA8 - 05-14-19 - Sect3']})

I am trying to write a regular expression (with pandas) that returns the following output:

pd.DataFrame({'course_code': ['BUS225', 'N320L', 'N495']})

At the moment here is my code:

df.course_code.str.extract(r'(\A\D\D\D\d\d\d)')

I know I'm missing something here. I'm having a hard time capturing the "L", as well as handling course codes that start with 3 letters versus 1 letter.

5 Comments
  • Use df.course_code.str.split(r' - ').str[0] Commented Oct 13, 2020 at 15:14
  • Or, df.course_code.str.extract(r'^([A-Z]+\d+[A-Z]*)') Commented Oct 13, 2020 at 15:22
  • Thank you Wiktor! Commented Oct 13, 2020 at 15:28
  • Just a reminder: the proposed answer uses pandas' string methods because regex is not required here; this is a fairly straightforward pandas manipulation problem. I'd appreciate an explanation of why my answer is wrong, since I'm struggling to see it myself. Let's please use the voting feature correctly and with common sense. Commented Oct 13, 2020 at 15:43
  • Every single regex-tagged answer I post gets downvoted; what a coincidence. What a poor example of correct behaviour on SO. Commented Oct 13, 2020 at 15:44

2 Answers

Splitting on the first occurrence of the delimiter ' - ' and keeping the first element should be enough:

df['course_code'] = df.course_code.str.split(' - ', n=1, expand=True)[0]

print(df)
  course_code
0      BUS225
1       N320L
2        N495

5 Comments

Could you help me with solving this in regex? I'm trying hard to learn this and need help. Your answer is correct and I will credit you.
Can you specify the structure? Those substrings don't match (\A\D\D\D\d\d\d) @MichaelMathewsJr.
Yeah, you can forget my code; it doesn't work. I want that exact output, but using regex instead.
The problem is that I need to know how to match that substring, so you'd have to explain what logic to use @MichaelMathewsJr.
Right now, it seems like df.course_code.str.extract(r'^(\w+) ')[0] should be enough @MichaelMathewsJr.

You can use a lambda expression with the split function on the series. For your problem, splitting by " - " works fine and you don't need a regex:

df = df.assign(course_code = lambda x: x['course_code'].apply(lambda s: s.split(' - ')[0]))

If you want a regex, you should explain the structure of the first part of the string that you want to extract.
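For reference, here is a minimal sketch of the regex route, using the pattern suggested in the comments on the question. It assumes that every course code consists of one or more uppercase letters, then one or more digits, then optional trailing uppercase letters; that structure is inferred from the three sample rows only and has not been confirmed by the asker. The variable names `pattern` and `codes` are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({'course_code': ['BUS225 - DC - 02-21-17',
                                   'N320L - EM8 - 01-21-20 - Sect1',
                                   'N495 - LA8 - 05-14-19 - Sect3']})

# Assumed structure (inferred from the sample data):
#   ^        start of string
#   [A-Z]+   one or more uppercase letters ("BUS", "N")
#   \d+      one or more digits ("225", "320", "495")
#   [A-Z]*   optional trailing letters (the "L" in "N320L")
pattern = r'^([A-Z]+\d+[A-Z]*)'

# expand=False returns a Series instead of a one-column DataFrame.
codes = df['course_code'].str.extract(pattern, expand=False)
print(codes.tolist())  # ['BUS225', 'N320L', 'N495']
```

Rows that do not fit the assumed structure come back as NaN, so it is worth checking `codes.isna().any()` on the real data before overwriting the column.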
