Pythonic Way to Split A String Based On SubString

Question

I'm working with college basketball data. The two fields I have right now are the raw matchup and the predicted winner.

RawMatchup	PredictedWinner
MinnesotaLouisville	Louisville

I want to use the Predicted Winner to separate out the two teams in the RawMatchup column. Currently I'm using replace to remove the Predicted Winner from the RawMatchup.

RawMatchup.replace(PredictedWinner, '')
>>Minnesota

This works for the vast majority of the rows in my dataset. The problem I'm having is when both schools partially share a name:

RawMatchup	PredictedWinner
GeorgiaGeorgia Tech	Georgia
North Carolina CentralNorth Carolina	North Carolina

Using replace for these two rows returns just 'Tech' and 'Central' (instead of 'Georgia Tech' and 'North Carolina Central'). How can I best separate the Predicted Winner from the Raw Matchup while preserving the correct school names?

What would you want them to return, and by what logic do you arrive at that decision?
– Scott Hunter, Commented Dec 12, 2020 at 19:00
The raw matchup feels kinda odd. It's actually storing two fields in one field. You'd expect at least a semicolon, slash, or some other delimiter between the two. If you can't control how the RawMatchup is generated, then the clearest pattern I see is to split the RawMatchup manually (i.e. with a loop) by looking for a lower-case letter immediately followed by an upper-case letter (e.g. GeorgiaGeorgia -> Georgia Georgia; CentralNorth -> Central North). Once you've made that cut, it becomes clear which one to remove.
– TrebledJ, Commented Dec 12, 2020 at 19:07
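The case-boundary idea from the comment above could be sketched roughly as follows. This is only an illustration: the helper names `split_on_case_boundary` and `other_team_by_case_boundary` are hypothetical, the sketch uses a regular expression instead of the manual loop the comment mentions, and it assumes Python 3.7+ (where `re.split` accepts patterns that match an empty string). It also assumes no team name contains a lower-case letter directly followed by an upper-case one; a name like "McNeese" would be cut in the wrong place, in which case the helper returns None.

```python
import re

# Zero-width boundary: a lower-case letter immediately followed by an upper-case one.
_CASE_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])")

def split_on_case_boundary(raw_matchup):
    """Split e.g. 'GeorgiaGeorgia Tech' into ['Georgia', 'Georgia Tech']."""
    return _CASE_BOUNDARY.split(raw_matchup)

def other_team_by_case_boundary(raw_matchup, predicted_winner):
    """Return the team that is not the predicted winner, or None if the split is unclear."""
    teams = split_on_case_boundary(raw_matchup)
    if len(teams) != 2 or predicted_winner not in teams:
        return None  # heuristic failed; handle this row some other way
    teams.remove(predicted_winner)
    return teams[0]

print(split_on_case_boundary("GeorgiaGeorgia Tech"))
print(other_team_by_case_boundary("North Carolina CentralNorth Carolina", "North Carolina"))
```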
Yeah the RawMatchup is just part of the files I'm working with - I didn't generate it on my own. PredictedWinner was basically the same - one field that stored about 5 different values. Luckily that one was easy enough to split.
– BenjaminFranklinGates, Commented Dec 12, 2020 at 19:38

Turtlean · Accepted Answer · 2020-12-12 20:02:40Z

I wouldn't use split because IMO it's intended for a different purpose (usually splitting elements on standard separators such as commas or whitespace). In this case, what you want is to remove PredictedWinner from RawMatchup only once. Therefore I'd go for replace and sub to achieve the goal.

It seems that PredictedWinner is either at the end or at the beginning of RawMatchup. We could take advantage of that to define the following function:

import re

def remove_winner_from_raw(raw_matchup, predicted_winner):
    if raw_matchup.endswith(predicted_winner):
        # Escape the name so characters such as "." or "(" are matched literally
        res = re.sub(f"{re.escape(predicted_winner)}$", '', raw_matchup)
    else:
        res = raw_matchup.replace(predicted_winner, '', 1)  # Just the 1st occurrence
    return res

print(remove_winner_from_raw("North Carolina CentralNorth Carolina", "North Carolina"))
# Output: North Carolina Central

print(remove_winner_from_raw("GeorgiaGeorgia Tech", "Georgia"))
# Output: Georgia Tech
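As a variation on the same idea, here is a minimal sketch that avoids regular expressions altogether by slicing the winner off whichever end it sits on. The function name `strip_winner` is hypothetical, and the sketch assumes the winner always appears at the very start or the very end of the raw matchup; it raises an error otherwise, so unexpected rows are not silently mangled. The sample rows are the ones from the question.

```python
def strip_winner(raw_matchup, predicted_winner):
    """Return the other team by cutting the winner off one end of the matchup."""
    if raw_matchup.endswith(predicted_winner):
        # Keep everything before the trailing winner
        return raw_matchup[:len(raw_matchup) - len(predicted_winner)]
    if raw_matchup.startswith(predicted_winner):
        # Keep everything after the leading winner
        return raw_matchup[len(predicted_winner):]
    raise ValueError(f"{predicted_winner!r} is not at either end of {raw_matchup!r}")

# Rows taken from the question
rows = [
    ("MinnesotaLouisville", "Louisville"),
    ("GeorgiaGeorgia Tech", "Georgia"),
    ("North Carolina CentralNorth Carolina", "North Carolina"),
]

for raw, winner in rows:
    print(strip_winner(raw, winner))
# Output:
# Minnesota
# Georgia Tech
# North Carolina Central
```

Checking the end first matters for a row like "North Carolina CentralNorth Carolina", where the winner's name also appears at the start as part of the other team's name.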

Docs for:

str.replace: https://docs.python.org/3/library/stdtypes.html
re.sub: https://docs.python.org/3/library/re.html
