I'm working with college basketball data. The two fields I have right now are the raw matchup and the predicted winner.

RawMatchup PredictedWinner
MinnesotaLouisville Louisville

I want to use the Predicted Winner to separate out the two teams in the RawMatchup column. Currently I'm using replace to remove the Predicted Winner from the RawMatchup.

RawMatchup.replace(PredictedWinner, '')
>>Minnesota

This works for the vast majority of the rows in my dataset. The problem I'm having is when the two schools partially share a name:

RawMatchup PredictedWinner
GeorgiaGeorgia Tech Georgia
North Carolina CentralNorth Carolina North Carolina

Using split for these two rows returns just 'Tech' and 'Central' (instead of 'Georgia Tech' and 'North Carolina Central'). How can I best separate the Predicted Winner from the Raw Matchup while preserving the correct school names?

  • What would you want them to return, and by what logic do you arrive at that decision? Commented Dec 12, 2020 at 19:00
  • The raw matchup feels kinda odd. It's actually storing two fields in one field. You'd expect at least a semicolon, slash, or some other delimiter between the two. If you can't control how the RawMatchup is generated, then the clearest pattern I see is to split the RawMatchup manually (i.e. with a loop) by looking for adjacent lower-case and upper-case letters (e.g. GeorgiaGeorgia –> Georgia Georgia; CentralNorth –> Central North). Once you've performed that cutoff, it becomes clear which one to remove. Commented Dec 12, 2020 at 19:07
  • Yeah the RawMatchup is just part of the files I'm working with - I didn't generate it on my own. PredictedWinner was basically the same - one field that stored about 5 different values. Luckily that one was easy enough to split. Commented Dec 12, 2020 at 19:38
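
The boundary idea from the comment above can be sketched roughly as follows. This is only an illustration, not something from the original thread: the names `split_matchup` and `other_team` are made up here, and the heuristic assumes the first team ends in a lower-case letter and the second starts with an upper-case one. It needs Python 3.7+ (splitting on an empty match), and it gives up on names with an internal capital (e.g. "McNeese") or an all-caps first team.

```python
import re

# Zero-width boundary between a lower-case letter and an upper-case letter.
_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])")

def split_matchup(raw_matchup):
    """Split a concatenated matchup at every lower-case/upper-case boundary."""
    return _BOUNDARY.split(raw_matchup)

def other_team(raw_matchup, predicted_winner):
    """Return the team that is not the predicted winner, or None if unclear."""
    teams = split_matchup(raw_matchup)
    if len(teams) != 2 or predicted_winner not in teams:
        return None  # heuristic failed; handle these rows some other way
    teams.remove(predicted_winner)
    return teams[0]

print(other_team("GeorgiaGeorgia Tech", "Georgia"))
print(other_team("North Carolina CentralNorth Carolina", "North Carolina"))
```

Rows where the function returns None would still need a fallback, so this is probably best treated as a sanity check rather than the main approach.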

1 Answer

I wouldn't use split because IMO it's intended for a different purpose (usually splitting on standard separators such as commas or whitespace). In this case, what you want is to remove PredictedWinner from RawMatchup only once. Therefore I'd go for str.replace and re.sub to achieve the goal.

It seems that PredictedWinner is either at the end or at the beginning of RawMatchup. We could take advantage of that to define the following function:

import re

def remove_winner_from_raw(raw_matchup, predicted_winner):
    if raw_matchup.endswith(predicted_winner):
        # Anchor at the end; re.escape keeps characters such as "." or "(" literal
        res = re.sub(f"{re.escape(predicted_winner)}$", '', raw_matchup)
    else:
        res = raw_matchup.replace(predicted_winner, '', 1)  # just the first occurrence
    return res

print(remove_winner_from_raw("North Carolina CentralNorth Carolina", "North Carolina"))
# Output: North Carolina Central

print(remove_winner_from_raw("GeorgiaGeorgia Tech", "Georgia"))
# Output: Georgia Tech
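
As a side note, the same end-or-beginning idea also works without a regular expression, which sidesteps any trouble with school names that contain regex metacharacters (a hypothetical "Miami (FL)", say). Below is a minimal sketch using plain slicing; `remove_winner` and the `rows` list are illustrative names, with the sample values taken from the question.

```python
def remove_winner(raw_matchup, predicted_winner):
    """Strip predicted_winner from one end of raw_matchup (no regex involved)."""
    if raw_matchup.endswith(predicted_winner):
        return raw_matchup[: len(raw_matchup) - len(predicted_winner)]
    if raw_matchup.startswith(predicted_winner):
        return raw_matchup[len(predicted_winner):]
    return raw_matchup  # winner not found at either end; left unchanged

# Sample rows from the question: (RawMatchup, PredictedWinner)
rows = [
    ("MinnesotaLouisville", "Louisville"),
    ("GeorgiaGeorgia Tech", "Georgia"),
    ("North Carolina CentralNorth Carolina", "North Carolina"),
]
for raw, winner in rows:
    print(remove_winner(raw, winner))
```

Checking the end first matters for rows like the North Carolina one, where the winner's name also appears at the start of the other school's name.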

