My head hurts after reading post after post on this, and I still cannot figure out how to solve it.

I have 2 pandas dataframes containing sports matches (simplified here):

A: Date, HomeTeam, AwayTeam
B: Date, HomeTeam, AwayTeam, HomeScore, AwayScore

B must be merged into A.

A contains more matches than B, so A is larger than B. The size of A must be preserved (consider A our "master").

B must fill in the HomeScore and AwayScore for each row where Date, HomeTeam and AwayTeam match.

How can I merge these two properly?

I have considered using iterrows() or pandas conditions like pd[(a == b)], but I cannot see how to solve it.

  • Can you share the code that creates your pandas DataFrames? Commented Jul 30, 2019 at 8:29

2 Answers

You can use merge() with the option how = 'left' to do a left join, which keeps every row of A.

Here is what it could look like:

import pandas as pd

A = pd.DataFrame({'Date' : ['2019-06-12', '2019-08-06', '2019-08-06'],
                  'HomeTeam' : ['Team A', 'Team B', 'Team C'],
                  'AwayTeam' : ['Team D', 'Team E', 'Team F']})
B = pd.DataFrame({'Date' : ['2019-06-12', '2019-08-06'],
                  'HomeTeam' : ['Team A', 'Team B'],
                  'AwayTeam' : ['Team D', 'Team E'],
                  'HomeScore' : [54, 64], 'AwayScore' : [12, 16]})

A.merge(B, on = ['Date', 'HomeTeam', 'AwayTeam'], how = 'left')

Output:

         Date HomeTeam AwayTeam  HomeScore  AwayScore
0  2019-06-12   Team A   Team D       54.0       12.0
1  2019-08-06   Team B   Team E       64.0       16.0
2  2019-08-06   Team C   Team F        NaN        NaN
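To check that the size of A really is preserved, one option is to let merge() validate the keys and flag where each row came from. The sketch below reuses the sample data from this answer; the `validate` and `indicator` arguments are an addition of mine, not part of the original answer, and the variable names `keys`, `merged` and `unmatched` are only illustrative.

```python
import pandas as pd

A = pd.DataFrame({
    "Date": ["2019-06-12", "2019-08-06", "2019-08-06"],
    "HomeTeam": ["Team A", "Team B", "Team C"],
    "AwayTeam": ["Team D", "Team E", "Team F"],
})
B = pd.DataFrame({
    "Date": ["2019-06-12", "2019-08-06"],
    "HomeTeam": ["Team A", "Team B"],
    "AwayTeam": ["Team D", "Team E"],
    "HomeScore": [54, 64],
    "AwayScore": [12, 16],
})

keys = ["Date", "HomeTeam", "AwayTeam"]

# validate="many_to_one" raises MergeError if a key combination is duplicated
# in B, which is the case that would add rows to the result.
# indicator=True adds a "_merge" column: "both" or "left_only" for a left join.
merged = A.merge(B, on=keys, how="left", validate="many_to_one", indicator=True)

# The left join keeps exactly one output row per row of A.
assert len(merged) == len(A)

# Matches present in A that found no score in B.
unmatched = merged.loc[merged["_merge"] == "left_only", keys]
print(unmatched)
```

If B can legitimately contain the same match twice, the validation will fail and B would need to be de-duplicated first; how to do that depends on the data.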

You can use pd.DataFrame.join:

idxs = ['Date', 'HomeTeam', 'AwayTeam']
joined = A.set_index(idxs).join(B.set_index(idxs), how='left').reset_index()

This will produce a dataframe with as many rows as A (provided each combination of keys appears at most once in B), plus the extra columns from B. Those columns take their values from B, or NaN if the corresponding combination of values in idxs doesn't appear there. In SQL terms, you are performing a left join.
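For reference, here is the same join as a self-contained sketch. The sample frames are made up for illustration (they mirror the simplified columns from the question), so treat the data as an assumption rather than the asker's real dataset.

```python
import pandas as pd

# Hypothetical sample data with the columns described in the question.
A = pd.DataFrame({
    "Date": ["2019-06-12", "2019-08-06", "2019-08-06"],
    "HomeTeam": ["Team A", "Team B", "Team C"],
    "AwayTeam": ["Team D", "Team E", "Team F"],
})
B = pd.DataFrame({
    "Date": ["2019-06-12", "2019-08-06"],
    "HomeTeam": ["Team A", "Team B"],
    "AwayTeam": ["Team D", "Team E"],
    "HomeScore": [54, 64],
    "AwayScore": [12, 16],
})

idxs = ["Date", "HomeTeam", "AwayTeam"]

# Move the key columns into the index on both sides, join on that index,
# then turn the index back into ordinary columns.
joined = A.set_index(idxs).join(B.set_index(idxs), how="left").reset_index()

print(joined)
print(len(joined) == len(A))
```

The score columns come back as floats because the unmatched row has to hold NaN.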

5 Comments

It seems to be working. Is there a way to specify that only the "HomeScore" and "AwayScore" should be joined? (The dataset contains more columns)
Use B[['HomeScore', 'AwayScore']] instead of plain B
When I do, I get this: AttributeError: 'Series' object has no attribute 'set_index'
I had a typo in the comment, now it's fixed. Are you sure you are indexing B with a list of column names? This shouldn't output a pd.Series but a new pd.DataFrame instead. Anyway, if the problem persists, consider posting a new question where you can fully explain the details
I solved it by doing: b_df = b_df[["Date", "HomeTeam", "AwayTeam", "HomeScore", "AwayScore"]]
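The fix from the last comment can be sketched as follows: keep the key columns together with the two score columns when subsetting B, because the join needs the keys to match on. The extra `Attendance` column and the names `keys`, `score_cols` and `result` are hypothetical, added only to show a column being left out.

```python
import pandas as pd

A = pd.DataFrame({
    "Date": ["2019-06-12", "2019-08-06", "2019-08-06"],
    "HomeTeam": ["Team A", "Team B", "Team C"],
    "AwayTeam": ["Team D", "Team E", "Team F"],
})
B = pd.DataFrame({
    "Date": ["2019-06-12", "2019-08-06"],
    "HomeTeam": ["Team A", "Team B"],
    "AwayTeam": ["Team D", "Team E"],
    "HomeScore": [54, 64],
    "AwayScore": [12, 16],
    "Attendance": [1000, 2000],  # hypothetical extra column we do not want
})

keys = ["Date", "HomeTeam", "AwayTeam"]
score_cols = ["HomeScore", "AwayScore"]

# Indexing with a list of names returns a DataFrame (a single name would
# return a Series). The keys must be part of the subset to join on them.
result = A.merge(B[keys + score_cols], on=keys, how="left")

print(result.columns.tolist())
```

Selecting only `B[score_cols]` would drop the key columns, so neither `merge(on=keys)` nor `set_index(keys)` could find them.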
