1

I have a data frame similar to the one listed below. For some reason, each team is listed twice, one listing corresponding to each column.

import pandas as pd
import numpy as np
d = {'Team': ['1', '2', '3', '1', '2', '3'], 'Points for': [5, 10, 15, np.nan,np.nan,np.nan], 'Points against' : [np.nan,np.nan,np.nan, 3, 6, 9]}
df = pd.DataFrame(data=d)




  Team  Points for  Points against
0    1           5             NaN
1    2          10             NaN
2    3          15             NaN
3    1         NaN               3
4    2         NaN               6
5    3         NaN               9

How can I just combine rows of duplicate team names so that there are no missing values? This is what I would like:

  Team  Points for  Points against
0    1           5               3
1    2          10               6
2    3          15               9

I have been trying to figure it out with pandas, but can't seem to get it. Thanks!

3
  • Does this answer your question? How to combine duplicate rows in pandas? Commented Apr 12, 2020 at 2:41
  • Just remove all the NaNs from your input and remove the duplicate index values: d = {'Team': ['1', '2', '3'], 'Points for': [5, 10, 15], 'Points against' : [3, 6, 9]}. Or are you saying the data comes to you in this dirty format and you want help cleaning it? Ideally you'd fix whatever code produces this dirty data. Commented Apr 12, 2020 at 2:42
  • Unfortunately this is the way the data is for some odd reason. Commented Apr 12, 2020 at 2:44

4 Answers

1

I made changes to your code, replacing string 'Nan' with numpy's nan.

One solution is to melt the data, drop the null entries, and pivot back to wide from long:

df = (df
      .melt('Team')
      .dropna()
      .pivot(index='Team', columns='variable', values='value')
      .reset_index()
      .rename_axis(None,axis='columns')
      .astype(int)
     )

df


  Team  Points against  Points for
0    1               3           5
1    2               6          10
2    3               9          15
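To make the melt/pivot round trip easier to follow, here is a minimal sketch that keeps the intermediate long-format frame and restores the original column order afterwards. It assumes the question's sample data; the names long_df and wide are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Team": ["1", "2", "3", "1", "2", "3"],
    "Points for": [5, 10, 15, np.nan, np.nan, np.nan],
    "Points against": [np.nan, np.nan, np.nan, 3, 6, 9],
})

# Long format: one row per (Team, column name, value); NaN rows dropped.
long_df = df.melt(id_vars="Team").dropna()

# Back to wide format: exactly one value remains per Team/column pair.
wide = long_df.pivot(index="Team", columns="variable", values="value")

# pivot sorts the new columns alphabetically; restore the original order.
wide = wide[["Points for", "Points against"]].astype(int).reset_index()
wide = wide.rename_axis(None, axis="columns")
print(wide)
```

Note that pivot raises an error if a Team/column pair still has more than one value after dropna, so this approach relies on each team having exactly one real value per column.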

0

One way is to use groupby:

df = df.replace("Nan", np.nan)
new_df = df.groupby("Team").first()
print(new_df)

Output:

      Points for  Points against
Team                            
1            5.0             3.0
2           10.0             6.0
3           15.0             9.0
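This works because first() returns the first non-null value in each column for every group. As a sketch under the question's sample data, the result can also be brought closer to the desired output by keeping Team as a regular column and casting the point columns back to integers (the name combined is illustrative only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Team": ["1", "2", "3", "1", "2", "3"],
    "Points for": [5, 10, 15, np.nan, np.nan, np.nan],
    "Points against": [np.nan, np.nan, np.nan, 3, 6, 9],
})

# first() skips NaN, so each team keeps its one real value per column.
# as_index=False keeps 'Team' as a column instead of the index.
combined = df.groupby("Team", as_index=False).first()

# The NaNs forced a float dtype; cast back now that no gaps remain.
combined = combined.astype({"Points for": int, "Points against": int})
print(combined)
```

The integer cast only succeeds when every team has a value in both columns; if any gap remains, it is safer to leave the columns as floats.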


0

You need to group by the unique identifiers; taking the max of each column then skips the NaN values. If there is also a game ID or date or something like that, you might need to group on that as well.

df.groupby('Team').agg({'Points for': 'max', 'Points against': 'max'})
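As a rough illustration of grouping on more than one identifier, the sketch below adds a hypothetical Game column, which is not part of the question's data, and groups on both keys:

```python
import numpy as np
import pandas as pd

# 'Game' is a hypothetical extra identifier, added only for illustration.
df = pd.DataFrame({
    "Team": ["1", "2", "1", "2", "1", "2", "1", "2"],
    "Game": [1, 1, 2, 2, 1, 1, 2, 2],
    "Points for": [5, 10, 7, 12, np.nan, np.nan, np.nan, np.nan],
    "Points against": [np.nan, np.nan, np.nan, np.nan, 3, 6, 4, 8],
})

# Group on both identifiers so each team/game pair collapses to one row.
per_game = (
    df.groupby(["Team", "Game"], as_index=False)
      .agg({"Points for": "max", "Points against": "max"})
)
print(per_game)
```

Grouping on Team alone here would merge the two games and keep only the larger score from each column, which is why the extra key matters.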


0
pd.pivot_table(df, values=['Points for', 'Points against'], index=['Team'], aggfunc='sum')[['Points for', 'Points against']]

Output

      Points for  Points against
Team                            
1            5.0             3.0
2           10.0             6.0
3           15.0             9.0
