
Given three dataframes:

df1 = pd.DataFrame({'A': [5, 0], 'B': [2, 4], 'C': 'dog'})
df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3], 'C': 'dog'})
df3 = pd.DataFrame({'A': [2, 1], 'B': [5, 1], 'C': 'dog'})

how can one combine them into a single dataframe by adding the values of the numeric columns, so that the result becomes:

pd.DataFrame({'A': [8, 2], 'B': [10, 8], 'C': 'dog'})

for this example? My problem is that I also have columns that are identical but cannot be summed (like 'C' here).

  • Are there only 3 dfs, or more? Commented Jul 17, 2019 at 12:36
  • In my case, only 3. Commented Jul 17, 2019 at 12:36
  • Also, what happens if column C has different data in the 3 dfs? Commented Jul 17, 2019 at 12:37
  • If the values are the same, it's not problematic. But what is the decision if one is "dog" and one is "cat"? Commented Jul 17, 2019 at 12:38
  • Then we put ['dog', 'cat'] in a list on that row. Commented Jul 17, 2019 at 12:44
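
To see why the string column is the sticking point, note that a naive element-wise sum also "adds" the strings, concatenating them instead of keeping one value. A minimal sketch using the frames from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [5, 0], 'B': [2, 4], 'C': 'dog'})
df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3], 'C': 'dog'})
df3 = pd.DataFrame({'A': [2, 1], 'B': [5, 1], 'C': 'dog'})

# The numeric columns sum correctly, but '+' on object (string)
# columns concatenates: 'dog' + 'dog' + 'dog' -> 'dogdogdog'
res = df1 + df2 + df3
print(res)
#    A   B          C
# 0  8  10  dogdogdog
# 1  2   8  dogdogdog
```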

2 Answers


One possible solution: sum the numeric columns, and for string columns join the unique values per group, using GroupBy.agg after concatenating the list of DataFrames:

import numpy as np
import pandas as pd

f = lambda x: x.sum() if np.issubdtype(x.dtype, np.number) else ','.join(x.unique())
df = pd.concat([df1, df2, df3], keys=range(3)).groupby(level=1).agg(f)
print(df)
   A   B    C
0  8  10  dog
1  2   8  dog

If different values like cat and dog are possible:

df1 = pd.DataFrame({'A': [5, 0], 'B': [2, 4], 'C': 'dog'})
df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3], 'C': 'dog'})
df3 = pd.DataFrame({'A': [2, 1], 'B': [5, 1], 'C': ['cat','dog']})


f = lambda x: x.sum() if np.issubdtype(x.dtype, np.number) else ','.join(x.unique())
df = pd.concat([df1, df2, df3], keys=range(3)).groupby(level=1).agg(f)
print(df)
   A   B        C
0  8  10  dog,cat
1  2   8      dog

If you need lists:

f = lambda x: x.sum() if np.issubdtype(x.dtype, np.number) else x.unique().tolist()
df = pd.concat([df1, df2, df3], keys=range(3)).groupby(level=1).agg(f)
print(df)
   A   B           C
0  8  10  [dog, cat]
1  2   8       [dog]

And to get lists only where the non-numeric values differ, and scalars where they agree, use a custom function:

def f(x):
    if np.issubdtype(x.dtype, np.number):
        return x.sum()
    else:
        u = x.unique().tolist()
        if len(u) == 1:
            return u[0]
        else:
            return u

df = pd.concat([df1, df2, df3], keys=range(3)).groupby(level=1).agg(f)
print(df)
   A   B           C
0  8  10  [dog, cat]
1  2   8         dog

4 Comments

I applied it to my real dataframes, and from 3 dataframes of 5 rows × 67 columns I get one with 5 rows × 700 columns. Why does this happen?
@Qubix - Are the column names different in each DataFrame?
You are right, they are. So can I assume that the last block of code above sums the columns that are common and leaves the ones that are not common with the value they had in their original dataframe?
@Qubix - Hmmm, concat joins them together, but if the columns differ there is no alignment of data (each of df1, df2, df3 needs the same column name for concat to produce one combined column filled from all of them). So the best approach here is to normalize and unify the column names so they match across df1, df2, df3.
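
If unifying the names is not feasible, one sketch (with hypothetical, partially overlapping frames) is to restrict each DataFrame to the shared columns before concatenating, so concat aligns one column per name instead of producing the union of all columns:

```python
import numpy as np
import pandas as pd

# Hypothetical frames whose columns only partially overlap
df1 = pd.DataFrame({'A': [5, 0], 'B': [2, 4], 'C': 'dog'})
df2 = pd.DataFrame({'A': [1, 1], 'D': [9, 9], 'C': 'dog'})

# Keep only the columns present in every frame
common = df1.columns.intersection(df2.columns)  # Index(['A', 'C'])

f = lambda x: x.sum() if np.issubdtype(x.dtype, np.number) else ','.join(x.unique())
df = pd.concat([df1[common], df2[common]], keys=range(2)).groupby(level=1).agg(f)
print(df)
#    A    C
# 0  6  dog
# 1  1  dog
```

Columns that exist in only one frame are dropped here; whether that is acceptable depends on the data.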

You can do it as follows:

df = df3.copy()
df[['A','B']] = df1[['A','B']]+df2[['A','B']]+df3[['A','B']]

which gives the following output:

    A   B   C
0   8   10  dog
1   2   8   dog
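
If you would rather not hard-code `['A', 'B']`, a small variation (assuming the numeric columns are the same in every frame) is to pick them with `select_dtypes`:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [5, 0], 'B': [2, 4], 'C': 'dog'})
df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3], 'C': 'dog'})
df3 = pd.DataFrame({'A': [2, 1], 'B': [5, 1], 'C': 'dog'})

# Detect the numeric columns instead of hard-coding ['A', 'B']
num = df1.select_dtypes('number').columns

df = df3.copy()  # keeps the non-numeric 'C' column as-is
df[num] = df1[num] + df2[num] + df3[num]
print(df)
#    A   B    C
# 0  8  10  dog
# 1  2   8  dog
```

Note this keeps 'C' from df3 unchanged, so it relies on the non-numeric columns being identical across the frames.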

