6

I have a sort of 'master' dataframe to which I'd like to append only the matching columns from another dataframe.

df:
  A  B  C
  1  2  3

df_to_append:
  A  B  C  D  E
  6  7  8  9  0

The problem is that when I use df.append(), it also appends the unmatched columns to df.

df = df.append(df_to_append, ignore_index=True)
Out:
A  B  C   D   E
1  2  3  NaN NaN
6  7  8   9   0

But my desired output would drop columns D and E, since they are not part of the original dataframe. Perhaps I need to use pd.concat? I don't think I can use pd.merge, since I don't have anything unique to merge on.

  • Consider using pd.concat with join='inner'. Commented Apr 5, 2019 at 17:29
  • Why not use the list of column names from the "master" to filter the other before doing the append? Commented Apr 5, 2019 at 17:29

4 Answers

9

Using pd.concat with join='inner':

pd.concat([df,df_to_append],join='inner')
Out[162]: 
   A  B  C
0  1  2  3
0  6  7  8
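
If a continuous index is wanted instead of the duplicated 0 above, concat also accepts ignore_index, which should give:

pd.concat([df, df_to_append], join='inner', ignore_index=True)
Out:
   A  B  C
0  1  2  3
1  6  7  8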


3

Just select the columns common to both dfs:

df.append(df_to_append[df.columns], ignore_index=True)
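
Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; on newer versions the same column selection can be done with concat:

# pandas >= 2.0: DataFrame.append is gone, pd.concat is the replacement
pd.concat([df, df_to_append[df.columns]], ignore_index=True)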


3

The simplest way would be to get the list of columns common to both dataframes using df.columns, but if you don't know that all of the original columns are included in df_to_append, then you need to find the intersection of the two sets:

cols = list(set(df.columns) & set(df_to_append.columns)) 
df.append(df_to_append[cols], ignore_index=True)
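
A minor variation, if you prefer to stay within pandas instead of converting to Python sets, is Index.intersection:

# same idea, using Index.intersection instead of set()
cols = df.columns.intersection(df_to_append.columns)
df.append(df_to_append[cols], ignore_index=True)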


0

While the OP did not explicitly say so (and did not complain), their mention of a "sort of 'master' dataframe" implies to me that the number of columns of the resulting dataframe should not change.

This is the case for BENY's and cfort's answers if and only if the column names of the dataframes to append are guaranteed to be supersets of the column names of the original dataframe.

Only Eliot K's solution ensures that the width of the original dataframe is retained, filling any missing columns in the dataframe to append with NaN. If you come across this question with that particular concern, as I did, then this is the way to go.

In the following I will summarize the three solutions and show what they do.


Prepare test data

Let's first create some dataframes, starting with the reference dataframe:

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [4, 5], 'C': [8, 6]})

It looks like this:

   A  B  C
0  1  4  8
1  2  5  6

Now let's create a set of dataframes to append:

  • df_union: has exactly the same columns as df
  • df_subset: has some of the columns of df, but not all
  • df_intersection: has some, but not all, columns of df, and some additional ones
  • df_disjoint: has none of the columns of df
  • df_superset: has all of the columns of df and some additional columns
df_union = df.copy()
df_subset = pd.DataFrame({'A': [1, 2], 'B': [4, 5]})
df_intersection = pd.DataFrame({'C': [8, 6], 'D': [1, 2], 'E': [4, 5]})
df_disjoint = pd.DataFrame({'D': [1, 2], 'E': [4, 5], 'F': [8, 6]})
df_superset = df.merge(df_disjoint, left_index=True, right_index=True)

We end up with the following column names per dataframe:

Dataframe        A    B    C    D    E    F
df               YES  YES  YES
df_union         YES  YES  YES
df_subset        YES  YES
df_intersection            YES  YES  YES
df_disjoint                     YES  YES  YES
df_superset      YES  YES  YES  YES  YES  YES

Let's now put the test data to use.


Inner join (BENY)

pd.concat([df, df_union], join='inner')
pd.concat([df, df_subset], join='inner')
pd.concat([df, df_intersection], join='inner')
pd.concat([df, df_disjoint], join='inner')
pd.concat([df, df_superset], join='inner')

Using this strategy, we cannot get more columns than were in the original dataframe, but we can get fewer (including none at all).

When concatenating each of the test dataframes to df, we get the following columns for df_out:

Dataframe        A    B    C
df_union         YES  YES  YES
df_subset        YES  YES
df_intersection            YES
df_disjoint
df_superset      YES  YES  YES
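
For example, the df_intersection case only shares column C with df, so the result shrinks to a single column; the output should look roughly like this:

pd.concat([df, df_intersection], join='inner')
Out:
   C
0  8
1  6
0  8
1  6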

Conclusion:

  • Prevents expansion of the reference dataframe; any additional columns in the dataframe to append are dropped
  • Does not prevent shrinkage of the reference dataframe
  • Good solution if dataframe shrinkage is acceptable or if it is guaranteed that dataframes to append have at least all of the columns in the reference dataframe

Appending only the columns in the reference dataframe (cfort)

df.append(df_union[df.columns], ignore_index=True)
df.append(df_subset[df.columns], ignore_index=True)
df.append(df_intersection[df.columns], ignore_index=True)
df.append(df_disjoint[df.columns], ignore_index=True)
df.append(df_superset[df.columns], ignore_index=True)

This solution is similar to the inner join from above, with one important distinction: The operation will raise a KeyError if the dataframe to append does not contain all of the columns in the reference dataframe.

Appending the following dataframes works, and the df_out dataframe will contain all of the columns in the reference dataframe:

Dataframe        A    B    C
df_union         YES  YES  YES
df_superset      YES  YES  YES

For the following dataframes, a KeyError is raised:

Dataframe        Outcome
df_subset        KeyError
df_intersection  KeyError
df_disjoint      KeyError
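
If that is the desired behavior, the KeyError can be caught explicitly; a minimal sketch using the test frames from above:

# only append if the other frame provides every reference column
try:
    df_out = df.append(df_subset[df.columns], ignore_index=True)
except KeyError:
    df_out = df  # df_subset is missing column 'C', so nothing is appended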

Conclusion:

  • Prevents expansion of the reference dataframe; any additional columns in the dataframe to append are dropped
  • Prevents shrinkage of the reference dataframe by raising a KeyError
  • Good solution if dataframe resizing is not acceptable; KeyError can be used to handle dataframes with missing columns relative to the reference dataframe

Appending only the columns present in both dataframes (Eliot K)

df.append(df_union[list(set(df.columns) & set(df_union.columns))], ignore_index=True)
df.append(df_subset[list(set(df.columns) & set(df_subset.columns))], ignore_index=True)
df.append(df_intersection[list(set(df.columns) & set(df_intersection.columns))], ignore_index=True)
df.append(df_disjoint[list(set(df.columns) & set(df_disjoint.columns))], ignore_index=True)
df.append(df_superset[list(set(df.columns) & set(df_superset.columns))], ignore_index=True)

This solution uses the same technical approach as the previous one, but instead of selecting from the dataframe to append all of the columns of the reference dataframe (which may not be available), only those columns are selected that are available in both dataframes.

In this way, the solution is similar to the inner join, but because of .append()'s behavior, the resulting dataframe cannot shrink. Instead, missing data is filled in with NaNs (possibly even the entire row).

Let's see how that looks for df_out:

Dataframe        A    B    C
df_union         YES  YES  YES
df_subset        YES  YES  NaN
df_intersection  NaN  NaN  YES
df_disjoint      NaN  NaN  NaN
df_superset      YES  YES  YES
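
On pandas versions without DataFrame.append (2.0 and later), the same fill-with-NaN behavior can be reproduced by reindexing the frame to append onto the reference columns before concatenating; a sketch:

# missing reference columns become all-NaN, extra columns are dropped
pd.concat([df, df_intersection.reindex(columns=df.columns)], ignore_index=True)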

Conclusion:

  • Prevents expansion of the reference dataframe; any additional columns in the dataframe to append are dropped
  • Prevents shrinkage of the reference dataframe by filling in NaNs in missing columns of the appended dataframe
  • Good solution if filling in missing data is acceptable and dataframe resizing is not

Alternative: Selecting reference columns after appending

A solution with the same behavior as the previous one is to append the dataframe and then select the reference dataframe's columns afterwards:

df.append(df_to_append, ignore_index=True)[df.columns]

This solution is a bit easier to read and parse, but may be less performant on larger operations.
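
The pandas >= 2.0 equivalent would be the same selection applied after a concat:

pd.concat([df, df_to_append], ignore_index=True)[df.columns]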
