6

I have a sort of 'master' dataframe to which I'd like to append only the matching columns from another dataframe.

df:
  A  B  C
  1  2  3

df_to_append:
  A  B  C  D  E
  6  7  8  9  0

The problem is that when I use df.append(), it also appends the unmatched columns to df.

df = df.append(df_to_append, ignore_index=True)
Out:
A  B  C   D   E
1  2  3  NaN NaN
6  7  8   9   0

But my desired output would drop columns D and E, since they are not part of the original dataframe. Perhaps I need to use pd.concat? I don't think I can use pd.merge, since I don't have anything unique to merge on.

  • Consider using pd.concat with join='inner'. Commented Apr 5, 2019 at 17:29
  • Why not use the list of column names from the "master" to filter the other before doing the append? Commented Apr 5, 2019 at 17:29

4 Answers

9

Using pd.concat with join='inner':

pd.concat([df,df_to_append],join='inner')
Out[162]: 
   A  B  C
0  1  2  3
0  6  7  8
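
If a continuous index is wanted instead of the duplicated 0 above, concat also accepts ignore_index, which should give:

pd.concat([df, df_to_append], join='inner', ignore_index=True)
Out:
   A  B  C
0  1  2  3
1  6  7  8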


3

Just select the columns common to both dfs:

df.append(df_to_append[df.columns], ignore_index=True)
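
Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; on newer versions the same column selection can be done with concat:

# pandas >= 2.0: DataFrame.append is gone, pd.concat is the replacement
pd.concat([df, df_to_append[df.columns]], ignore_index=True)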


3

The simplest way would be to get the list of columns common to both dataframes using df.columns, but if you don't know that all of the original columns are included in df_to_append, then you need to find the intersection of the two sets:

cols = list(set(df.columns) & set(df_to_append.columns)) 
df.append(df_to_append[cols], ignore_index=True)
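
A minor variation, if you prefer to stay within pandas instead of converting to Python sets, is Index.intersection:

# same idea, using Index.intersection instead of set()
cols = df.columns.intersection(df_to_append.columns)
df.append(df_to_append[cols], ignore_index=True)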


0

While the OP did not explicitly say so (and did not complain), their mention of a "sort of 'master' dataframe" implies to me that the number of columns of the resulting dataframe should not change.

This is the case for BENY's and cfort's answers if and only if the column names of the dataframes to append are guaranteed to be supersets of the column names of the original dataframe.

Only Eliot K's solution ensures that the width of the original dataframe is retained, filling any missing columns in the dataframe to append with NaN. If you come across this question with that particular concern, as I did, then this is the way to go.

In the following I will summarize the three solutions and show what they do.


Prepare test data

Let's first create some dataframes, starting with the reference dataframe:

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [4, 5], 'C': [8, 6]})

It looks like this:

   A  B  C
0  1  4  8
1  2  5  6

Now let's create a set of dataframes to append:

  • df_union: has exactly the same columns as df
  • df_subset: has some of the columns of df, but not all
  • df_intersection: has some, but not all, columns of df, and some additional ones
  • df_disjoint: has none of the columns of df
  • df_superset: has all of the columns of df and some additional columns
df_union = df.copy()
df_subset = pd.DataFrame({'A': [1, 2], 'B': [4, 5]})
df_intersection = pd.DataFrame({'C': [8, 6], 'D': [1, 2], 'E': [4, 5]})
df_disjoint = pd.DataFrame({'D': [1, 2], 'E': [4, 5], 'F': [8, 6]})
df_superset = df.merge(df_disjoint, left_index=True, right_index=True)

We end up with the following column names per dataframe:

Dataframe        A    B    C    D    E    F
df               YES  YES  YES
df_union         YES  YES  YES
df_subset        YES  YES
df_intersection            YES  YES  YES
df_disjoint                     YES  YES  YES
df_superset      YES  YES  YES  YES  YES  YES

Let's now put the test data to use.


Inner join (BENY)

pd.concat([df, df_union], join='inner')
pd.concat([df, df_subset], join='inner')
pd.concat([df, df_intersection], join='inner')
pd.concat([df, df_disjoint], join='inner')
pd.concat([df, df_superset], join='inner')

Using this strategy, we cannot get more columns than were in the original dataframe, but we can get fewer (including none at all).

When concatenating each of the test dataframes to df, we get the following columns for df_out:

Dataframe        A    B    C
df_union         YES  YES  YES
df_subset        YES  YES
df_intersection            YES
df_disjoint
df_superset      YES  YES  YES
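
For example, the df_intersection case only shares column C with df, so the result shrinks to a single column; the output should look roughly like this:

pd.concat([df, df_intersection], join='inner')
Out:
   C
0  8
1  6
0  8
1  6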

Conclusion:

  • Prevents expansion of the reference dataframe; any additional columns in the dataframe to append are dropped
  • Does not prevent shrinkage of the reference dataframe
  • Good solution if dataframe shrinkage is acceptable or if it is guaranteed that dataframes to append have at least all of the columns in the reference dataframe

Appending only the columns in the reference dataframe (cfort)

df.append(df_union[df.columns], ignore_index=True)
df.append(df_subset[df.columns], ignore_index=True)
df.append(df_intersection[df.columns], ignore_index=True)
df.append(df_disjoint[df.columns], ignore_index=True)
df.append(df_superset[df.columns], ignore_index=True)

This solution is similar to the inner join from above, with one important distinction: The operation will raise a KeyError if the dataframe to append does not contain all of the columns in the reference dataframe.

Appending the following dataframes works, and the df_out dataframe will contain all of the columns in the reference dataframe:

Dataframe        A    B    C
df_union         YES  YES  YES
df_superset      YES  YES  YES

For the following dataframes, a KeyError is raised:

Dataframe        Outcome
df_subset        KeyError
df_intersection  KeyError
df_disjoint      KeyError
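
If that is the desired behavior, the KeyError can be caught explicitly; a minimal sketch using the test frames from above:

# only append if the other frame provides every reference column
try:
    df_out = df.append(df_subset[df.columns], ignore_index=True)
except KeyError:
    df_out = df  # df_subset is missing column 'C', so nothing is appended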

Conclusion:

  • Prevents expansion of the reference dataframe; any additional columns in the dataframe to append are dropped
  • Prevents shrinkage of the reference dataframe by raising a KeyError
  • Good solution if dataframe resizing is not acceptable; KeyError can be used to handle dataframes with missing columns relative to the reference dataframe

Appending only the columns present in both dataframes (Eliot K)

df.append(df_union[list(set(df.columns) & set(df_union.columns))], ignore_index=True)
df.append(df_subset[list(set(df.columns) & set(df_subset.columns))], ignore_index=True)
df.append(df_intersection[list(set(df.columns) & set(df_intersection.columns))], ignore_index=True)
df.append(df_disjoint[list(set(df.columns) & set(df_disjoint.columns))], ignore_index=True)
df.append(df_superset[list(set(df.columns) & set(df_superset.columns))], ignore_index=True)

This solution uses the same technical approach as the previous one, but instead of selecting from the dataframe to append all of the columns of the reference dataframe (which may not be available), only those columns are selected that are available in both dataframes.

In this way, the solution is similar to the inner join, but because of .append()'s behavior, the resulting dataframe cannot shrink. Instead, missing data is filled in with NaNs (possibly even the entire row).

Let's see how that looks for df_out:

Dataframe        A    B    C
df_union         YES  YES  YES
df_subset        YES  YES  NaN
df_intersection  NaN  NaN  YES
df_disjoint      NaN  NaN  NaN
df_superset      YES  YES  YES
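
On pandas versions without DataFrame.append (2.0 and later), the same fill-with-NaN behavior can be reproduced by reindexing the frame to append onto the reference columns before concatenating; a sketch:

# missing reference columns become all-NaN, extra columns are dropped
pd.concat([df, df_intersection.reindex(columns=df.columns)], ignore_index=True)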

Conclusion:

  • Prevents expansion of the reference dataframe; any additional columns in the dataframe to append are dropped
  • Prevents shrinkage of the reference dataframe by filling in NaNs in missing columns of the appended dataframe
  • Good solution if filling in missing data is acceptable and dataframe resizing is not

Alternative: Selecting reference columns after appending

A solution with the same behavior as the previous one is to append the dataframe and then select the reference dataframe's columns afterwards:

df.append(df_to_append, ignore_index=True)[df.columns]

This solution is a bit easier to read and parse, but may be less performant on larger operations.
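
The pandas >= 2.0 equivalent would be the same selection applied after a concat:

pd.concat([df, df_to_append], ignore_index=True)[df.columns]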
