While the OP did not explicitly say so (and did not complain), their mention of a "sort of 'master' dataframe" implies to me that the number of columns of the resulting dataframe should not change.
This holds for BENY's and cfort's answers only if the column names of the dataframes to append are guaranteed to be supersets of the column names of the original dataframe.
Only Eliot K's solution ensures that the width of the original dataframe is retained, filling up any missing columns in the dataframe to append with NaNs. If you come across the question with this particular concern, like I did, then this is the way to go.
In the following I will summarize the three solutions and show what they do.
Prepare test data
Let's first create some dataframes, starting with the reference dataframe:
import pandas as pd
df = pd.DataFrame({'A': [1, 2], 'B': [4, 5], 'C': [8, 6]})
It looks like this:
A B C
0 1 4 8
1 2 5 6
Now let's create a set of dataframes to append:
- df_union: has exactly the same columns as df
- df_subset: has some of the columns of df, but not all
- df_intersection: has some, but not all, columns of df, and some additional ones
- df_disjoint: has none of the columns of df
- df_superset: has all of the columns of df and some additional columns
df_union = df.copy()
df_subset = pd.DataFrame({'A': [1, 2], 'B': [4, 5]})
df_intersection = pd.DataFrame({'C': [8, 6], 'D': [1, 2], 'E': [4, 5]})
df_disjoint = pd.DataFrame({'D': [1, 2], 'E': [4, 5], 'F': [8, 6]})
df_superset = df.merge(df_disjoint, left_index=True, right_index=True)
We end up with the following column names per dataframe:
| Dataframe | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| df | YES | YES | YES | | | |
| df_union | YES | YES | YES | | | |
| df_subset | YES | YES | | | | |
| df_intersection | | | YES | YES | YES | |
| df_disjoint | | | | YES | YES | YES |
| df_superset | YES | YES | YES | YES | YES | YES |
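To double-check this overview, a minimal sketch (reusing the dataframes defined above) could simply print each dataframe's columns:
frames = {'df': df, 'df_union': df_union, 'df_subset': df_subset,
          'df_intersection': df_intersection, 'df_disjoint': df_disjoint,
          'df_superset': df_superset}
# print the column names of each test dataframe to verify the table above
for name, frame in frames.items():
    print(f'{name}: {list(frame.columns)}')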
Let's now use the test data
Inner join (BENY)
df_out = pd.concat([df, df_union], join='inner')
df_out = pd.concat([df, df_subset], join='inner')
df_out = pd.concat([df, df_intersection], join='inner')
df_out = pd.concat([df, df_disjoint], join='inner')
df_out = pd.concat([df, df_superset], join='inner')
Using this strategy, we cannot get more columns than what was in the original dataframe, but we can get fewer (including none at all).
When concatenating each of the test dataframes to df, we get the following columns for df_out:
| Dataframe | A | B | C |
|---|---|---|---|
| df_union | YES | YES | YES |
| df_subset | YES | YES | |
| df_intersection | | | YES |
| df_disjoint | | | |
| df_superset | YES | YES | YES |
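For example, concatenating df_subset this way drops column C entirely (a quick sketch based on the test data above):
df_out = pd.concat([df, df_subset], join='inner')
print(df_out)
# Expected output:
#    A  B
# 0  1  4
# 1  2  5
# 0  1  4
# 1  2  5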
Conclusion:
- Prevents expansion of the reference dataframe; any additional columns in the dataframe to append are dropped
- Does not prevent shrinkage of the reference dataframe
- Good solution if dataframe shrinkage is acceptable, or if it is guaranteed that the dataframes to append have at least all of the columns of the reference dataframe
Appending only columns in the reference dataframe (cfort)
df_out = df.append(df_union[df.columns], ignore_index=True)
df_out = df.append(df_subset[df.columns], ignore_index=True)
df_out = df.append(df_intersection[df.columns], ignore_index=True)
df_out = df.append(df_disjoint[df.columns], ignore_index=True)
df_out = df.append(df_superset[df.columns], ignore_index=True)
This solution is similar to the inner join from above, with one important distinction: The operation will raise a KeyError if the dataframe to append does not contain all of the columns in the reference dataframe.
Appending the following dataframes works, and the df_out dataframe will contain all of the columns in the reference dataframe:
| Dataframe | A | B | C |
|---|---|---|---|
| df_union | YES | YES | YES |
| df_superset | YES | YES | YES |
For the following dataframes, a KeyError is raised:
| Dataframe | Outcome |
|---|---|
| df_subset | KeyError |
| df_intersection | KeyError |
| df_disjoint | KeyError |
Conclusion:
- Prevents expansion of the reference dataframe; any additional columns in the dataframe to append are dropped
- Prevents shrinkage of the reference dataframe by raising a KeyError
- Good solution if dataframe resizing is not acceptable; the KeyError can be used to handle dataframes with missing columns relative to the reference dataframe (see the sketch below)
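As a minimal sketch of such error handling (assuming a pandas version where DataFrame.append is still available, i.e. before 2.0, and reusing the test data from above):
for name, df_to_append in [('df_union', df_union), ('df_subset', df_subset),
                           ('df_intersection', df_intersection),
                           ('df_disjoint', df_disjoint), ('df_superset', df_superset)]:
    try:
        # selecting df.columns raises a KeyError if any column is missing
        df_out = df.append(df_to_append[df.columns], ignore_index=True)
        print(f'{name}: appended')
    except KeyError:
        print(f'{name}: skipped, missing columns relative to df')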
Appending only columns in both dataframes (Eliot K)
df_out = df.append(df_union[list(set(df.columns) & set(df_union.columns))], ignore_index=True)
df_out = df.append(df_subset[list(set(df.columns) & set(df_subset.columns))], ignore_index=True)
df_out = df.append(df_intersection[list(set(df.columns) & set(df_intersection.columns))], ignore_index=True)
df_out = df.append(df_disjoint[list(set(df.columns) & set(df_disjoint.columns))], ignore_index=True)
df_out = df.append(df_superset[list(set(df.columns) & set(df_superset.columns))], ignore_index=True)
This solution uses the same technical approach as the previous one, but instead of selecting from the dataframe to append all of the columns of the reference dataframe (which may not be available), only those columns are selected that are available in both dataframes.
In this way, the solution is similar to the inner join, but because of .append()'s behavior, the resulting dataframe cannot shrink. Instead, missing data is filled in with NaNs (possibly even the entire row).
Let's see what that looks like for df_out:
| Dataframe | A | B | C |
|---|---|---|---|
| df_union | YES | YES | YES |
| df_subset | YES | YES | NaN |
| df_intersection | NaN | NaN | YES |
| df_disjoint | NaN | NaN | NaN |
| df_superset | YES | YES | YES |
Conclusion:
- Prevents expansion of the reference dataframe; any additional columns in the dataframe to append are dropped
- Prevents shrinkage of the reference dataframe by filling in NaNs in missing columns of the appended dataframe
- Good solution if filling in missing data is acceptable and dataframe resizing is not
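Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. A rough equivalent with the same behavior, using pd.concat instead (a sketch, not Eliot K's original code, reusing the pd import and test data from above):
def append_keep_reference_columns(reference, other):
    # keep only the columns that exist in both dataframes ...
    common = [col for col in reference.columns if col in other.columns]
    # ... and let pd.concat fill the remaining reference columns with NaN
    return pd.concat([reference, other[common]], ignore_index=True)

df_out = append_keep_reference_columns(df, df_intersection)  # columns A, B, C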
Alternative: Selecting reference columns after appending
A solution with the same behavior as the previous one would be to append the dataframe and then select the reference dataframe's columns afterwards:
df.append(df_to_append, ignore_index=True)[df.columns]
This solution is a bit easier to read and parse, but may be less performant on larger operations than pd.concat with join='inner'.
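As a quick illustration with the test data from above (again assuming a pandas version that still provides DataFrame.append):
df_out = df.append(df_intersection, ignore_index=True)[df.columns]
print(df_out)
# Expected output (A and B become float because of the NaNs):
#      A    B  C
# 0  1.0  4.0  8
# 1  2.0  5.0  6
# 2  NaN  NaN  8
# 3  NaN  NaN  6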