Match columns and append to data frame, Python 3.6

Question

I have around 50 Excel files that I want to import and merge into a single dataframe. Some files have 3 columns and some have 4, and every file has different columns in a different order.

Total distinct columns across all the files: 5, i.e. col1, col2, col3, col4, col5

I know how to import the files, but I am facing an issue while appending.

Script:

dfAll = pd.DataFrame(columns=['col1', 'col2', 'col3', 'col4', 'col5'])
df = pd.read_excel('FilePath', sheetname='data1')  # contains 3 columns, i.e. col1, col2, col5
columnsOFdf = df.columns
dfAll[columnsOFdf] = dfAll.append(df)

but it gives the error "ValueError: Columns must be same length as key"

I want to append df['col1','col2','col5'] data to dfAll['col1','col2','col5']

Please help on this issue.

You are trying to append a dataframe of size 3 to a dataframe of size 5; that is not going to work with untyped datasets
– Sentinel, Commented Sep 6, 2017 at 14:04
I'm not well versed in using dataframes in Python; you will most likely need to make a new dataframe including only the columns you want, then append the other dataframe
– Sentinel, Commented Sep 6, 2017 at 14:11

Alexander · Accepted Answer · 2017-09-06 14:21:14Z

9

Concatenation will match your columns

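import pandas as pd

dfs = []
files = [...]
for file_name in files:
    # sheet_name is the current keyword; older pandas versions spell it sheetname
    dfs.append(pd.read_excel(file_name, sheet_name='data1'))
# concat aligns the frames on column names and fills the gaps with NaN
df = pd.concat(dfs)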

>>> import numpy as np
>>> df1 = pd.DataFrame(np.random.randn(3, 3), columns=list('ABC'))
>>> df2 = pd.DataFrame(np.random.randn(3, 3), columns=list('BCD'))
>>> pd.concat([df1, df2])
          A         B         C         D
0 -2.329280  0.644155 -0.835137       NaN
1  0.666496 -1.299048  0.111579       NaN
2  1.855494 -0.085850 -0.541890       NaN
0       NaN -1.131514  1.023610 -0.514384
1       NaN  0.670063  1.403143 -0.978611
2       NaN -0.314741 -0.727200 -0.620511

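In addition, each time you append a dataframe to an existing one, it returns a copy rather than modifying the original in place. Appending in a loop therefore recopies all the accumulated data on every iteration, which seriously degrades performance and is referred to as a quadratic copy. You are best off creating a list of all dataframes and then concatenating them once.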
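
As a rough sketch of the list-then-concatenate approach applied to the question's five columns: the helper name `combine_frames` and the sample data are made up for illustration, and the in-memory frames stand in for the results of `pd.read_excel`. The final `reindex` is an optional extra step, not part of the answer above, that fixes the column order and adds any column that none of the files contained.

```python
import pandas as pd


def combine_frames(frames, columns):
    """Concatenate frames with differing columns into one frame with a fixed column order."""
    # concat aligns on column labels; a column missing from a frame is filled with NaN.
    combined = pd.concat(frames, ignore_index=True, sort=False)
    # reindex orders the columns as requested and adds any column no frame had.
    return combined.reindex(columns=columns)


# Hypothetical stand-ins for frames loaded from two Excel files.
frame_a = pd.DataFrame({"col1": [1, 2], "col2": [3, 4], "col5": [5, 6]})
frame_b = pd.DataFrame({"col4": [7], "col2": [8], "col1": [9]})

all_columns = ["col1", "col2", "col3", "col4", "col5"]
df_all = combine_frames([frame_a, frame_b], all_columns)
print(df_all)
```

Because the frames are collected first and concatenated once, the accumulated data is copied a single time instead of once per file.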

edited Sep 6, 2017 at 14:21

answered Sep 6, 2017 at 14:14

Alexander


Jundiaius · Accepted Answer · 2017-09-06 14:10:08Z

2

One solution is to add empty columns to the dataframes you load from Excel files:

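columns = ['col1', 'col2', 'col3', 'col4', 'col5']
dfAll = pd.DataFrame(columns=columns)
df = pd.read_excel('FilePath', sheet_name='data1')  # contains 3 columns, i.e. col1, col2, col5
columnsOFdf = df.columns
for column in columns:
    if column not in columnsOFdf:
        # fill each missing column with empty strings
        df[column] = [""] * df.shape[0]
# append returns a new dataframe, so assign the result back
# (DataFrame.append was removed in pandas 2.0; use pd.concat([dfAll, df]) there)
dfAll = dfAll.append(df)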

answered Sep 6, 2017 at 14:10

Jundiaius


1 Comment

Sentinel Over a year ago

This is a good workaround if you need to keep the extra columns with the new data. Maybe a placeholder instead of just an empty space would be a good idea. But if the data isn't required, I'd suggest creating a new dataframe
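
One way to sketch the placeholder idea from the comment above is `DataFrame.reindex`, which adds the missing columns and puts them in a fixed order in a single call. This is an alternative to the loop in the answer, not the answerer's own method; the placeholder string `"MISSING"` and the sample data are assumptions chosen for illustration.

```python
import pandas as pd

all_columns = ["col1", "col2", "col3", "col4", "col5"]

# Hypothetical stand-in for the data loaded from one Excel file.
df = pd.DataFrame({"col5": ["x"], "col1": ["a"], "col2": ["b"]})

# reindex adds the missing columns and reorders the existing ones;
# fill_value is applied only to the newly created cells.
aligned = df.reindex(columns=all_columns, fill_value="MISSING")
print(aligned)
```

Leaving out `fill_value` fills the new columns with NaN instead, which is usually easier to detect later with `isna()`.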

Jorge Alberto Rueda Flores · Accepted Answer · 2017-09-06 14:11:52Z

2

Try this:

[dfAll.append(i) for i in df]

I hope this helps.

answered Sep 6, 2017 at 14:11

Jorge Alberto Rueda Flores


2 Comments

question.it Over a year ago

Error: TypeError: cannot concatenate a non-NDFrame object

question.it Over a year ago

type(dfAll) >> <class 'pandas.core.frame.DataFrame'> , type(df) >> <class 'pandas.core.frame.DataFrame'>
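
The TypeError reported above is consistent with how iteration over a dataframe works: `for i in df` yields the column labels, so each `i` handed to `append` is a string rather than a dataframe. A small sketch with made-up data shows what the loop actually iterates over:

```python
import pandas as pd

# Hypothetical frame with the three columns mentioned in the question.
df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4], "col5": [5, 6]})

# Iterating over a DataFrame yields its column labels, not rows or sub-frames.
items = [item for item in df]
print(items)
print(type(items[0]))
```

To append many dataframes, the loop would need to run over a list of dataframes, not over a single dataframe.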
