I have around 50 Excel files that I want to import into DataFrames and merge into a single DataFrame. However, some files have 3 columns and some have 4, and each file has different columns in a different order.

Total distinct columns across all the files: 5, i.e. col1, col2, col3, col4, col5.

I know how to import the files, but I am running into an issue while appending.

Script:

dfAll = pd.DataFrame(columns=['col1', 'col2', 'col3', 'col4', 'col5'])
df= pd.read_excel('FilePath', sheetname='data1') # contains 3 columns i.e col1, col2, col5
columnsOFdf = df.columns
dfAll[columnsOFdf] = dfAll.append(df)

but it gives the error "ValueError: Columns must be same length as key".

I want to append the data in df[['col1', 'col2', 'col5']] to the corresponding columns of dfAll.

Please help with this issue.

  • You are trying to append a dataframe with 3 columns to a dataframe with 5 columns; that is not going to work with untyped datasets. Commented Sep 6, 2017 at 14:04
  • @Sentinel, thanks for the response. Is there an alternative solution? Commented Sep 6, 2017 at 14:07
  • I'm not well versed in using Python DataFrames, but you will most likely need to make a new dataframe containing only the columns you want, then append the other dataframe. Commented Sep 6, 2017 at 14:11

3 Answers

Concatenation will align your columns by name:

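import pandas as pd

dfs = []
files = [...]
for file_name in files:
    # current pandas spells the keyword sheet_name (older versions used sheetname)
    dfs.append(pd.read_excel(file_name, sheet_name='data1'))
# concat aligns columns by name and fills columns missing from a file with NaN
df = pd.concat(dfs)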

For example, with two frames that share only some of their columns:

>>> import numpy as np
>>> df1 = pd.DataFrame(np.random.randn(3, 3), columns=list('ABC'))
>>> df2 = pd.DataFrame(np.random.randn(3, 3), columns=list('BCD'))
>>> pd.concat([df1, df2])
          A         B         C         D
0 -2.329280  0.644155 -0.835137       NaN
1  0.666496 -1.299048  0.111579       NaN
2  1.855494 -0.085850 -0.541890       NaN
0       NaN -1.131514  1.023610 -0.514384
1       NaN  0.670063  1.403143 -0.978611
2       NaN -0.314741 -0.727200 -0.620511

In addition, each time you append a dataframe to an existing one, it returns a copy of all the data accumulated so far. This seriously degrades performance as the number of files grows and is referred to as a quadratic copy. You are best off creating a list of all dataframes and then concatenating them once at the end.
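
As a rough sketch of the list-then-concatenate approach applied to the question's five columns: the small in-memory frames below are hypothetical stand-ins for the frames returned by pd.read_excel, and the names frames, all_columns and combined are made up for illustration. The final reindex is optional and assumes you want a fixed column order.

```python
import pandas as pd

# Hypothetical stand-ins for frames loaded from the Excel files
frames = [
    pd.DataFrame({"col1": [1, 2], "col2": [3, 4], "col5": [5, 6]}),
    pd.DataFrame({"col4": [7], "col3": [8], "col2": [9], "col1": [10]}),
]

all_columns = ["col1", "col2", "col3", "col4", "col5"]

# A single concat at the end; ignore_index gives a fresh 0..n-1 row index
combined = pd.concat(frames, ignore_index=True, sort=False)

# reindex fixes the column order and adds any column that no file contained
combined = combined.reindex(columns=all_columns)
print(combined)
```

Cells for columns that a given file did not contain come out as NaN.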

One solution is to add empty columns to the dataframes you load from Excel files:

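import pandas as pd

columns = ['col1', 'col2', 'col3', 'col4', 'col5']
dfAll = pd.DataFrame(columns=columns)
df = pd.read_excel('FilePath', sheet_name='data1')  # contains 3 columns, i.e. col1, col2, col5
# Add every missing column, filled with empty strings
for column in columns:
    if column not in df.columns:
        df[column] = ""
# The result must be assigned back. DataFrame.append was removed in pandas 2.0,
# so pd.concat is used here instead.
dfAll = pd.concat([dfAll, df], ignore_index=True)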
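
As a possible variation on the loop above, assuming NaN is an acceptable placeholder for the missing cells, DataFrame.reindex can add the absent columns and fix the column order in one call. The frame below is a hypothetical stand-in for one loaded Excel file, and aligned is an illustrative name.

```python
import pandas as pd

columns = ["col1", "col2", "col3", "col4", "col5"]

# Hypothetical stand-in for a file holding only three of the five columns
df = pd.DataFrame({"col5": [1, 2], "col1": [3, 4], "col2": [5, 6]})

# reindex reorders to `columns` and fills the absent ones with NaN;
# pass fill_value="" to get empty strings instead
aligned = df.reindex(columns=columns)
print(aligned)
```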

1 Comment

This is a good workaround if you need to keep the extra columns with the new data. Maybe a placeholder instead of just an empty space would be a good idea. But if the data isn't required, I'd suggest creating a new dataframe

Try this:

[dfAll.append(i) for i in df]

I hope this helps you.

2 Comments

Error: TypeError: cannot concatenate a non-NDFrame object
type(dfAll) >> <class 'pandas.core.frame.DataFrame'> , type(df) >> <class 'pandas.core.frame.DataFrame'>
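
A likely explanation for the TypeError, sketched with a small made-up frame: iterating directly over a DataFrame yields its column labels rather than rows or sub-frames, so each i in the list comprehension is a plain string and not something that can be concatenated.

```python
import pandas as pd

# Made-up frame for illustration only
df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# Iterating over a DataFrame walks its column labels
items = [i for i in df]
print(items)
```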
