Python Merge Pandas Dataframe

Question

I am new to Python and am looking for a simple solution.

I have several .csv files with the same structure (number of columns and lines) in one folder. The path is: C:\temp

Now I want to read all these .csv files into a new dataframe, which I want to export later as a new .csv file.

up to now i have read each .csv file by hand and saved it into a pandas dataframe.

Here is an example:

df1 = pd.read_csv(r "C:\temp\df1.csv", header= None)
df2 = pd.read_csv(r "C:\temp\df2.csv", header= None)

df1

0 id Feature
1 1 12
2 2 13
3 3 14
4 4 15
5 5 16
6 7 17
7 8 15
8 9 12
9 10 13
10 11 23

Then I used .append to merge the dataframes.

df_new = df1.append(df2)

0   id  Feature
1   1   12
2   2   13
3   3   14
4   4   15
5   5   16
6   7   17
7   8   15
8   9   12
9   10  13
10  11  23
0   id  Feature
1   1   14
2   2   9
3   3   3
4   4   8
5   5   9
6   7   1
7   8   32
8   9   7
9   10  3
10  11  12

df_new.to_csv('df_new.csv', index=False)

Unfortunately this version always has the header with me, but I don't need it. So I deleted it afterwards by hand.

Isn't there a faster version? I'm thinking of a for loop that opens all existing .csv files in the path and reads them line by line into a new dataframe and at the end of the loop makes a .csv file out of it? Unfortunately I have no experience with loops.

I appreciate your help.

Erik van de Ven · Accepted Answer · 2020-10-12 18:13:20Z

1

You could do something like this:

frames = []
for i in range(0,10):
    frames.append(pd.read_csv(r "C:\temp\df%d.csv" % i))

df = pd.concat(frames)

Which should concatenate all csv files into one dataframe and which works if all filenames are structured like df1d.csv, df2d.csv etc. and you know the amount of files. If you just want to loop over all files in the folder, you could do something like this:

import glob

path = "C:\temp\*.csv"

frames = []
for fname in glob.glob(path):
    frames.append(pd.read_csv(fname))

df = pd.concat(frames)

The extended version, without a loop would be:

df1 = pd.read_csv(r "C:\temp\df1.csv")
df2 = pd.read_csv(r "C:\temp\df2.csv")
df3 = pd.read_csv(r "C:\temp\df3.csv")
frames = [df1, df2, df3]
df = pd.concat(frames)

Afterwards you could save the dataframe back to a csv file:

df.to_csv(file_name)

edited Oct 12, 2020 at 18:13

answered Oct 12, 2020 at 18:00

Erik van de Ven

5,0156 gold badges47 silver badges85 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Skruff Over a year ago

These solutions are exactly what I was looking for! Thanks a lot! Amazing Community!

thelogicalkoan · Accepted Answer · 2020-10-12 18:10:42Z

In [1]: import pandas as pd

In [2]: from io import StringIO

In [3]: df = pd.read_csv(StringIO("""0 id Feature
   ...: 1 1 12
   ...: 2 2 13
   ...: 3 3 14
   ...: 4 4 15
   ...: 5 5 16
   ...: 6 7 17
   ...: 7 8 15
   ...: 8 9 12
   ...: 9 10 13
   ...: 10 11 23"""), sep=' ')

In [4]: df1 = pd.read_csv(StringIO("""0   id  Feature
   ...: 1   1   14
   ...: 2   2   9
   ...: 3   3   3
   ...: 4   4   8
   ...: 5   5   9
   ...: 6   7   1
   ...: 7   8   32
   ...: 8   9   7
   ...: 9   10   3
   ...: 10   11   12"""), sep='   ')

In [10]: pd.concat([df, df1])
Out[10]: 
    0  id  Feature
0   1   1       12
1   2   2       13
2   3   3       14
3   4   4       15
4   5   5       16
5   6   7       17
6   7   8       15
7   8   9       12
8   9  10       13
9  10  11       23
0   1   1       14
1   2   2        9
2   3   3        3
3   4   4        8
4   5   5        9
5   6   7        1
6   7   8       32
7   8   9        7
8   9  10        3
9  10  11       12

In [11]: %timeit pd.concat([df, df1])

188 µs ± 4.86 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [14]: df.append(df1)
Out[14]: 
    0  id  Feature
0   1   1       12
1   2   2       13
2   3   3       14
3   4   4       15
4   5   5       16
5   6   7       17
6   7   8       15
7   8   9       12
8   9  10       13
9  10  11       23
0   1   1       14
1   2   2        9
2   3   3        3
3   4   4        8
4   5   5        9
5   6   7        1
6   7   8       32
7   8   9        7
8   9  10        3
9  10  11       12

In [15]: %timeit df.append(df1)
197 µs ± 4.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

With pandas version '1.1.3'

You can clearly check that pd.concat is faster than df.append(df1).

For working with loops, you can create a variable with the filenames and keep a list of dataframes from those files using for loop, something like this

filename = ['1.csv', '2.csv']

dfs = []

for name in filename:
    dfs.append(pd.read_csv(name))

new_df = pd.concat(dfs)

This is easy, efficient, cleaner and faster as well.

And then save the file to csv.

new_df.to_csv(out_filename)

These solutions are exactly what I was looking for! Thanks a lot! Amazing Community!

Collectives™ on Stack Overflow

Python Merge Pandas Dataframe

2 Answers 2

1 Comment

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related