python pandas merge multiple csv files

Question

I have around 600 CSV datasets, all with the same column names ['DateTime', 'Actual', 'Consensus', 'Previous', 'Revised']. All of them are economic indicators and all are time series.

The aim is to merge them into one CSV file, with 'DateTime' as the index.

I want the file ordered as a timeline. Say the first event in the first CSV is dated 12/18/2017 10:00:00, the first event in the second CSV is dated 12/29/2017 09:00:00, and the first event in the third CSV is dated 12/20/2017 09:00:00.

I want them indexed with the earlier events first and the newer ones after, regardless of which source CSV each came from.

I tried merging just 3 of them as an experiment. The problem is 'DateTime': it prints the three values together like this: ('12/18/2017 10:00:00', '12/29/2017 09:00:00', '12/20/2017 09:00:00'). Here is the code:

import pandas as pd


df1 = pd.read_csv("E:\Business\Economic Indicators\Consumer Price Index - Core (YoY) - European Monetary Union.csv")
df2 = pd.read_csv("E:\Business\Economic Indicators\Private loans (YoY) - European Monetary Union.csv")
df3 = pd.read_csv("E:\Business\Economic Indicators\Current Account s.a - European Monetary Union.csv")

df = pd.concat([df1, df2, df3], axis=1, join='inner')
df.set_index('DateTime', inplace=True)

print(df.head())
df.to_csv('df.csv')

can you give an example input and output?

– Tai, Commented Jan 1, 2018 at 18:37

Parfait · Accepted Answer · 2018-01-01 19:31:03Z

17

Consider using the read_csv() arguments index_col and parse_dates to create the index during import and parse it as datetime. Then run the horizontal merge you need. The code below assumes the date is in the first column of each CSV. At the end, call sort_index() on the final dataframe to sort by datetime.

df1 = pd.read_csv(r"E:\Business\Economic Indicators\Consumer Price Index - Core (YoY) - European Monetary Union.csv",
                  index_col=[0], parse_dates=[0])
df2 = pd.read_csv(r"E:\Business\Economic Indicators\Private loans (YoY) - European Monetary Union.csv",
                  index_col=[0], parse_dates=[0])
df3 = pd.read_csv(r"E:\Business\Economic Indicators\Current Account s.a - European Monetary Union.csv",
                  index_col=[0], parse_dates=[0])

finaldf = pd.concat([df1, df2, df3], axis=1, join='inner').sort_index()

For a DRY-er approach, especially across hundreds of CSV files, use a list comprehension:

import os
...
os.chdir('E:\\Business\\Economic Indicators')

dfs = [pd.read_csv(f, index_col=[0], parse_dates=[0])
        for f in os.listdir(os.getcwd()) if f.endswith('csv')]

finaldf = pd.concat(dfs, axis=1, join='inner').sort_index()
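
Note that join='inner' keeps only the timestamps that appear in every file, which can leave nothing when the indicators are released on different dates. Below is a minimal sketch of the outer-join variant. It assumes small in-memory CSVs instead of the real files (the names "CPI" and "Loans" and all values are invented for illustration) and uses the keys argument of pd.concat so that each indicator's columns stay identifiable:

```python
import io
import pandas as pd

# Hypothetical stand-ins for two of the CSV files (values are invented).
csv_a = io.StringIO(
    "DateTime,Actual,Consensus\n"
    "12/18/2017 10:00:00,1.1,1.0\n"
    "11/17/2017 10:00:00,0.9,0.9\n"
)
csv_b = io.StringIO(
    "DateTime,Actual,Consensus\n"
    "12/29/2017 09:00:00,2.9,2.8\n"
    "12/18/2017 10:00:00,2.7,2.7\n"
)

sources = {"CPI": csv_a, "Loans": csv_b}
dfs = {name: pd.read_csv(src, index_col=[0], parse_dates=[0])
       for name, src in sources.items()}

# keys= adds an outer column level naming the source of each column;
# join='outer' keeps every timestamp and fills the gaps with NaN.
finaldf = pd.concat(list(dfs.values()), axis=1, join="outer",
                    keys=list(dfs.keys())).sort_index()
print(finaldf)
```

With real files, the keys could be derived from the file names. If a single file contains repeated timestamps, the column-wise concat may fail on the duplicate index values, so those would need to be handled first.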

edited Jan 1, 2018 at 19:31

answered Jan 1, 2018 at 19:25

Parfait


2 Comments

Sayed Gouda Over a year ago

Thank you. I tried your code and the result was an empty set. I changed the join argument to 'outer' and the code works.

Parfait Over a year ago

Sounds good. I almost changed that per one of your comments. Glad to help. Happy coding and new year!

John Smith Optional · Accepted Answer · 2018-01-01 18:03:42Z

2

You're trying to build one large dataframe out of the rows of many dataframes that all have the same column names. axis should be 0 (the default), not 1. You also don't need to specify a type of join: it has no effect here, since the column names are the same for each dataframe.

df = pd.concat([df1, df2, df3])

should be enough in order to concatenate the datasets.

(see https://pandas.pydata.org/pandas-docs/stable/merging.html )

Your call to set_index to define an index using the values in the DateTime column should then work.
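
To illustrate the difference between the two axes, here is a small sketch with two made-up frames (the values are invented). It only shows the resulting shapes. The duplicated 'DateTime' column produced by axis=1 is likely what confuses set_index in the question:

```python
import pandas as pd

# Two small made-up frames with identical column names.
df1 = pd.DataFrame({"DateTime": ["12/18/2017 10:00:00", "11/17/2017 10:00:00"],
                    "Actual": [1.1, 0.9]})
df2 = pd.DataFrame({"DateTime": ["12/29/2017 09:00:00"],
                    "Actual": [2.9]})

# axis=0 (the default) appends rows; the column set stays the same.
stacked = pd.concat([df1, df2])

# axis=1 appends columns, aligned on the row index, so 'DateTime' appears twice.
side_by_side = pd.concat([df1, df2], axis=1)

print(stacked.shape)
print(side_by_side.shape)
print(list(side_by_side.columns))
```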

answered Jan 1, 2018 at 18:03

John Smith Optional


2 Comments

Sayed Gouda Over a year ago

You got me wrong, bro. These are different economic indicators with different time steps. If there are 10 files with 5 columns each, I need 50 columns to keep them identifiable, so I can't simply blend them. I need a new dataset containing all the readings from the original sets: if event 1 happens at time x and event 2 at time y, and x and y are the same time, put them in the same row, each in its own column; if the times differ, each one goes in its own row and column. Then index them by time (earlier first and the newer at the end), regardless of the source dataset.

John Smith Optional Over a year ago

Then you need to use set_index('DateTime', inplace=True) on the dataframes df1, df2, df3 before calling pd.concat.

marc_s · Accepted Answer · 2020-11-22 17:49:03Z

1

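import pandas as pd

# Placeholders: substitute the real paths, the shared column name, and a join type such as 'left'
dataset_1 = pd.read_csv('csv path')
dataset_2 = pd.read_csv('csv path')

new_dataset = pd.merge(dataset_1, dataset_2, left_on='same column name', right_on='same column name', how='how to join ex:left')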
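
As a concrete illustration of the template above, here is a hedged sketch that merges three made-up frames on 'DateTime' (the indicator names and values are invented). It uses on=, which is shorthand for identical left_on and right_on, prefixes the value columns so they do not collide, and folds the list with functools.reduce so the same pattern could extend to many files:

```python
from functools import reduce
import pandas as pd

# Made-up frames standing in for three indicator files.
frames = {
    "CPI": pd.DataFrame({
        "DateTime": pd.to_datetime(["2017-12-18 10:00", "2017-11-17 10:00"]),
        "Actual": [1.1, 0.9]}),
    "Loans": pd.DataFrame({
        "DateTime": pd.to_datetime(["2017-12-29 09:00", "2017-12-18 10:00"]),
        "Actual": [2.9, 2.7]}),
    "Account": pd.DataFrame({
        "DateTime": pd.to_datetime(["2017-12-20 09:00"]),
        "Actual": [30.8]}),
}

# Prefix every value column with its source name; leave the key column alone.
renamed = [
    df.rename(columns=lambda c, n=name: c if c == "DateTime" else f"{n}_{c}")
    for name, df in frames.items()
]

# Fold the list into one frame with repeated outer merges on the shared key.
merged = reduce(
    lambda left, right: pd.merge(left, right, on="DateTime", how="outer"),
    renamed,
)
merged = merged.sort_values("DateTime").reset_index(drop=True)
print(merged)
```

An outer merge keeps every timestamp from every frame. Readings that share a timestamp land in the same row, and the rest are filled with NaN.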

edited Nov 22, 2020 at 17:49

marc_s


answered Oct 31, 2020 at 19:56

ns_piumal


bolirev · Accepted Answer · 2018-01-01 19:02:38Z

The problem is twofold: merging the CSVs into a single dataframe, and then ordering it by date.

As John Smith pointed out, to merge dataframes along rows you need to use:

df = pd.concat([df1,df2,df3])

Then you want to set an index and reorder your dataframe according to the index.

df.set_index('DateTime', inplace=True)
df.sort_index(inplace=True)

or in descending order

df.sort_index(inplace=True,ascending=False)

(see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_index.html)

import numpy as np
import pandas as pd

# Ten consecutive days, shuffled into random order
timeindex = pd.date_range('2018/01/01','2018/01/10')
randtimeindex = np.random.permutation(timeindex)
# Create three dataframes
df1 = pd.DataFrame(index=range(3),data=np.random.rand(3,3),
                columns=['Actual','Consensus','DateTime'])
df1.DateTime=randtimeindex[:3]
df2 = pd.DataFrame(index=range(3),data=np.random.rand(3,3),
                columns=['Actual','Consensus','DateTime'])
df2.DateTime=randtimeindex[3:6]
df3 = pd.DataFrame(index=range(4),data=np.random.rand(4,3),
                columns=['Actual','Consensus','DateTime'])
df3.DateTime=randtimeindex[6:]

# Merge them
df4 = pd.concat([df1, df2, df3], axis=0)

# Reindex the merged dataframe, and sort it
df4.set_index('DateTime', inplace=True)
df4.sort_index(inplace=True, ascending=False)

print(df4.head())
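
One caveat when the data comes from CSV files rather than from pd.date_range: a plain read_csv leaves 'DateTime' as text, and sorting text such as '12/18/2017 10:00:00' is alphabetical, not chronological. A minimal sketch of the difference, using made-up rows and assuming the month/day/year layout shown in the question:

```python
import pandas as pd

# Made-up rows; DateTime is still text, as it would be after a plain read_csv.
df = pd.DataFrame({
    "DateTime": ["12/18/2017 10:00:00", "01/05/2018 09:00:00", "12/29/2017 09:00:00"],
    "Actual": [1.1, 1.2, 2.9],
})

# Sorting the text index puts January 2018 before December 2017.
as_text = df.set_index("DateTime").sort_index()
print(list(as_text.index))

# Converting to real datetimes first gives chronological order.
df["DateTime"] = pd.to_datetime(df["DateTime"], format="%m/%d/%Y %H:%M:%S")
as_dates = df.set_index("DateTime").sort_index()
print(as_dates)
```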
