15

I have around 600 CSV datasets, all with the very same column names [‘DateTime’, ‘Actual’, ‘Consensus’, ‘Previous’, ‘Revised’]. All of them are economic indicators, and all are time series.

The aim is to merge them all into one CSV file, with ‘DateTime’ as the index.

I want the file ordered along a single timeline. For example, say the first event in the first CSV is dated 12/18/2017 10:00:00, the first event in the second CSV is dated 12/29/2017 09:00:00, and the first event in the third CSV is dated 12/20/2017 09:00:00.

I want them indexed by time, the earlier first and the newer after it, regardless of which source CSV each row originally came from.

I tried merging just 3 of them as an experiment. The problem is ‘DateTime’: it prints the three values together, like this: ('12/18/2017 10:00:00', '12/29/2017 09:00:00', '12/20/2017 09:00:00'). Here is the code:

import pandas as pd


# Raw strings (r"...") keep the backslashes in Windows paths from being read as escapes
df1 = pd.read_csv(r"E:\Business\Economic Indicators\Consumer Price Index - Core (YoY) - European Monetary Union.csv")
df2 = pd.read_csv(r"E:\Business\Economic Indicators\Private loans (YoY) - European Monetary Union.csv")
df3 = pd.read_csv(r"E:\Business\Economic Indicators\Current Account s.a - European Monetary Union.csv")

df = pd.concat([df1, df2, df3], axis=1, join='inner')
df.set_index('DateTime', inplace=True)

print(df.head())
df.to_csv('df.csv')
1
  • can you give an example input and output? Commented Jan 1, 2018 at 18:37

4 Answers

17

Consider using the read_csv() arguments index_col and parse_dates to create the index during import and parse it as datetime. Then run the horizontal merge you need. The code below assumes the date is in the first column of each CSV. At the end, call sort_index() on the final dataframe to sort by datetime.

df1 = pd.read_csv(r"E:\Business\Economic Indicators\Consumer Price Index - Core (YoY) - European Monetary Union.csv",
                  index_col=[0], parse_dates=[0])
df2 = pd.read_csv(r"E:\Business\Economic Indicators\Private loans (YoY) - European Monetary Union.csv",
                  index_col=[0], parse_dates=[0])
df3 = pd.read_csv(r"E:\Business\Economic Indicators\Current Account s.a - European Monetary Union.csv",
                  index_col=[0], parse_dates=[0])

finaldf = pd.concat([df1, df2, df3], axis=1, join='inner').sort_index()

For a DRY-er approach, especially across hundreds of CSV files, use a list comprehension:

import os
...
os.chdir('E:\\Business\\Economic Indicators')

dfs = [pd.read_csv(f, index_col=[0], parse_dates=[0])
        for f in os.listdir(os.getcwd()) if f.endswith('csv')]

finaldf = pd.concat(dfs, axis=1, join='inner').sort_index()
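
When the files cover different dates, an inner join keeps only the timestamps present in every file, which can leave nothing at all. The following is a minimal sketch of the same idea with join='outer', plus keys= so that each file's columns stay identifiable. The in-memory CSV text and the names cpi and loans are invented for illustration and stand in for the real files.

```python
import io
import pandas as pd

# Hypothetical stand-ins for two of the CSV files (contents are invented)
csv_texts = {
    "cpi": "DateTime,Actual,Consensus\n"
           "12/18/2017 10:00:00,0.9,0.9\n"
           "11/16/2017 10:00:00,0.9,1.1\n",
    "loans": "DateTime,Actual,Consensus\n"
             "12/29/2017 09:00:00,2.8,2.8\n"
             "12/18/2017 10:00:00,2.7,2.7\n",
}

# Same read_csv arguments as above: first column becomes a datetime index
dfs = {name: pd.read_csv(io.StringIO(text), index_col=[0], parse_dates=[0])
       for name, text in csv_texts.items()}

# join="outer" keeps every timestamp; keys= adds an outer column level per source
finaldf = pd.concat(list(dfs.values()), axis=1, join="outer",
                    keys=list(dfs.keys())).sort_index()
print(finaldf)
```

Rows where an indicator has no reading at a given timestamp are filled with NaN, and columns are addressed as pairs such as ("loans", "Actual").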

2 Comments

Thank you. I tried your code and the result was an empty set; I changed the join arg to 'outer' and the code works.
Sounds good. I almost changed that per one of your comments. Glad to help. Happy coding and happy new year!
2

You're trying to build one large dataframe out of the rows of many dataframes that all have the same column names. axis should be 0 (the default), not 1. You also don't need to specify a join type; it has no effect, since the column names are the same for each dataframe.

df = pd.concat([df1, df2, df3])

should be enough to concatenate the datasets.

(see https://pandas.pydata.org/pandas-docs/stable/merging.html )

Your call to set_index to define an index using the values in the DateTime column should then work.
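
To make the difference between the two axes concrete, here is a small sketch with invented one-row frames. The values are made up; only the column names come from the question.

```python
import pandas as pd

# Invented sample frames sharing the same column names
df1 = pd.DataFrame({"DateTime": ["12/18/2017 10:00:00"], "Actual": [0.9]})
df2 = pd.DataFrame({"DateTime": ["12/29/2017 09:00:00"], "Actual": [2.8]})
df3 = pd.DataFrame({"DateTime": ["12/20/2017 09:00:00"], "Actual": [30.8]})

rows = pd.concat([df1, df2, df3])          # axis=0: stack rows, align columns by name
cols = pd.concat([df1, df2, df3], axis=1)  # axis=1: place frames side by side

print(rows.shape)  # three rows, two columns
print(cols.shape)  # one row, six columns (three of them named DateTime)

# With a single DateTime column, set_index behaves as expected
indexed = rows.set_index("DateTime")
print(indexed)
```

Note that stacking rows this way does not record which file a row came from.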

2 Comments

You've got me wrong, bro. These are different economic indicators with different time steps. If there are 10 files with 5 columns each, I need 50 columns to keep them identifiable, so I can't simply blend them. I need a new dataset containing all the readings from the original sets: if event 1 occurs at time x and event 2 at time y, and x and y are the same, put them in the same row but each in its own column; if the times differ, each goes in its own row and column. Then index everything by time (earlier first and the newer at the end), regardless of the source dataset.
Then you need to use set_index('DateTime', inplace=True) on the dataframes df1, df2, df3 before calling pd.concat.
1
import pandas as pd

dataset_1 = pd.read_csv('csv path')
dataset_2 = pd.read_csv('csv path')

# Join on the shared column; `how` can be 'left', 'right', 'inner' or 'outer'
new_dataset = pd.merge(dataset_1, dataset_2, on='same column name', how='left')
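
As a hedged illustration of how this could look for the question's data, the sketch below merges two invented frames on DateTime with how='outer', so no reading is dropped. The suffixes "_cpi" and "_loans" are illustrative names, not something from the original files.

```python
import pandas as pd

# Invented sample data for two indicators
cpi = pd.DataFrame({
    "DateTime": pd.to_datetime(["2017-12-18 10:00", "2017-11-16 10:00"]),
    "Actual": [0.9, 0.9],
})
loans = pd.DataFrame({
    "DateTime": pd.to_datetime(["2017-12-29 09:00", "2017-12-18 10:00"]),
    "Actual": [2.8, 2.7],
})

# Outer join keeps timestamps found in either frame; suffixes keep the
# identically named columns apart
merged = pd.merge(cpi, loans, on="DateTime", how="outer",
                  suffixes=("_cpi", "_loans"))
merged = merged.sort_values("DateTime").set_index("DateTime")
print(merged)
```

pd.merge combines two frames at a time, so for hundreds of files it would have to be applied repeatedly.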

0

The problem is twofold: merging the CSVs into a single dataframe, and then ordering it by date.

As John Smith pointed out, to merge dataframes along rows you need to use:

df = pd.concat([df1,df2,df3])

Then you want to set an index and reorder your dataframe according to the index.

df.set_index('DateTime', inplace=True)
df.sort_index(inplace=True)

or, in descending order:

df.sort_index(inplace=True,ascending=False)

(see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_index.html)
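
One caveat worth sketching: if DateTime is still plain text in the MM/DD/YYYY format shown in the question, sort_index compares strings rather than dates. The example below uses invented values and assumes that format; converting with pd.to_datetime first gives a chronological order.

```python
import pandas as pd

# Invented rows; the DateTime strings follow the format shown in the question
df = pd.DataFrame({
    "DateTime": ["12/18/2017 10:00:00", "01/05/2018 09:00:00", "12/20/2017 09:00:00"],
    "Actual": [0.9, 2.8, 30.8],
})

# Sorting the raw strings compares text, so January 2018 lands before December 2017
as_text = df.set_index("DateTime").sort_index()

# Converting to datetime first gives a true chronological order
df["DateTime"] = pd.to_datetime(df["DateTime"], format="%m/%d/%Y %H:%M:%S")
as_dates = df.set_index("DateTime").sort_index()

print(as_text)
print(as_dates)
```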


import numpy as np
import pandas as pd

timeindex = pd.date_range('2018/01/01','2018/01/10')
randtimeindex = np.random.permutation(timeindex)
# Create three dataframes
df1 = pd.DataFrame(index=range(3),data=np.random.rand(3,3),
                columns=['Actual','Consensus','DateTime'])
df1.DateTime=randtimeindex[:3]
df2 = pd.DataFrame(index=range(3),data=np.random.rand(3,3),
                columns=['Actual','Consensus','DateTime'])
df2.DateTime=randtimeindex[3:6]
df3 = pd.DataFrame(index=range(4),data=np.random.rand(4,3),
                columns=['Actual','Consensus','DateTime'])
df3.DateTime=randtimeindex[6:]

# Merge them
df4 = pd.concat([df1, df2, df3], axis=0)

# Reindex the merged dataframe, and sort it
df4.set_index('DateTime', inplace=True)
df4.sort_index(inplace=True, ascending=False)

print(df4.head())
