
I have multiple CSV files saved in one folder, all with the same column layout, and want to load them into Python as a pandas DataFrame.

The question is really similar to this thread.

I am using the following code:

import glob
import pandas as pd
salesdata = pd.DataFrame()
for f in glob.glob(r"TransactionData\Promorelevant\*.csv"):
    appenddata = pd.read_csv(f, header=None, sep=";")
    salesdata = salesdata.append(appenddata, ignore_index=True)

Is there a better solution, perhaps with another package?

This is taking too much time.

Thanks

4 Answers


I suggest using a list comprehension with concat:

import glob
import pandas as pd

files = glob.glob(r"TransactionData\Promorelevant\*.csv")
dfs = [pd.read_csv(f, header=None, sep=";") for f in files]

salesdata = pd.concat(dfs, ignore_index=True)
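For illustration, the same one-read-per-file pattern can be sketched end to end with pathlib; the `promorelevant_demo` folder and the two toy files below are invented stand-ins for the real `TransactionData\Promorelevant` data:

```python
from pathlib import Path
import pandas as pd

# Toy stand-in for the real folder: two semicolon-separated CSVs
# with the same column layout and no header row.
folder = Path("promorelevant_demo")
folder.mkdir(exist_ok=True)
(folder / "0.csv").write_text("1;a\n2;b\n")
(folder / "1.csv").write_text("3;c\n4;d\n")

# One read per file and a single concat at the end -- this avoids the
# repeated copying that growing a DataFrame with append causes.
dfs = [pd.read_csv(f, header=None, sep=";") for f in sorted(folder.glob("*.csv"))]
salesdata = pd.concat(dfs, ignore_index=True)
print(salesdata.shape)  # (4, 2)
```

Sorting the glob results keeps the row order deterministic, since glob makes no ordering guarantee.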

3 Comments

pd.read_csv can load data from a file path; any specific reason for using glob?
@Shiva - Yes, glob returns all the file paths, so it is necessary.
This is barely any different from the question? Though I guess concat can be faster than append (or does append use concat behind the scenes?). As a single operation it may be better optimized.

With help from the link to the actual answer:

This seems to be the best one liner:

import glob, os
import pandas as pd

# '' is the folder to search; replace it with your directory
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "*.csv"))))

1 Comment

Could you explain your solution?

Maybe using bash will be faster:

head -n 1 "TransactionData/Promorelevant/0.csv" > merged.csv
tail -q -n +2 TransactionData/Promorelevant/*.csv >> merged.csv

Or if using from within a jupyter notebook

!head -n 1 "TransactionData/Promorelevant/0.csv" > merged.csv
!tail -q -n +2 TransactionData/Promorelevant/*.csv >> merged.csv

The idea being that you won't need to parse anything.

The first command copies the header of one of the files. You can skip this line if you don't have a header. tail then skips the header of every file and appends the remaining rows to the merged CSV.

Appending in Python is probably more expensive.
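For comparison, a similar no-parse merge can be sketched in pure Python as well; the `merge_demo` folder and toy files below are invented stand-ins for the real data, but the idea is the same as the bash version: copy one header, then append raw data rows without ever parsing them:

```python
import glob
import os

# Toy stand-in for TransactionData/Promorelevant: two CSVs sharing a header.
os.makedirs("merge_demo", exist_ok=True)
with open("merge_demo/0.csv", "w") as f:
    f.write("id;val\n1;a\n2;b\n")
with open("merge_demo/1.csv", "w") as f:
    f.write("id;val\n3;c\n4;d\n")

files = sorted(glob.glob("merge_demo/*.csv"))
with open("merged.csv", "w") as out:
    with open(files[0]) as first:
        out.write(first.readline())   # copy the header line once
    for path in files:
        with open(path) as f:
            f.readline()              # skip this file's header
            out.write(f.read())       # append the data rows verbatim
```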

Of course, make sure the merged file still parses correctly with pandas:

pd.read_csv("merged.csv")

Curious to see your benchmark.

6 Comments

@PV8 What do you not understand? The first line copies the header of one of the files to merged.csv; the second line appends all CSVs while omitting their headers. Since it does not have to parse anything it will be lightning fast.
Assuming my files in the folder are named 0.csv, 1.csv and so on, and the folder path is still TransactionData\Promorelevant, what do I have to write to use your code?
@PV8 Updated the example with the folder, does it work?
the ! is identified as invalid syntax; I have to run this in a Jupyter notebook, right?
@PV8 Updated to add quotes and better slashes, and you have to have the ! in front if using Jupyter. The server the notebook is running on is Linux-based, right?

I checked all of these approaches except the bash one with the time function (only one run; note also that the files are on a shared drive).

Here are the results:

My approach: 1220.49

List comprehension + concat: 1135.53

concat+map+join: 1116.31

I will go for list comprehension + concat, which saves me some minutes and which I feel quite familiar with.

Thanks for your ideas.

