
I would like to read multiple CSV files (hundreds of files, each a few hundred lines long, but all with the same number of columns) from a target directory into a single pandas DataFrame.

The code I wrote below works, but it is too slow: it takes minutes to run on just 30 files (so how long would it take to load all of them?). What can I change to make it faster?

Besides, in the replace function I want to replace a "_" (I don't know its encoding, but it is not a normal underscore) with a normal UTF-8 "-". How can I do that? I use coding=latin-1 because the files contain French accents.

#coding=latin-1

import pandas as pd
import glob

pd.set_option('expand_frame_repr', False)

path = r'D:\Python27\mypfe\data_test'
allFiles = glob.glob(path + "/*.csv")
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col = None, header = 0, sep = ';', dayfirst = True, 
                     parse_dates=['HeurePrevue','HeureDebutTrajet','HeureArriveeSurSite','HeureEffective'])
    df.drop(labels=['aPaye','MethodePaiement','ArgentPercu'],axis=1,inplace=True)
    df['Sens'].replace("\n", "-", inplace=True,regex=True)
    list_.append(df)

    print "fichier lu:",file_

frame = pd.concat(list_)

print frame
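
On the replace question: since the exact character is unknown, one approach is to inspect the raw value first and then swap whatever code point shows up for a plain "-". A minimal sketch, assuming an illustrative file path and that the character sits in the 'Sens' column; u'\x96' is only an example of what repr() might reveal:

# -*- coding: latin-1 -*-
import pandas as pd

# read a single file with the same encoding used for the source files
sample = pd.read_csv(r'D:\Python27\mypfe\data_test\exemple.csv',
                     sep=';', encoding='latin-1')

# repr() shows the exact code point of the strange character,
# e.g. u'\x96' instead of what looks like an ordinary '_'
print(repr(sample['Sens'].iloc[0]))

# once the code point is known, replace it in one vectorised call
sample['Sens'] = sample['Sens'].str.replace(u'\x96', u'-')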
  • Can you just concatenate all files together before you ever get to Python? That way just one file gets read? An alternative would be to zip the files together, which will give you random access but with the benefits of the operating system's file cache. Commented Jul 12, 2016 at 13:41
  • I don't really understand how to concatenate before getting to Python. Do you mean making all the files into one CSV file? And for the zip, at which step? Commented Jul 12, 2016 at 13:47
  • Yes, make one big CSV file or make a zip file. These have nothing to do with Python; you'll have to do this externally to Python. Once you have one file with all of the data, then have your Python script read that file directly. Commented Jul 12, 2016 at 13:54
  • It sounds like a solution, but how do I do that automatically? And how do I make a zip file? I would still prefer to do it inside Python, since I don't know exactly how many files I have to read (see the merge sketch after these comments). Commented Jul 12, 2016 at 14:04
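
For reference, a minimal sketch of the merge suggested in the comments, done once inside Python with plain file I/O; the output path is illustrative, and it assumes every file starts with the same header row:

import glob

path = r'D:\Python27\mypfe\data_test'
all_files = glob.glob(path + "/*.csv")

# write the combined file outside the glob'ed folder so it is not
# picked up again on a later run
with open(r'D:\Python27\mypfe\combined.csv', 'wb') as out:
    for i, name in enumerate(all_files):
        with open(name, 'rb') as src:
            header = src.readline()
            if i == 0:          # keep the header from the first file only
                out.write(header)
            out.write(src.read())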

1 Answer


You may try the following: read only those columns that you really need, build the DataFrames in a list comprehension, and call pd.concat([ ... ], ignore_index=True) just once, because concat is pretty slow:

# there is no sense to read columns that you don't need
# specify the column list (EXCLUDING: 'aPaye','MethodePaiement','ArgentPercu')
cols = ['col1', 'col2', 'etc.']
date_cols = ['HeurePrevue','HeureDebutTrajet','HeureArriveeSurSite','HeureEffective']

df = pd.concat(
        [pd.read_csv(f, sep = ';', dayfirst = True, usecols=cols, 
                     parse_dates=date_cols)
         for f in allFiles
        ],
        ignore_index=True
     )

This should work as long as you have enough memory to hold both the list of per-file DataFrames and the resulting concatenated one...
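
If the replacement from the original loop is still needed, it is usually cheaper to run it once on the concatenated frame rather than once per file (a sketch, assuming 'Sens' is kept in the cols list):

# one vectorised pass over the final frame instead of one pass per file
df['Sens'] = df['Sens'].replace("\n", "-", regex=True)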


2 Comments

It works for the date_cols, but when it comes to usecols it raises an error: ValueError: Usecols do not match names. And if I use column numbers instead, it fails with IndexError: list index out of range
Well, you have to specify the actual column names that you need in the cols list - I think the error message ValueError: Usecols do not match names is pretty self-explanatory (see the sketch below for one way to list the real names)...
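
A quick way to build a valid cols list is to read just the header row of one file and filter out the columns that were being dropped (a sketch reusing allFiles from the question):

import pandas as pd

# nrows=0 reads only the header, so this is cheap
all_cols = pd.read_csv(allFiles[0], sep=';', nrows=0).columns.tolist()
print(all_cols)

# keep everything except the three columns dropped in the original code
cols = [c for c in all_cols if c not in ('aPaye', 'MethodePaiement', 'ArgentPercu')]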
