
I would like to read multiple CSV files (hundreds of files, each a few hundred lines long, but all with the same number of columns) from a target directory into a single pandas DataFrame.

The code I wrote below works, but it is too slow: it takes minutes to run on just 30 files (so how long would it take to load all of them?). What can I change to make it faster?

Besides, in the replace function I want to replace a "_" (I don't know its encoding, but it is not a normal underscore) with a normal UTF-8 "-". How can I do that? I use coding=latin-1 because the files contain French accents.

#coding=latin-1

import pandas as pd
import glob

pd.set_option('expand_frame_repr', False)

path = r'D:\Python27\mypfe\data_test'
allFiles = glob.glob(path + "/*.csv")
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col = None, header = 0, sep = ';', dayfirst = True, 
                     parse_dates=['HeurePrevue','HeureDebutTrajet','HeureArriveeSurSite','HeureEffective'])
    df.drop(labels=['aPaye','MethodePaiement','ArgentPercu'],axis=1,inplace=True)
    df['Sens'].replace("\n", "-", inplace=True,regex=True)
    list_.append(df)

    print "fichier lu:",file_

frame = pd.concat(list_)

print frame
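
On the replace question: since the exact character is unknown, one approach is to inspect the raw value first and then swap whatever code point shows up for a plain "-". A minimal sketch, assuming an illustrative file path and that the character sits in the 'Sens' column; u'\x96' is only an example of what repr() might reveal:

# -*- coding: latin-1 -*-
import pandas as pd

# read a single file with the same encoding used for the source files
sample = pd.read_csv(r'D:\Python27\mypfe\data_test\exemple.csv',
                     sep=';', encoding='latin-1')

# repr() shows the exact code point of the strange character,
# e.g. u'\x96' instead of what looks like an ordinary '_'
print(repr(sample['Sens'].iloc[0]))

# once the code point is known, replace it in one vectorised call
sample['Sens'] = sample['Sens'].str.replace(u'\x96', u'-')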
  • Can you just concatenate all files together before you ever get to Python? That way just one file gets read? An alternative would be to zip the files together, which will give you random access but with the benefits of the operating system's file cache. Commented Jul 12, 2016 at 13:41
  • I don't really understand how to concatenate before getting to Python. Do you mean making all the files into one CSV file? And for the zip, at which step? Commented Jul 12, 2016 at 13:47
  • Yes, make one big CSV file or make a zip file. These have nothing to do with Python; you'll have to do this externally to Python. Once you have one file with all of the data, then have your Python script read that file directly. Commented Jul 12, 2016 at 13:54
  • It sounds like a solution, but how do I do that automatically? And how do I make a zip file? I would still prefer to do it inside Python, since I don't know exactly how many files I have to read (see the merge sketch after these comments). Commented Jul 12, 2016 at 14:04
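
For reference, a minimal sketch of the merge suggested in the comments, done once inside Python with plain file I/O; the output path is illustrative, and it assumes every file starts with the same header row:

import glob

path = r'D:\Python27\mypfe\data_test'
all_files = glob.glob(path + "/*.csv")

# write the combined file outside the glob'ed folder so it is not
# picked up again on a later run
with open(r'D:\Python27\mypfe\combined.csv', 'wb') as out:
    for i, name in enumerate(all_files):
        with open(name, 'rb') as src:
            header = src.readline()
            if i == 0:          # keep the header from the first file only
                out.write(header)
            out.write(src.read())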

1 Answer


You may try the following: read only those columns that you really need, build the DataFrames in a list comprehension, and call pd.concat([ ... ], ignore_index=True) just once, because concat is pretty slow:

# there is no sense to read columns that you don't need
# specify the column list (EXCLUDING: 'aPaye','MethodePaiement','ArgentPercu')
cols = ['col1', 'col2', 'etc.']
date_cols = ['HeurePrevue','HeureDebutTrajet','HeureArriveeSurSite','HeureEffective']

df = pd.concat(
        [pd.read_csv(f, sep = ';', dayfirst = True, usecols=cols, 
                     parse_dates=date_cols)
         for f in allFiles
        ],
        ignore_index=True
     )

This should work as long as you have enough memory to hold both the list of per-file DataFrames and the resulting concatenated one...
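
If the replacement from the original loop is still needed, it is usually cheaper to run it once on the concatenated frame rather than once per file (a sketch, assuming 'Sens' is kept in the cols list):

# one vectorised pass over the final frame instead of one pass per file
df['Sens'] = df['Sens'].replace("\n", "-", regex=True)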


2 Comments

It works for the date_cols, but when it comes to usecols it raises an error: ValueError: Usecols do not match names. And if I use column numbers instead, it fails with IndexError: list index out of range
Well, you have to specify the actual column names that you need in the cols list - I think the error message ValueError: Usecols do not match names is pretty self-explanatory (see the sketch below for one way to list the real names)...
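
A quick way to build a valid cols list is to read just the header row of one file and filter out the columns that were being dropped (a sketch reusing allFiles from the question):

import pandas as pd

# nrows=0 reads only the header, so this is cheap
all_cols = pd.read_csv(allFiles[0], sep=';', nrows=0).columns.tolist()
print(all_cols)

# keep everything except the three columns dropped in the original code
cols = [c for c in all_cols if c not in ('aPaye', 'MethodePaiement', 'ArgentPercu')]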
