
I am trying to read 3 years of data files (one for each date), and the portion I am interested in is often quite small (about 1.4 million rows in total) compared to the parent files (each about 90 MB and 1.5 million rows). The code below has worked pretty well for me in the past with a smaller number of files, but with 1095 files to process it is crawling (taking about 3-4 seconds to read one file). Any suggestions for making this more efficient/faster?

import pandas as pd
from glob import glob

file_list = glob(r'C:\Temp2\dl*.csv')
for file in file_list:
    print(file)
    df = pd.read_csv(file, header=None)
    df = df[[0, 1, 3, 4, 5]]           # keep only the columns of interest
    df2 = df[df[0].isin(det_list)]     # det_list: list of IDs to keep, defined earlier
    if file_list[0] == file:
        rawdf = df2
    else:
        rawdf = rawdf.append(df2)
  • You could specify the dtypes of the columns (a sketch follows these comments). Commented Aug 14, 2017 at 18:42
  • Thanks, @djk47463: the final df has 4 integers and 1 datetime field. Would that improve the reading/processing speed? Commented Aug 14, 2017 at 20:21
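For reference, a minimal sketch of the dtype suggestion, assuming the four integer columns sit at positions 0, 1, 3 and 4 and the datetime at position 5 (the question does not say which column holds the datetime, and the file name below is a placeholder):

import pandas as pd

cols = [0, 1, 3, 4, 5]
int_cols = {0: 'int64', 1: 'int64', 3: 'int64', 4: 'int64'}  # hypothetical positions of the integer columns

# Explicit dtypes let pandas skip type inference on these columns.
df = pd.read_csv(r'C:\Temp2\dl_example.csv',   # placeholder file name
                 header=None,
                 usecols=cols,
                 dtype=int_cols,
                 parse_dates=[5])              # assumes column 5 holds the datetime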

1 Answer


IIUC, try this:

import pandas as pd
from glob import glob

file_list = glob(r'C:\Temp2\dl*.csv')

cols = [0, 1, 3, 4, 5]

# Read only the needed columns, filter each file as it is read,
# and concatenate everything once at the end.
df = pd.concat([pd.read_csv(f, header=None, usecols=cols)
                  .add_prefix('c')             # columns become c0, c1, c3, c4, c5
                  .query("c0 in @det_list")    # keep only rows whose c0 is in det_list
                for f in file_list],
               ignore_index=True)
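The main wins here are usecols, so the parser only materialises the five columns that are actually needed, and building a list of already-filtered frames that is concatenated once, instead of appending to a growing DataFrame (which copies the accumulated data on every iteration).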

1 Comment

Thanks, @MaxU. This code also pulls the data I need, but it still took 30:24 minutes (I timed this run). I will rerun my original code later and post the time in the question.
