
I am trying to read 3 years of data files (one for each date), and the portion I am interested in is often quite small (about 1.4 million rows in total) compared to the parent files (each about 90 MB and 1.5 million rows). The code below has worked pretty well for me in the past with a smaller number of files, but with 1095 files to process it is crawling (taking about 3-4 seconds to read one file). Any suggestions for making this more efficient/faster?

import pandas as pd
from glob import glob

file_list = glob(r'C:\Temp2\dl*.csv')
for file in file_list:
    print(file)
    df = pd.read_csv(file, header=None)
    df = df[[0, 1, 3, 4, 5]]           # keep only the columns of interest
    df2 = df[df[0].isin(det_list)]     # det_list: list of IDs to keep, defined earlier
    if file_list[0] == file:
        rawdf = df2
    else:
        rawdf = rawdf.append(df2)
  • You could specify the dtypes of the columns (a sketch follows these comments). Commented Aug 14, 2017 at 18:42
  • Thanks, @djk47463: the final df has 4 integers and 1 datetime field. Would that improve the reading/processing speed? Commented Aug 14, 2017 at 20:21
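For reference, a minimal sketch of the dtype suggestion, assuming the four integer columns sit at positions 0, 1, 3 and 4 and the datetime at position 5 (the question does not say which column holds the datetime, and the file name below is a placeholder):

import pandas as pd

cols = [0, 1, 3, 4, 5]
int_cols = {0: 'int64', 1: 'int64', 3: 'int64', 4: 'int64'}  # hypothetical positions of the integer columns

# Explicit dtypes let pandas skip type inference on these columns.
df = pd.read_csv(r'C:\Temp2\dl_example.csv',   # placeholder file name
                 header=None,
                 usecols=cols,
                 dtype=int_cols,
                 parse_dates=[5])              # assumes column 5 holds the datetime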

1 Answer


IIUC, try this:

import pandas as pd
from glob import glob

file_list = glob(r'C:\Temp2\dl*.csv')

cols = [0, 1, 3, 4, 5]

# Read only the needed columns, filter each file as it is read,
# and concatenate everything once at the end.
df = pd.concat([pd.read_csv(f, header=None, usecols=cols)
                  .add_prefix('c')             # columns become c0, c1, c3, c4, c5
                  .query("c0 in @det_list")    # keep only rows whose c0 is in det_list
                for f in file_list],
               ignore_index=True)
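The main wins here are usecols, so the parser only materialises the five columns that are actually needed, and building a list of already-filtered frames that is concatenated once, instead of appending to a growing DataFrame (which copies the accumulated data on every iteration).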

1 Comment

Thanks, @MaxU. This code also pulls the data I need, but it still took 30:24 minutes (I timed this run). I will rerun my original code later and post the time in the question.
