
I am trying to parse a huge CSV file (around 50 million rows) using the pandas read_csv method.

Below is the code snippet I am using:

df_chunk = pd.read_csv(db_export_file, delimiter='~!#', engine='python', header=None, keep_default_na=False, na_values=[''], chunksize=10 ** 6, iterator=True)

Thereafter, using pd.concat, I build the full dataframe, which is used for further processing.
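For reference, the concatenation step described here is roughly the following (ignore_index is an assumption on my part):

df = pd.concat(df_chunk, ignore_index=True)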

Everything works fine, except that the read from the CSV file takes almost 6 minutes to create the dataframe.

My question: is there any other way to make this process faster while using the same module and method?

Below is a sample of the data in the CSV file:

155487~!#-64721487465~!#A1_NUM~!#1.000
155487~!#-45875722734~!#A32_ENG~!#This is a good facility
458448~!#-14588001153~!#T12_Timing~!#23-02-2015 14:50:30
458448~!#-10741214586~!#Q2_56!#

Thanks in advance

  • We would need more information, e.g. what types the columns are, whether you have repeated string data, whether they are properly typed (e.g. no mixed numeric and non-numeric data), etc. Showing us a sample would help. Commented Jul 30, 2018 at 9:21
  • I think your best choice is to split the csv and then read all the chunks using multiprocessing: stackoverflow.com/questions/36587211/… Commented Jul 30, 2018 at 9:28
  • Hi, I have updated the question with sample file data; it's actually of mixed type. Commented Jul 30, 2018 at 9:33

2 Answers


I think your best choice is to split the csv:

split -l LINES_PER_FILE YOUR.CSV OUTPUT_NAMES

and then read all the chunks using multiprocessing. Here is an example:

import os
import pandas as pd
from multiprocessing import Pool

# wrap your csv importer in a function that can be mapped
def read_csv(filename):
    """Converts a filename to a pandas dataframe."""
    return pd.read_csv(filename)


def main():
    # set up your pool
    pool = Pool(processes=8)  # or whatever your hardware can support

    # get a list of csv file names in the current directory
    files = os.listdir('.')
    file_list = [filename for filename in files if filename.endswith('.csv')]

    # have your pool map the file names to dataframes
    df_list = pool.map(read_csv, file_list)
    pool.close()
    pool.join()

    # reduce the list of dataframes to a single dataframe
    combined_df = pd.concat(df_list, ignore_index=True)
    return combined_df

if __name__ == '__main__':
    main()
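For the file shown in the question, each worker's read would presumably need the same parsing options as the original call (multi-character delimiter, no header, custom NA handling). A sketch, with read_chunk as an illustrative name:

def read_chunk(filename):
    # same options as the read_csv call in the question
    return pd.read_csv(filename, delimiter='~!#', engine='python',
                       header=None, keep_default_na=False, na_values=[''])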



My case and how it was solved

I had a similarly huge dataset, and a custom converter had to be implemented. pandas.read_csv() was taking ages because of that custom converter.

The solution for me was to use modin. It was simple: I just had to change the import at the top, and everything else was handled automatically.

Take a look at the page: https://github.com/modin-project/modin
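The change amounts to swapping the pandas import for modin's drop-in replacement; roughly like this (the file path is a placeholder):

import modin.pandas as pd

df = pd.read_csv('your_file.csv')  # same read_csv API; modin parallelizes the work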
