
I am trying to parse a huge CSV file (around 50 million rows) using the pandas read_csv method.

Below is the code snippet I am using:

df_chunk = pd.read_csv(db_export_file, delimiter='~!#', engine='python', header=None, keep_default_na=False, na_values=[''], chunksize=10 ** 6, iterator=True)

Thereafter, using pd.concat, I build the full dataframe, which is used for further processing.
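For reference, the concatenation step described here is roughly the following (ignore_index is an assumption on my part):

df = pd.concat(df_chunk, ignore_index=True)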

Everything works fine, except that the read from the CSV file takes almost 6 minutes to create the dataframe.

My question: is there any other way to make this process faster while using the same module and method?

Below is a sample of the data in the CSV file:

155487~!#-64721487465~!#A1_NUM~!#1.000
155487~!#-45875722734~!#A32_ENG~!#This is a good facility
458448~!#-14588001153~!#T12_Timing~!#23-02-2015 14:50:30
458448~!#-10741214586~!#Q2_56!#

Thanks in advance

  • We would need more information, e.g. what types the columns are, whether you have repeated string data, whether they are properly typed (e.g. no mixed numeric and non-numeric data), etc. Showing us a sample would help. Commented Jul 30, 2018 at 9:21
  • I think your best choice is to split the csv and then read all the chunks using multiprocessing: stackoverflow.com/questions/36587211/… Commented Jul 30, 2018 at 9:28
  • Hi, I have updated the question with sample file data; it's actually of mixed type. Commented Jul 30, 2018 at 9:33

2 Answers


I think your best choice is to split the csv:

split -l LINES_PER_FILE YOUR.CSV OUTPUT_NAMES

and then read all the chunks using multiprocessing. Here is an example:

import os
import pandas as pd
from multiprocessing import Pool

# wrap your csv importer in a function that can be mapped
def read_csv(filename):
    """Converts a filename to a pandas dataframe."""
    return pd.read_csv(filename)


def main():
    # set up your pool
    pool = Pool(processes=8)  # or whatever your hardware can support

    # get a list of csv file names in the current directory
    files = os.listdir('.')
    file_list = [filename for filename in files if filename.endswith('.csv')]

    # have your pool map the file names to dataframes
    df_list = pool.map(read_csv, file_list)
    pool.close()
    pool.join()

    # reduce the list of dataframes to a single dataframe
    combined_df = pd.concat(df_list, ignore_index=True)
    return combined_df

if __name__ == '__main__':
    main()
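For the file shown in the question, each worker's read would presumably need the same parsing options as the original call (multi-character delimiter, no header, custom NA handling). A sketch, with read_chunk as an illustrative name:

def read_chunk(filename):
    # same options as the read_csv call in the question
    return pd.read_csv(filename, delimiter='~!#', engine='python',
                       header=None, keep_default_na=False, na_values=[''])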



My case and how it was solved

I had a similarly huge dataset, and a custom converter had to be implemented. pandas.read_csv() was taking ages because of that custom converter.

The solution for me was to use modin. It was simple: I just had to change the import at the top, and everything else was handled automatically.

Take a look at the page: https://github.com/modin-project/modin
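The change amounts to swapping the pandas import for modin's drop-in replacement; roughly like this (the file path is a placeholder):

import modin.pandas as pd

df = pd.read_csv('your_file.csv')  # same read_csv API; modin parallelizes the work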
