
I'm reading a bulk CSV file. I have no problem when the file has around 5 million rows, but when I try to run this code on a massive file of around 300 million rows, it doesn't work. Is there any way to enhance the code, or a chunking approach that would improve the response time?

import pandas as pd
import timeit

df = pd.read_csv(
    '/home/mahmoudod/Desktop/to_dict/text1.txt',
    dtype='unicode',
    index_col=False,
    error_bad_lines=False,
    sep=';',
    low_memory=False,
    names=['DATE', 'IMSI', 'WEBSITE', 'LINKUP', 'LINKDOWN', 'COUNT', 'CONNECTION'],
)
#df.DATE = pd.to_datetime(df.DATE)
group = df.groupby(['IMSI', 'WEBSITE']).agg({
    'DATE': ['min', 'max'],
    'LINKUP': 'sum',
    'LINKDOWN': 'sum',
    'COUNT': 'max',
    'CONNECTION': 'sum',
})
group.to_csv('/home/mahmoudod/Desktop/to_dict/output.txt')

1 Answer

One solution is offered by dask.dataframe, which chunks internally:

import dask.dataframe as dd

df = dd.read_csv(...)
group = df.groupby(...).aggregate({...}).compute()
group.to_csv('output.txt')

This isn't tested. I suggest you read the documentation to familiarize yourself with the syntax. The important point to understand is that dd.read_csv does not read the whole file into memory, and no operations are executed until compute is called, at which point dask processes the data in constant memory by working on chunks.
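For reference, here is what that sketch might look like with the separator, column names, and aggregations from the question filled in. Like the snippet above, this is untested; the blocksize value and the numeric casting step are assumptions you may need to adjust.

import dask.dataframe as dd

# Read lazily; blocksize controls how much of the file each partition covers (assumed value).
df = dd.read_csv(
    '/home/mahmoudod/Desktop/to_dict/text1.txt',
    sep=';',
    names=['DATE', 'IMSI', 'WEBSITE', 'LINKUP', 'LINKDOWN', 'COUNT', 'CONNECTION'],
    dtype='object',        # read everything as strings, like dtype='unicode' in the question
    blocksize='64MB',      # assumption: partition size, tune to available memory
)

# Because everything was read as strings, cast the columns that get summed to numeric first.
for col in ['LINKUP', 'LINKDOWN', 'COUNT', 'CONNECTION']:
    df[col] = dd.to_numeric(df[col], errors='coerce')

# Same aggregation spec as the pandas code; nothing is actually read or computed
# until .compute() is called, and dask then works through the file chunk by chunk.
group = df.groupby(['IMSI', 'WEBSITE']).agg({
    'DATE': ['min', 'max'],
    'LINKUP': 'sum',
    'LINKDOWN': 'sum',
    'COUNT': 'max',
    'CONNECTION': 'sum',
}).compute()

group.to_csv('/home/mahmoudod/Desktop/to_dict/output.txt')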
