
I'm reading a bulk CSV file. I have no problem when the file has around 5 million rows, but when I try to run this code on a massive file of around 300 million rows, it doesn't work. Is there any way to enhance the code, or a chunking approach that would improve the response time?

import pandas as pd
import timeit

df = pd.read_csv(
    '/home/mahmoudod/Desktop/to_dict/text1.txt',
    dtype='unicode',
    index_col=False,
    error_bad_lines=False,
    sep=';',
    low_memory=False,
    names=['DATE', 'IMSI', 'WEBSITE', 'LINKUP', 'LINKDOWN', 'COUNT', 'CONNECTION'],
)
#df.DATE = pd.to_datetime(df.DATE)
group = df.groupby(['IMSI', 'WEBSITE']).agg({
    'DATE': ['min', 'max'],
    'LINKUP': 'sum',
    'LINKDOWN': 'sum',
    'COUNT': 'max',
    'CONNECTION': 'sum',
})
group.to_csv('/home/mahmoudod/Desktop/to_dict/output.txt')

1 Answer

One solution is offered by dask.dataframe, which chunks internally:

import dask.dataframe as dd

df = dd.read_csv(...)
group = df.groupby(...).aggregate({...}).compute()
group.to_csv('output.txt')

This isn't tested. I suggest you read the documentation to familiarize yourself with the syntax. The important point to understand is that dd.read_csv does not read the whole file into memory, and no operations are executed until compute is called, at which point dask processes the data in constant memory by working on chunks.
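For reference, here is what that sketch might look like with the separator, column names, and aggregations from the question filled in. Like the snippet above, this is untested; the blocksize value and the numeric casting step are assumptions you may need to adjust.

import dask.dataframe as dd

# Read lazily; blocksize controls how much of the file each partition covers (assumed value).
df = dd.read_csv(
    '/home/mahmoudod/Desktop/to_dict/text1.txt',
    sep=';',
    names=['DATE', 'IMSI', 'WEBSITE', 'LINKUP', 'LINKDOWN', 'COUNT', 'CONNECTION'],
    dtype='object',        # read everything as strings, like dtype='unicode' in the question
    blocksize='64MB',      # assumption: partition size, tune to available memory
)

# Because everything was read as strings, cast the columns that get summed to numeric first.
for col in ['LINKUP', 'LINKDOWN', 'COUNT', 'CONNECTION']:
    df[col] = dd.to_numeric(df[col], errors='coerce')

# Same aggregation spec as the pandas code; nothing is actually read or computed
# until .compute() is called, and dask then works through the file chunk by chunk.
group = df.groupby(['IMSI', 'WEBSITE']).agg({
    'DATE': ['min', 'max'],
    'LINKUP': 'sum',
    'LINKDOWN': 'sum',
    'COUNT': 'max',
    'CONNECTION': 'sum',
}).compute()

group.to_csv('/home/mahmoudod/Desktop/to_dict/output.txt')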
