
I have a CSV file of about 175 million lines (2.86 GB), composed of three columns (ID1, ID2, and val) as shown below:

I need to get the value in column "val" given "ID1" and "ID2". I query this dataframe constantly with varying combinations of ID1 and ID2; each (ID1, ID2) pair is unique in the whole file.

I have tried to use pandas as shown below, but the lookups are taking a lot of time.

def is_av(Qterm, Cterm, df):
    # Look up a single value in a DataFrame indexed on the two ID columns;
    # fall back to 0 when the (Qterm, Cterm) pair is absent.
    try:
        return df.loc[(Qterm, Cterm), 'val']
    except KeyError:
        return 0

Is there a faster way to access CSV values, knowing that each value is located in one single row of the whole file? If not, could you check this function and tell me what might be the cause of the slow processing?

for nc in L:  # ID1
    score = 0.0
    for ni in id_list:  # ID2
        e = is_av(ni, nc, df_g)
        InDegree = df1.loc[ni].values[0]
        SumInMap = df2.loc[nc].values[0]
        score = score + term_score(InDegree, SumInMap, e)  # compute a score
    key = pd_df3.loc[nc].values[0]
    tmt[key] = score
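For reference, the repeated one-at-a-time `.loc` calls above can be batched into a single vectorized lookup. This is a minimal sketch with toy data standing in for the real frame; it assumes `df_g` carries a sorted MultiIndex over the two ID columns (swap the level order if yours differs):

```python
import pandas as pd

# Toy stand-in for the real df_g (assumption: indexed on (ID1, ID2)).
df_g = pd.DataFrame(
    {"ID1": ["a", "a", "b"], "ID2": ["x", "y", "x"], "val": [1.0, 2.0, 3.0]}
).set_index(["ID1", "ID2"]).sort_index()  # sort_index() makes lookups O(log n)

L = ["a", "b"]        # ID1 values
id_list = ["x", "y"]  # ID2 values

# Instead of one .loc call (and one potential KeyError) per pair, reindex over
# all pairs at once; missing combinations become NaN, and fillna(0) reproduces
# the "return 0 on KeyError" fallback.
pairs = pd.MultiIndex.from_product([L, id_list], names=["ID1", "ID2"])
e_all = df_g["val"].reindex(pairs).fillna(0)
```

An unsorted MultiIndex forces pandas to scan; calling `sort_index()` once up front is often the single biggest win for repeated lookups.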
  • If you need to keep looking up stuff in this data, then perhaps porting it to some sort of SQL engine will make the process easier; or try using df.query. Commented May 29, 2022 at 13:04
  • Could the nested loops also be a source of slow processing? Commented May 29, 2022 at 13:09
  • Unless you have repeating ni, nc values, then no. Commented May 29, 2022 at 13:11
  • How about summing the scores: should I save all the corresponding scores for id_list and then do the sum once per loop of id_list? Commented May 29, 2022 at 13:14
  • Could I use multiprocessing, since I think I can convert the lines under the nested loop into a function that takes the two lists and returns a dict containing the scores? Commented May 29, 2022 at 14:11

2 Answers


TL;DR: Use a DBMS (I suggest MySQL or PostgreSQL). Pandas is definitely not suited for this sort of work. Dask is better, but not as good as a traditional DBMS.


The absolute best way of doing this would be to use SQL; consider MySQL or PostgreSQL for starters (both are free and very efficient for your current use case). While Pandas is an incredibly strong library, quick indexed reads are not something it excels at, given that it needs to either load the data into memory or stream over it with little control compared to a DBMS.
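To illustrate the DBMS route without a server, here is a minimal sketch using Python's built-in sqlite3 as a stand-in for MySQL/PostgreSQL (the table name, toy rows, and is_av wrapper are hypothetical; in practice you would bulk-load the 175M-row CSV once, e.g. in chunks):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # use a file path for persistent storage
con.execute("CREATE TABLE scores (ID1 TEXT, ID2 TEXT, val REAL)")
con.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [("a", "x", 1.0), ("a", "y", 2.0), ("b", "x", 3.0)],
)
# A composite index turns the (ID1, ID2) lookup into a B-tree seek, not a scan.
con.execute("CREATE UNIQUE INDEX idx_ids ON scores (ID1, ID2)")

def is_av(id1, id2):
    row = con.execute(
        "SELECT val FROM scores WHERE ID1 = ? AND ID2 = ?", (id1, id2)
    ).fetchone()
    return row[0] if row else 0  # mirror the KeyError -> 0 fallback
```

Once the index exists, each lookup is logarithmic in the table size regardless of how many queries you issue.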

Consider your use case, where you want to reach one specific row: say you're looking for (ID1, ID2) values of (3108, 4813). A character-level reader can skip every row that starts with anything other than 3, then anything other than 31, and so on, then skip any row not starting with 3108,4 (assuming your CSV delimiter is a comma), until it reaches exactly the ID1 and ID2 you're looking for.

Pandas does not allow you to do this (as far as I know; someone can correct this answer if it does). The other answer uses Dask, a library designed to handle data much larger than RAM at scale, but it is not optimized for index management the way DBMSs are. Don't get me wrong, Dask is good, but not for your use case.

Another very basic alternative would be to index your data on ID1 and ID2, store it sorted, and look values up by reading the file directly, skipping lines that do not start with your designated ID1 and then lines that do not match your ID2. The best practice, however, is still a DBMS: caching, read optimization, and many other serious advantages come built in, reducing the I/O read time from your disk.
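The file-reading idea above can be sketched as a plain streaming scan (the lookup helper and toy CSV are hypothetical; note it never builds a DataFrame, so memory stays constant even for a 2.86 GB file):

```python
import csv
import io

# Toy CSV standing in for the real file (assumed layout: ID1,ID2,val).
sample = "ID1,ID2,val\na,x,1.0\na,y,2.0\nb,x,3.0\n"

def lookup(fh, id1, id2):
    """Stream the file and stop at the first matching row.

    O(n) in the worst case, but each line is discarded as soon as the
    IDs fail to match, so memory use stays constant.
    """
    reader = csv.reader(fh)
    next(reader)  # skip the header row
    for row in reader:
        if row[0] == id1 and row[1] == id2:
            return float(row[2])
    return 0  # pair not present

# With a real file you would pass open("data.csv") instead of StringIO.
```

If the file were kept sorted on (ID1, ID2), this linear scan could be replaced with a binary search over byte offsets, which is essentially what a DBMS index does for you.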

You can get started with MySQL here: https://dev.mysql.com/doc/mysql-getting-started/en/

You can get started with PostgreSQL here: https://www.postgresqltutorial.com/postgresql-getting-started/


1 Comment

I created a multi-index for the dataframe which combines ID1 and ID2 as indexes. Is "only look up your data through actual file reading by skipping lines that do not start with your designated ID1, and then skipping lines that do not start with your ID2" faster? If yes, then please show me how to do it.
import os
os.system('pip install dask')  # or run `pip install dask` from the shell

import dask.dataframe as dd

# Lazily read the CSV in partitions instead of loading it all into RAM.
dd_data = dd.read_csv('sample.csv')

# Build the row filter; nothing is actually read yet.
bool_filter_conditions = (dd_data['ID1'] == 'a') & (dd_data['ID2'] == 'b')
dd_result = dd_data[bool_filter_conditions][['val']]

# compute() triggers the scan and returns a plain pandas DataFrame.
dd_output = dd_result.compute()
dd_output
