
I have a CSV file of about 175 million lines (2.86 GB), composed of three columns (ID1, ID2, and val) as shown below:

I need to get the value in column "val" given "ID1" and "ID2". I query this dataframe constantly with varying combinations of ID1 and ID2; each (ID1, ID2) pair is unique in the whole file.

I have tried to use pandas as shown below, but the lookups are taking a lot of time.

def is_av(Qterm, Cterm, df):
    # Look up a single value in a DataFrame indexed on the two ID columns;
    # fall back to 0 when the (Qterm, Cterm) pair is absent.
    try:
        return df.loc[(Qterm, Cterm), 'val']
    except KeyError:
        return 0

Is there a faster way to access CSV values, knowing that each value is located in one single row of the whole file? If not, could you check this function and tell me what might be the cause of the slow processing?

for nc in L:  # ID1
    score = 0.0
    for ni in id_list:  # ID2
        e = is_av(ni, nc, df_g)
        InDegree = df1.loc[ni].values[0]
        SumInMap = df2.loc[nc].values[0]
        score = score + term_score(InDegree, SumInMap, e)  # compute a score
    key = pd_df3.loc[nc].values[0]
    tmt[key] = score
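For reference, the repeated one-at-a-time `.loc` calls above can be batched into a single vectorized lookup. This is a minimal sketch with toy data standing in for the real frame; it assumes `df_g` carries a sorted MultiIndex over the two ID columns (swap the level order if yours differs):

```python
import pandas as pd

# Toy stand-in for the real df_g (assumption: indexed on (ID1, ID2)).
df_g = pd.DataFrame(
    {"ID1": ["a", "a", "b"], "ID2": ["x", "y", "x"], "val": [1.0, 2.0, 3.0]}
).set_index(["ID1", "ID2"]).sort_index()  # sort_index() makes lookups O(log n)

L = ["a", "b"]        # ID1 values
id_list = ["x", "y"]  # ID2 values

# Instead of one .loc call (and one potential KeyError) per pair, reindex over
# all pairs at once; missing combinations become NaN, and fillna(0) reproduces
# the "return 0 on KeyError" fallback.
pairs = pd.MultiIndex.from_product([L, id_list], names=["ID1", "ID2"])
e_all = df_g["val"].reindex(pairs).fillna(0)
```

An unsorted MultiIndex forces pandas to scan; calling `sort_index()` once up front is often the single biggest win for repeated lookups.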
  • If you need to keep looking up stuff in this data, then perhaps porting it to some sort of SQL engine will make the process easier; or try using df.query. Commented May 29, 2022 at 13:04
  • Could the nested loops also be a source of slow processing? Commented May 29, 2022 at 13:09
  • Unless you have repeating ni, nc values, then no. Commented May 29, 2022 at 13:11
  • How about summing the scores: should I save all the corresponding scores for id_list and then do the sum once per loop of id_list? Commented May 29, 2022 at 13:14
  • Could I use multiprocessing, since I think I can convert the lines under the nested loop into a function that takes the two lists and returns a dict containing the scores? Commented May 29, 2022 at 14:11

2 Answers


TL;DR: Use a DBMS (I suggest MySQL or PostgreSQL). Pandas is definitely not suited for this sort of work. Dask is better, but not as good as a traditional DBMS.


The absolute best way of doing this would be to use SQL; consider MySQL or PostgreSQL for starters (both are free and very efficient for your current use case). While Pandas is an incredibly strong library, quick indexed reads are not something it excels at, given that it needs to either load the data into memory or stream over it with little control compared to a DBMS.
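To illustrate the DBMS route without a server, here is a minimal sketch using Python's built-in sqlite3 as a stand-in for MySQL/PostgreSQL (the table name, toy rows, and is_av wrapper are hypothetical; in practice you would bulk-load the 175M-row CSV once, e.g. in chunks):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # use a file path for persistent storage
con.execute("CREATE TABLE scores (ID1 TEXT, ID2 TEXT, val REAL)")
con.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [("a", "x", 1.0), ("a", "y", 2.0), ("b", "x", 3.0)],
)
# A composite index turns the (ID1, ID2) lookup into a B-tree seek, not a scan.
con.execute("CREATE UNIQUE INDEX idx_ids ON scores (ID1, ID2)")

def is_av(id1, id2):
    row = con.execute(
        "SELECT val FROM scores WHERE ID1 = ? AND ID2 = ?", (id1, id2)
    ).fetchone()
    return row[0] if row else 0  # mirror the KeyError -> 0 fallback
```

Once the index exists, each lookup is logarithmic in the table size regardless of how many queries you issue.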

Consider your use case, where you want to reach one specific row: say you're looking for (ID1, ID2) values of (3108, 4813). A character-level reader can skip every row that starts with anything other than 3, then anything other than 31, and so on, then skip any row not starting with 3108,4 (assuming your CSV delimiter is a comma), until it reaches exactly the ID1 and ID2 you're looking for.

Pandas does not allow you to do this (as far as I know; someone can correct this answer if it does). The other answer uses Dask, a library designed to handle data much larger than RAM at scale, but it is not optimized for index management the way DBMSs are. Don't get me wrong, Dask is good, but not for your use case.

Another very basic alternative would be to index your data on ID1 and ID2, store it sorted, and look values up by reading the file directly, skipping lines that do not start with your designated ID1 and then lines that do not match your ID2. The best practice, however, is still a DBMS: caching, read optimization, and many other serious advantages come built in, reducing the I/O read time from your disk.
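The file-reading idea above can be sketched as a plain streaming scan (the lookup helper and toy CSV are hypothetical; note it never builds a DataFrame, so memory stays constant even for a 2.86 GB file):

```python
import csv
import io

# Toy CSV standing in for the real file (assumed layout: ID1,ID2,val).
sample = "ID1,ID2,val\na,x,1.0\na,y,2.0\nb,x,3.0\n"

def lookup(fh, id1, id2):
    """Stream the file and stop at the first matching row.

    O(n) in the worst case, but each line is discarded as soon as the
    IDs fail to match, so memory use stays constant.
    """
    reader = csv.reader(fh)
    next(reader)  # skip the header row
    for row in reader:
        if row[0] == id1 and row[1] == id2:
            return float(row[2])
    return 0  # pair not present

# With a real file you would pass open("data.csv") instead of StringIO.
```

If the file were kept sorted on (ID1, ID2), this linear scan could be replaced with a binary search over byte offsets, which is essentially what a DBMS index does for you.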

You can get started with MySQL here: https://dev.mysql.com/doc/mysql-getting-started/en/

You can get started with PostgreSQL here: https://www.postgresqltutorial.com/postgresql-getting-started/


1 Comment

I created a multi-index for the dataframe which combines ID1 and ID2 as indexes. Is "only look up your data through actual file reading by skipping lines that do not start with your designated ID1, and then skipping lines that do not start with your ID2" faster? If yes, then please show me how to do it.
import os
os.system('pip install dask')  # or run `pip install dask` from the shell

import dask.dataframe as dd

# Lazily read the CSV in partitions instead of loading it all into RAM.
dd_data = dd.read_csv('sample.csv')

# Build the row filter; nothing is actually read yet.
bool_filter_conditions = (dd_data['ID1'] == 'a') & (dd_data['ID2'] == 'b')
dd_result = dd_data[bool_filter_conditions][['val']]

# compute() triggers the scan and returns a plain pandas DataFrame.
dd_output = dd_result.compute()
dd_output
