
I have a dataframe containing millions of rows. Suppose this is the dataframe, named mydataframe:

filename | #insert-1 | #insert-2 | #delete-1 | #delete-2
---------------------------------------------------------
A        |         4 |         4 |         3 |         3
B        |         3 |         5 |         2 |         2
C        |         5 |         5 |         6 |         7
D        |         2 |         2 |         3 |         3
E        |         4 |         5 |         5 |         3
---------------------------------------------------------

I need to separate the files based on whether the number of inserts or deletes differs, and save those rows into a new CSV file named different.csv. The rest of the data, where the insert and delete counts match, should be saved in a separate CSV file called same.csv. In other words, if a file has a different number between #insert-1 and #insert-2, or between #delete-1 and #delete-2, save it in different.csv; otherwise, save it in same.csv.

The expected result for different.csv:

filename | #insert-1 | #insert-2 | #delete-1 | #delete-2
---------------------------------------------------------
B        |         3 |         5 |         2 |         2
C        |         5 |         5 |         6 |         7
E        |         4 |         5 |         5 |         3
---------------------------------------------------------

same.csv:

filename | #insert-1 | #insert-2 | #delete-1 | #delete-2
---------------------------------------------------------
A        |         4 |         4 |         3 |         3
D        |         2 |         2 |         3 |         3
---------------------------------------------------------

This is my code so far:

import csv

# Header row for both output files, taken from the dataframe columns
fields = list(mydataframe.columns)

df_different = []
df_same = []
for row in range(len(mydataframe)):
    ins_1 = mydataframe.iloc[row][1]
    ins_2 = mydataframe.iloc[row][2]
    del_1 = mydataframe.iloc[row][3]
    del_2 = mydataframe.iloc[row][4]
    if (ins_1 != ins_2) or (del_1 != del_2):
        df_different.append(mydataframe.iloc[row])
    else:
        df_same.append(mydataframe.iloc[row])

with open('different.csv', 'w', newline='') as diffcsv:
    writer = csv.writer(diffcsv, delimiter=',')
    writer.writerow(fields)
    for item in df_different:
        writer.writerow(item)

with open('same.csv', 'w', newline='') as samecsv:
    writer = csv.writer(samecsv, delimiter=',')
    writer.writerow(fields)
    for item in df_same:
        writer.writerow(item)

Actually, the code works well, but when the dataset is very large (I have millions of rows) it takes a very long time (more than 3 hours) to run. My question is whether there is a way to make it faster. Thank you.

  • For me this sounds like it could be parallelised easily: just split your dataframe. More on parallelisation: stackoverflow.com/questions/20548628/…, but yeah, the answer from DSM is what you're searching for, I guess. Commented Jul 4, 2018 at 13:20
  • Would be good to see timings on the actual data. @Yusuf - would you be so kind as to time the solutions provided below to demonstrate the gains? Commented Jul 4, 2018 at 14:21

3 Answers


Avoid iterating over rows; that's pretty slow. Instead, vectorize the comparison operation:

same_mask = (df["#insert-1"] == df["#insert-2"]) & (df["#delete-1"] == df["#delete-2"])
df.loc[same_mask].to_csv("same.csv", index=False)
df.loc[~same_mask].to_csv("different.csv", index=False)

For a dataframe of 1M rows, this takes me only a few seconds.
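A self-contained sketch of this approach, using the sample data from the question (column names as shown in its table):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "filename": ["A", "B", "C", "D", "E"],
    "#insert-1": [4, 3, 5, 2, 4],
    "#insert-2": [4, 5, 5, 2, 5],
    "#delete-1": [3, 2, 6, 3, 5],
    "#delete-2": [3, 2, 7, 3, 3],
})

# Build the boolean mask once; ~same_mask is its exact complement,
# so every row lands in exactly one of the two files.
same_mask = (df["#insert-1"] == df["#insert-2"]) & (df["#delete-1"] == df["#delete-2"])
df[same_mask].to_csv("same.csv", index=False)
df[~same_mask].to_csv("different.csv", index=False)
```

The comparisons run as whole-column NumPy operations instead of a Python-level loop, which is where the speedup comes from.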


1 Comment

It saved my time significantly: from more than 3 hours with my own code down to no more than 10 seconds.

One easy thing you can do is provide a sufficiently large buffer to the open function; buffering=64*1024*1024 (a 64 MB buffer) could help.
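A minimal sketch of applying that buffer to the question's write code (the header and rows here are just the example data):

```python
import csv

# Same structure as the question's code, but with a 64 MB write buffer,
# so Python flushes to disk in large chunks instead of many small writes.
with open('different.csv', 'w', buffering=64 * 1024 * 1024, newline='') as diffcsv:
    writer = csv.writer(diffcsv, delimiter=',')
    writer.writerow(['filename', '#insert-1', '#insert-2', '#delete-1', '#delete-2'])
    writer.writerows([['B', 3, 5, 2, 2], ['C', 5, 5, 6, 7]])
```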

Another thing is the iteration over the dataframe: instead of iterating over row numbers, you can iterate directly over the rows, like:

for index, row in mydataframe.iterrows():
    # label-based access; positional indexing like row[1] on a Series
    # is deprecated in recent pandas
    ins_1 = row["#insert-1"]
    ins_2 = row["#insert-2"]
    del_1 = row["#delete-1"]
    del_2 = row["#delete-2"]

I would expect it to be much faster.

2 Comments

If you have only 4 elements you can do ins_1, ins_2, del_1, del_2 = row.
@Jean-FrançoisFabre: You are right. This should be even faster!
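The unpacking idea from the comments can be sketched with itertuples, which is also generally faster than iterrows; note that filename is part of each row, so it is unpacked too (sample data assumed from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "filename": ["A", "B"],
    "#insert-1": [4, 3],
    "#insert-2": [4, 5],
    "#delete-1": [3, 2],
    "#delete-2": [3, 2],
})

rows_same, rows_diff = [], []
# itertuples(index=False) yields plain tuples that unpack directly,
# avoiding the per-row Series construction that makes iterrows slow.
for filename, ins_1, ins_2, del_1, del_2 in df.itertuples(index=False):
    if ins_1 != ins_2 or del_1 != del_2:
        rows_diff.append((filename, ins_1, ins_2, del_1, del_2))
    else:
        rows_same.append((filename, ins_1, ins_2, del_1, del_2))
```

This is still a row-by-row loop, though, so the vectorized answer above it remains the fastest option.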

Filter the DataFrame directly with a boolean condition:

Same dataframe:

same_dataframe = mydataframe[(mydataframe["#insert-1"] == mydataframe["#insert-2"]) & (mydataframe["#delete-1"] == mydataframe["#delete-2"])]

Different dataframe:

different_data = mydataframe[(mydataframe["#insert-1"] != mydataframe["#insert-2"]) | (mydataframe["#delete-1"] != mydataframe["#delete-2"])]

I think it is faster than iteration.

Hope it helps.
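Since the two filters are exact complements, the comparison only needs to be evaluated once; a minimal sketch reusing this answer's variable names, with the question's sample data assumed (~ inverts the boolean mask):

```python
import pandas as pd

mydataframe = pd.DataFrame({
    "filename": ["A", "B", "C", "D", "E"],
    "#insert-1": [4, 3, 5, 2, 4],
    "#insert-2": [4, 5, 5, 2, 5],
    "#delete-1": [3, 2, 6, 3, 5],
    "#delete-2": [3, 2, 7, 3, 3],
})

# Evaluate the condition once, then reuse its complement.
same_mask = (mydataframe["#insert-1"] == mydataframe["#insert-2"]) & \
            (mydataframe["#delete-1"] == mydataframe["#delete-2"])
same_dataframe = mydataframe[same_mask]
different_data = mydataframe[~same_mask]  # complement: every row lands in exactly one frame
```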

Comments
