
I am new to Python and Pandas, so please bear with me. I have a rather simple problem to solve, I suppose, but cannot seem to get it right. I have a csv file that I would like to edit with a pandas dataframe. The data represents flows from home to work locations: each row holds the two locations' ids, names, and lat/lon coordinates, plus a value for the flow.

id_home,name_home,lat_home,lon_home,id_work,work,lat_work,lon_work,value
1001,"Flensburg",54.78879007,9.4459971,1002,"Kiel",54.34189351,10.13048288,695
1001,"Flensburg",54.78879007,9.4459971,1003,"Lübeck, Hansestadt",53.88132436,10.72749774,106
1001,"Flensburg",54.78879007,9.4459971,1004,"Neumünster, Stadt",54.07797524,9.974475148,124
1001,"Flensburg",54.78879007,9.4459971,1051,"Dithmarschen",54.12904835,9.120139194,39
1001,"Flensburg",54.78879007,9.4459971,10,"Schleswig-Holstein",54.212,9.959,7618
1001,"Flensburg",54.78879007,9.4459971,1,"Schleswig-Holstein",54.20896049,9.957114419,7618
1001,"Flensburg",54.78879007,9.4459971,2000,"Hamburg, Freie und Hansestadt",53.57071859,9.943770215,567
1001,"Flensburg",54.78879007,9.4459971,20,"Hamburg",53.575,9.941,567
1001,"Flensburg",54.78879007,9.4459971,2,"Hamburg",53.57071859,9.943770215,567
1003,"Lübeck",53.88132436,10.72749774,100,"Saarland",49.379,6.979,25
1003,"Lübeck",53.88132436,10.72749774,10,"Saarland",54.212,9.959,25
1003,"Lübeck",53.88132436,10.72749774,11000,"Berlin, Stadt",52.50395948,13.39337765,274
1003,"Lübeck",53.88132436,10.72749774,110,"Berlin",52.507,13.405,274
1003,"Lübeck",53.88132436,10.72749774,11,"Berlin",52.50395948,13.39337765,274

I would like to collapse each run of adjacent rows that share the same value, keeping only the last row of the run (the one where id_work is one or two digits). All other rows of the run should be deleted. How can I achieve this? What I essentially need is the following output:

id_home,name_home,lat_home,lon_home,id_work,work,lat_work,lon_work,value
1001,"Flensburg",54.78879007,9.4459971,1002,"Kiel",54.34189351,10.13048288,695
1001,"Flensburg",54.78879007,9.4459971,1003,"Lübeck, Hansestadt",53.88132436,10.72749774,106
1001,"Flensburg",54.78879007,9.4459971,1004,"Neumünster, Stadt",54.07797524,9.974475148,124
1001,"Flensburg",54.78879007,9.4459971,1051,"Dithmarschen",54.12904835,9.120139194,39
1001,"Flensburg",54.78879007,9.4459971,1,"Schleswig-Holstein",54.20896049,9.957114419,7618
1001,"Flensburg",54.78879007,9.4459971,2,"Hamburg",53.57071859,9.943770215,567
1003,"Lübeck",53.88132436,10.72749774,10,"Saarland",54.212,9.959,25
1003,"Lübeck",53.88132436,10.72749774,11,"Berlin",52.50395948,13.39337765,274

Super thankful for any help!
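For reproducibility, the sample can be loaded straight from a string (a sketch, abbreviated to three of the rows above; in practice you would call `pd.read_csv` on the actual file):

```python
import pandas as pd
from io import StringIO

# Abbreviated sample data from the question.
csv_text = """id_home,name_home,lat_home,lon_home,id_work,work,lat_work,lon_work,value
1001,"Flensburg",54.78879007,9.4459971,1002,"Kiel",54.34189351,10.13048288,695
1001,"Flensburg",54.78879007,9.4459971,10,"Schleswig-Holstein",54.212,9.959,7618
1001,"Flensburg",54.78879007,9.4459971,1,"Schleswig-Holstein",54.20896049,9.957114419,7618
"""

df = pd.read_csv(StringIO(csv_text))
print(df[['id_work', 'value']])
```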

  • Will the duplicates always be adjacent? Is the one you want to keep always the last one? Commented May 5, 2016 at 9:51
  • Yes, they are always adjacent. In some cases there are up to four duplicates with 4,3,2 and 1-digit ids. I would like to keep only the last (i.e. 1-digit) row. Commented May 5, 2016 at 10:39
  • based on your revised sample data df.drop_duplicates('value', keep='last') should work Commented May 5, 2016 at 11:13
  • Unfortunately not, since it removes all duplicate rows (except the last one) across the whole data set that share the same entry in the value column. I do not need to delete all duplicate rows in the data set, only those whose duplicate values are adjacent (runs of at least two), keeping the last row of each run. Commented May 5, 2016 at 11:41

2 Answers


drop_duplicates has a keep parameter; set it to 'last':
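(The frame used in this answer isn't shown; a toy frame like the following, an assumption chosen to match the outputs below, reproduces them:)

```python
import pandas as pd

# Hypothetical toy frame (the answer's original isn't shown).
# The run of 567s mimics adjacent duplicates; only the last row
# of that run has a one-digit id.
df = pd.DataFrame({
    'id':    [345, 12, 1345, 234, 32, 2],
    'name':  ['name1', 'name2', 'name3', 'name4', 'name5', 'name6'],
    'value': [456, 220, 567, 567, 567, 567],
})

print(df.drop_duplicates(subset=['value'], keep='last'))
```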

In [188]:
df.drop_duplicates(subset=['value'], keep='last')

Out[188]:
    id   name  value
0  345  name1    456
1   12  name2    220
5    2  name6    567

Actually I think the following is what you want:

In [197]:
df.drop(df.index[(df['value'].isin(df.loc[df['value'].duplicated(), 'value'].unique())) & (df['id'].astype(str).str.len() != 1)])

Out[197]:
    id   name  value
0  345  name1    456
1   12  name2    220
5    2  name6    567

Here we drop the row labels that have duplicated values and whose 'id' length is not 1. A breakdown:

In [198]:
df['value'].duplicated()

Out[198]:
0    False
1    False
2    False
3     True
4     True
5     True
Name: value, dtype: bool

In [199]:
df.loc[df['value'].duplicated(), 'value']

Out[199]:
3    567
4    567
5    567
Name: value, dtype: int64

In [200]:
df['value'].isin(df.loc[df['value'].duplicated(), 'value'].unique())

Out[200]:
0    False
1    False
2     True
3     True
4     True
5     True
Name: value, dtype: bool

In [201]:

(df['value'].isin(df.loc[df['value'].duplicated(), 'value'].unique())) & (df['id'].astype(str).str.len() != 1)

Out[201]:
0    False
1    False
2     True
3     True
4     True
5    False
dtype: bool

In [202]:
df.index[(df['value'].isin(df.loc[df['value'].duplicated(), 'value'].unique())) & (df['id'].astype(str).str.len() != 1)]

Out[202]:
Int64Index([2, 3, 4], dtype='int64')

So the above uses duplicated to flag the duplicated values, unique to reduce them to the distinct duplicated values, and isin to test for membership; we cast the 'id' column to str so we can test its length with str.len, and finally use the boolean mask to select the index labels to drop.
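The same logic reads more clearly with named intermediates (a sketch, again using an assumed toy frame since the answer's isn't shown):

```python
import pandas as pd

# Assumed toy frame matching the outputs above.
df = pd.DataFrame({
    'id':    [345, 12, 1345, 234, 32, 2],
    'name':  ['name1', 'name2', 'name3', 'name4', 'name5', 'name6'],
    'value': [456, 220, 567, 567, 567, 567],
})

dup_values = df.loc[df['value'].duplicated(), 'value'].unique()  # distinct duplicated values
has_dup_value = df['value'].isin(dup_values)                     # rows sharing a duplicated value
one_digit_id = df['id'].astype(str).str.len() == 1               # rows with a one-digit id
result = df[~(has_dup_value & ~one_digit_id)]
print(result)
```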


4 Comments

Thanks! However, this seems to delete all duplicates in the whole data set. I would like to delete only the duplicates that are adjacent to each other: if at least two duplicates are adjacent, delete everything except the row where the id length is 1.
You should specify this in your question as it matters
Yes, sorry for that!
Additionally you should include data that is representative of your real data

Let's simplify this to the case where you have a single array:

arr = np.array([1, 1, 1, 2, 0, 0, 1, 1, 2, 0, 0, 0, 0, 2, 1, 0, 0, 1, 1, 1])

Now let's generate an array of bools which shows us the places where the values change:

arr[1:] != arr[:-1]

That tells us which values we want to keep: the values that differ from the one after them. But it leaves out the last value, which should always be included, so:

mask = np.hstack((arr[1:] != arr[:-1], True))

Now, arr[mask] gives us:

array([1, 2, 0, 1, 2, 0, 2, 1, 0, 1])

And in case you don't believe the last occurrence of each element was selected, you can check mask.nonzero() to get the indexes numerically:

array([ 2,  3,  5,  7,  8, 12, 13, 14, 16, 19])

Now that you know how to generate the mask for a single column, you can simply apply it to your entire dataframe as df[mask].

2 Comments

This is really what I am looking for! Still cannot figure out how to to apply it to my whole dataframe though? Could you maybe give me another hint on the next step?
df[mask] applies it to the whole dataframe.
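Applied to the question's data, the whole recipe looks like this (a sketch, reduced to the id_work and value columns, which are the only ones the mask depends on):

```python
import numpy as np
import pandas as pd

# The question's nine Flensburg rows, reduced to the relevant columns.
df = pd.DataFrame({
    'id_work': [1002, 1003, 1004, 1051, 10, 1, 2000, 20, 2],
    'value':   [695, 106, 124, 39, 7618, 7618, 567, 567, 567],
})

arr = df['value'].to_numpy()
# Keep a row when its value differs from the next row's value;
# always keep the final row.
mask = np.hstack((arr[1:] != arr[:-1], True))
result = df[mask]
print(result['id_work'].tolist())  # → [1002, 1003, 1004, 1051, 1, 2]
```

This keeps exactly the last row of each run of adjacent equal values, matching the desired output in the question.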
