
I am new to Python and Pandas, so please bear with me. I have a rather simple problem to solve, I suppose, but cannot seem to get it right. I have a csv file that I would like to edit with a pandas dataframe. The data represents flows from home to work locations: each row holds the two locations' ids, names, and lat/lon coordinates, plus a value for the flow.

id_home,name_home,lat_home,lon_home,id_work,work,lat_work,lon_work,value
1001,"Flensburg",54.78879007,9.4459971,1002,"Kiel",54.34189351,10.13048288,695
1001,"Flensburg",54.78879007,9.4459971,1003,"Lübeck, Hansestadt",53.88132436,10.72749774,106
1001,"Flensburg",54.78879007,9.4459971,1004,"Neumünster, Stadt",54.07797524,9.974475148,124
1001,"Flensburg",54.78879007,9.4459971,1051,"Dithmarschen",54.12904835,9.120139194,39
1001,"Flensburg",54.78879007,9.4459971,10,"Schleswig-Holstein",54.212,9.959,7618
1001,"Flensburg",54.78879007,9.4459971,1,"Schleswig-Holstein",54.20896049,9.957114419,7618
1001,"Flensburg",54.78879007,9.4459971,2000,"Hamburg, Freie und Hansestadt",53.57071859,9.943770215,567
1001,"Flensburg",54.78879007,9.4459971,20,"Hamburg",53.575,9.941,567
1001,"Flensburg",54.78879007,9.4459971,2,"Hamburg",53.57071859,9.943770215,567
1003,"Lübeck",53.88132436,10.72749774,100,"Saarland",49.379,6.979,25
1003,"Lübeck",53.88132436,10.72749774,10,"Saarland",54.212,9.959,25
1003,"Lübeck",53.88132436,10.72749774,11000,"Berlin, Stadt",52.50395948,13.39337765,274
1003,"Lübeck",53.88132436,10.72749774,110,"Berlin",52.507,13.405,274
1003,"Lübeck",53.88132436,10.72749774,11,"Berlin",52.50395948,13.39337765,274

I would like to collapse each run of adjacent rows that share the same value, keeping only the last row of the run (the one where id_work is one or two digits). All other rows of the run should be deleted. How can I achieve this? What I essentially need is the following output:

id_home,name_home,lat_home,lon_home,id_work,work,lat_work,lon_work,value
1001,"Flensburg",54.78879007,9.4459971,1002,"Kiel",54.34189351,10.13048288,695
1001,"Flensburg",54.78879007,9.4459971,1003,"Lübeck, Hansestadt",53.88132436,10.72749774,106
1001,"Flensburg",54.78879007,9.4459971,1004,"Neumünster, Stadt",54.07797524,9.974475148,124
1001,"Flensburg",54.78879007,9.4459971,1051,"Dithmarschen",54.12904835,9.120139194,39
1001,"Flensburg",54.78879007,9.4459971,1,"Schleswig-Holstein",54.20896049,9.957114419,7618
1001,"Flensburg",54.78879007,9.4459971,2,"Hamburg",53.57071859,9.943770215,567
1003,"Lübeck",53.88132436,10.72749774,10,"Saarland",54.212,9.959,25
1003,"Lübeck",53.88132436,10.72749774,11,"Berlin",52.50395948,13.39337765,274

Super thankful for any help!
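For reproducibility, the sample can be loaded straight from a string (a sketch, abbreviated to three of the rows above; in practice you would call `pd.read_csv` on the actual file):

```python
import pandas as pd
from io import StringIO

# Abbreviated sample data from the question.
csv_text = """id_home,name_home,lat_home,lon_home,id_work,work,lat_work,lon_work,value
1001,"Flensburg",54.78879007,9.4459971,1002,"Kiel",54.34189351,10.13048288,695
1001,"Flensburg",54.78879007,9.4459971,10,"Schleswig-Holstein",54.212,9.959,7618
1001,"Flensburg",54.78879007,9.4459971,1,"Schleswig-Holstein",54.20896049,9.957114419,7618
"""

df = pd.read_csv(StringIO(csv_text))
print(df[['id_work', 'value']])
```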

  • Will the duplicates always be adjacent? Is the one you want to keep always the last one? Commented May 5, 2016 at 9:51
  • Yes, they are always adjacent. In some cases there are up to four duplicates with 4,3,2 and 1-digit ids. I would like to keep only the last (i.e. 1-digit) row. Commented May 5, 2016 at 10:39
  • based on your revised sample data df.drop_duplicates('value', keep='last') should work Commented May 5, 2016 at 11:13
  • Unfortunately not, since it removes all duplicate rows (except the last one) across the whole data set that share the same entry in the value column. I do not need to delete all duplicate rows in the data set, only those whose duplicate values are adjacent (runs of at least two), keeping the last row of each run. Commented May 5, 2016 at 11:41

2 Answers


drop_duplicates has a keep parameter; set it to 'last':
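(The frame used in this answer isn't shown; a toy frame like the following, an assumption chosen to match the outputs below, reproduces them:)

```python
import pandas as pd

# Hypothetical toy frame (the answer's original isn't shown).
# The run of 567s mimics adjacent duplicates; only the last row
# of that run has a one-digit id.
df = pd.DataFrame({
    'id':    [345, 12, 1345, 234, 32, 2],
    'name':  ['name1', 'name2', 'name3', 'name4', 'name5', 'name6'],
    'value': [456, 220, 567, 567, 567, 567],
})

print(df.drop_duplicates(subset=['value'], keep='last'))
```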

In [188]:
df.drop_duplicates(subset=['value'], keep='last')

Out[188]:
    id   name  value
0  345  name1    456
1   12  name2    220
5    2  name6    567

Actually I think the following is what you want:

In [197]:
df.drop(df.index[(df['value'].isin(df.loc[df['value'].duplicated(), 'value'].unique())) & (df['id'].astype(str).str.len() != 1)])

Out[197]:
    id   name  value
0  345  name1    456
1   12  name2    220
5    2  name6    567

Here we drop the row labels that have duplicated values and whose 'id' length is not 1. A breakdown:

In [198]:
df['value'].duplicated()

Out[198]:
0    False
1    False
2    False
3     True
4     True
5     True
Name: value, dtype: bool

In [199]:
df.loc[df['value'].duplicated(), 'value']

Out[199]:
3    567
4    567
5    567
Name: value, dtype: int64

In [200]:
df['value'].isin(df.loc[df['value'].duplicated(), 'value'].unique())

Out[200]:
0    False
1    False
2     True
3     True
4     True
5     True
Name: value, dtype: bool

In [201]:

(df['value'].isin(df.loc[df['value'].duplicated(), 'value'].unique())) & (df['id'].astype(str).str.len() != 1)

Out[201]:
0    False
1    False
2     True
3     True
4     True
5    False
dtype: bool

In [202]:
df.index[(df['value'].isin(df.loc[df['value'].duplicated(), 'value'].unique())) & (df['id'].astype(str).str.len() != 1)]

Out[202]:
Int64Index([2, 3, 4], dtype='int64')

So the above uses duplicated to flag the duplicated values, unique to reduce them to the distinct duplicated values, and isin to test for membership; we cast the 'id' column to str so we can test its length with str.len, and finally use the boolean mask to select the index labels to drop.
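The same logic reads more clearly with named intermediates (a sketch, again using an assumed toy frame since the answer's isn't shown):

```python
import pandas as pd

# Assumed toy frame matching the outputs above.
df = pd.DataFrame({
    'id':    [345, 12, 1345, 234, 32, 2],
    'name':  ['name1', 'name2', 'name3', 'name4', 'name5', 'name6'],
    'value': [456, 220, 567, 567, 567, 567],
})

dup_values = df.loc[df['value'].duplicated(), 'value'].unique()  # distinct duplicated values
has_dup_value = df['value'].isin(dup_values)                     # rows sharing a duplicated value
one_digit_id = df['id'].astype(str).str.len() == 1               # rows with a one-digit id
result = df[~(has_dup_value & ~one_digit_id)]
print(result)
```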


4 Comments

Thanks! However, this seems to delete all duplicates in the whole data set. I would like to delete only the duplicates that are adjacent to each other: if at least two duplicates are adjacent, delete everything except the row where the id length is 1.
You should specify this in your question as it matters
Yes, sorry for that!
Additionally you should include data that is representative of your real data

Let's simplify this to the case where you have a single array:

arr = np.array([1, 1, 1, 2, 0, 0, 1, 1, 2, 0, 0, 0, 0, 2, 1, 0, 0, 1, 1, 1])

Now let's generate an array of bools which shows us the places where the values change:

arr[1:] != arr[:-1]

That tells us which values we want to keep: the values that differ from the one after them. But it leaves out the last value, which should always be included, so:

mask = np.hstack((arr[1:] != arr[:-1], True))

Now, arr[mask] gives us:

array([1, 2, 0, 1, 2, 0, 2, 1, 0, 1])

And in case you don't believe the last occurrence of each element was selected, you can check mask.nonzero() to get the indexes numerically:

array([ 2,  3,  5,  7,  8, 12, 13, 14, 16, 19])

Now that you know how to generate the mask for a single column, you can simply apply it to your entire dataframe as df[mask].

2 Comments

This is really what I am looking for! Still cannot figure out how to to apply it to my whole dataframe though? Could you maybe give me another hint on the next step?
df[mask] applies it to the whole dataframe.
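Applied to the question's data, the whole recipe looks like this (a sketch, reduced to the id_work and value columns, which are the only ones the mask depends on):

```python
import numpy as np
import pandas as pd

# The question's nine Flensburg rows, reduced to the relevant columns.
df = pd.DataFrame({
    'id_work': [1002, 1003, 1004, 1051, 10, 1, 2000, 20, 2],
    'value':   [695, 106, 124, 39, 7618, 7618, 567, 567, 567],
})

arr = df['value'].to_numpy()
# Keep a row when its value differs from the next row's value;
# always keep the final row.
mask = np.hstack((arr[1:] != arr[:-1], True))
result = df[mask]
print(result['id_work'].tolist())  # → [1002, 1003, 1004, 1051, 1, 2]
```

This keeps exactly the last row of each run of adjacent equal values, matching the desired output in the question.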
