Modifying Python DataFrame rows with duplicates

Question

I need to modify a Python pandas DataFrame. Consider

Id    Col
1     a
2     a
3     p
3     sp
4     n
4     sn
5     b
6     c

is my dataframe. Ids 3 and 4 appear twice. For the rows with Id 3, Col has the values p and sp. Similarly, for Id 4 we see the values n and sn in Col. I want to remove the row with Col equal to p for Id 3 and the row with Col equal to n for Id 4. So I want my dataframe to look like

Id    Col
1     a
2     a
3     sp
4     sn
5     b
6     c

So basically, here is what I need to do:

Check if there are any duplicates. Let's assume that duplicates only occur in pairs, not in triples or more.
If the values in Col are the same, keep only one such row.
If the values in Col are p and sp, keep the row that has sp.
If the values in Col are n and sn, keep the row that has sn.

How can I achieve this?

EDIT

Actually, ideally I would need to check the values before deciding which row to drop. Let's say I know that there are multiple rows with Id 3 and the corresponding values of Col are

p
sp

Now I want to collect these values in a list as

['p','sp']

and send it to a function like

def giveMeBest(paramList):
    bestVal = ""
    for param in paramList:
        # some logic goes here
        pass
    return bestVal

Then I only keep the row which has the value bestVal in Col. Note that this will also allow me to handle any number of duplicates.

EDIT2

Thanks rurp for the answer. I just have one last request. I am trying to clean up my data frame by doing the following

for x in result:
    # getVal returns the value that I want to keep in my dataframe.
    # Note that x[1] is the array of duplicate values in Col.
    resVal = getVal(x[1])

    resData = resData[(resData.Id == x[0]) & (resData.Col != resVal)]

But this still does not delete the rows:

print(resData[resData.Id==3])

Id Col
3  p
3  sp

I even tried

resData = resData.drop(resData[(resData.Id == int(x[0])) & (resData.Col!=resSent)].index)

But it still shows the duplicate row.

How can I drop multiple rows from my data frame?

Solved dropping rows

Here is how I did it:

idx = []
for x in result:

    resVal = getVal(x[1])

    idx.append(resData[(resData.Id == x[0]) & (resData.Col!= resVal)].index.tolist())

and then, just

for j in idx:
    resData = resData.drop(j)

Andy Hayden · Accepted Answer · 2015-10-02 16:33:11Z

2

Assuming that the "s" values always come last, you could use drop_duplicates:

In [11]: df.drop_duplicates(keep="last", subset=["Col"])
Out[11]:
   Id Col
1   2   a
2   3   p
3   3  sp
4   4   n
5   4  sn
6   5   b
7   6   c

If they are not, sort them so that they are. The easiest way is to pull out an is_s column (e.g. with .str.startswith("s")) and sort by that before dropping duplicates.
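The sort-then-drop idea could be sketched as follows. This is only an illustration under stated assumptions: the sample data is reconstructed from the question, the helper column name `is_s` follows the suggestion above, and duplicates are identified on Id (rather than Col) because the question asks for one row per Id.

```python
import pandas as pd

df = pd.DataFrame({
    "Id": [1, 2, 3, 3, 4, 4, 5, 6],
    "Col": ["a", "a", "p", "sp", "n", "sn", "b", "c"],
})

# Helper column: True where Col starts with "s" (the rows we prefer to keep).
df["is_s"] = df["Col"].str.startswith("s")

# Sort so that, within each Id, the "s" row comes last (False sorts before True),
# then keep only the last row per Id and discard the helper column.
deduped = (
    df.sort_values(["Id", "is_s"])
      .drop_duplicates(subset="Id", keep="last")
      .drop(columns="is_s")
      .reset_index(drop=True)
)
print(deduped)
```

This relies on the preferred value being recognizable by its "s" prefix; any other preference rule would need a different helper column.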


1 Comment

AbtPst Over a year ago

Thanks! Really nice suggestion. Please see the edit, as I think I need something more specific.

rurp · Accepted Answer · 2015-10-02 19:37:36Z

1

You could create a list of tuples containing each 'Id' value that occurs more than once and a list of the corresponding values in 'Col'. Those values could then be passed in to your function to determine which to remove.

import pandas as pd

ids = [1,2,3,3,4,4,5,6]
cols = ['a', 'a', 'p', 'sp', 'n', 'sn', 'b', 'c']

df = pd.DataFrame({'Id':ids, 'Col':cols})

counts = df['Id'].value_counts()
values = [x for x in counts.index if counts[x]>1]
result = []
for e in values:
    vals = df[df['Id'] == e].Col.value_counts().index.values
    result.append((e, vals))

This gives you

for n in result:
    print(n)

(4, array(['n', 'sn'], dtype=object))
(3, array(['sp', 'p'], dtype=object))
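As an alternative sketch (not part of the approach above), the same (Id, values) pairs can be collected with groupby. One difference to be aware of: this version keeps every value in its original row order, including repeats, whereas value_counts returns only the distinct values.

```python
import pandas as pd

df = pd.DataFrame({
    "Id": [1, 2, 3, 3, 4, 4, 5, 6],
    "Col": ["a", "a", "p", "sp", "n", "sn", "b", "c"],
})

# Collect the Col values for each Id into a list.
grouped = df.groupby("Id")["Col"].agg(list)

# Keep only the Ids that occur more than once.
result = [(dup_id, values) for dup_id, values in grouped.items() if len(values) > 1]

for pair in result:
    print(pair)
```

Each tuple can then be passed to a selection function in the same way as the pairs built by the loop above.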

Hopefully that helps.


2 Comments

AbtPst Over a year ago

Perfect! Exactly what I needed :) Now I can run my logic and clean up the data. Thanks a lot :)

AbtPst Over a year ago

Lastly, what would be an efficient way to delete the rows now? I guess a double loop is the most straightforward.
