I need to modify a Python pandas DataFrame. Consider

Id    Col
1     a
2     a
3     p
3     sp
4     n
4     sn
5     b
6     c

Here, Ids 3 and 4 each appear twice. For the rows with Id 3, Col has the values p and sp. Similarly, for Id 4, Col has the values n and sn. I want to remove the row with Col p for Id 3 and the row with Col n for Id 4, so that my dataframe looks like

Id    Col
1     a
2     a
3     sp
4     sn
5     b
6     c

So basically, here is what I need to do:

  1. Check if there are any duplicate Ids. Let's assume that duplicates only occur in pairs, not in triples or more.

  2. If the values of Col are the same, keep only one such row.

  3. If the values in Col are p and sp, keep the row that has sp.
  4. If the values in Col are n and sn, keep the row that has sn.

How can I achieve this?

EDIT

Actually, ideally I would need to inspect the values before deciding which row to drop. Let's say I know that there are multiple rows with Id 3 and the corresponding values of Col are

p
sp

Now I want to collect these values in a list as

['p','sp']

and send it to a function like

def giveMeBest(paramList):
    bestVal = ""
    for param in paramList:
        # some logic goes here
        pass
    return bestVal

Then I keep only the row which has the value bestVal in Col. Note that this will also allow me to handle any number of duplicates.

EDIT2

Thanks rurp for the answer. I have just one last request. I am trying to clean up my data frame by doing the following

for x in result:
    # getVal returns the appropriate value that I want to be set in
    # my dataframe. Note that x[1] denotes the array of duplicate values in Col.
    resVal = getVal(x[1])

    resData = resData[(resData.Id == x[0]) & (resData.Col != resVal)]

But this still does not delete the rows:

print(resData[resData.Id==3])

Id Col
3  p
3  sp

I even tried

resData = resData.drop(resData[(resData.Id == int(x[0])) & (resData.Col!=resSent)].index)

But it still shows the duplicate row.

How can I drop multiple rows from my data frame?

Solved: dropping rows

Here is how I did it:

idx = []
for x in result:

    resVal = getVal(x[1])

    idx.append(resData[(resData.Id == x[0]) & (resData.Col!= resVal)].index.tolist())

and then, just

for j in idx:
    resData = resData.drop(j)
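
As a possible alternative to collecting index lists and dropping them one by one, the rows to remove can be accumulated in a single boolean mask and filtered out in one step. This is only a sketch: the sample data mirrors the question, `getVal` here is a hypothetical stand-in that simply prefers the longest string, and `result` is hard-coded in the shape produced by the answer below. Note that indexing with the mask itself would select the rows to be dropped, so the mask is negated before filtering.

```python
import pandas as pd

# Sample data matching the question
resData = pd.DataFrame({
    "Id": [1, 2, 3, 3, 4, 4, 5, 6],
    "Col": ["a", "a", "p", "sp", "n", "sn", "b", "c"],
})

def getVal(values):
    # Hypothetical stand-in for the real selection logic:
    # prefer the longest string, so "sp" beats "p" and "sn" beats "n".
    return max(values, key=len)

# (Id, duplicate values) pairs, hard-coded here for illustration
result = [(3, ["p", "sp"]), (4, ["n", "sn"])]

# Start with "drop nothing" and mark the unwanted rows for each duplicated Id
drop_mask = pd.Series(False, index=resData.index)
for dup_id, values in result:
    best = getVal(values)
    drop_mask |= (resData["Id"] == dup_id) & (resData["Col"] != best)

# Keep every row that was not marked
cleaned = resData[~drop_mask].reset_index(drop=True)
print(cleaned)
```

Because the frame is filtered only once, each comparison in the loop runs against the original, unmodified rows.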

2 Answers

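Assuming that the s-prefixed values (sp, sn) always come last within each Id, you could use drop_duplicates on the Id column (keep="last" replaces the older take_last=True argument):

In [11]: df.drop_duplicates(subset=["Id"], keep="last")
Out[11]:
   Id Col
0   1   a
1   2   a
3   3  sp
5   4  sn
6   5   b
7   6   c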

If they are not, sort them so that they are. The easiest way is to derive an is_s column (e.g. with .str.startswith("s")) and sort by it before dropping duplicates.
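
The sorting idea can be sketched as follows. This is a minimal illustration, not necessarily what the answer's author had in mind: the sample frame deliberately puts sp before p to show that the sort fixes the order, the helper column name `is_s` is taken from the text above, and duplicates are dropped per `Id` on the assumption that this is the grouping the question cares about.

```python
import pandas as pd

# Sample data where "sp" appears before "p" for Id 3
df = pd.DataFrame({
    "Id": [1, 2, 3, 3, 4, 4, 5, 6],
    "Col": ["a", "a", "sp", "p", "n", "sn", "b", "c"],
})

deduped = (
    df.assign(is_s=df["Col"].str.startswith("s"))  # True for "sp" and "sn"
      .sort_values(["Id", "is_s"])                 # False sorts before True within each Id
      .drop_duplicates(subset="Id", keep="last")   # keep the s-prefixed row when present
      .drop(columns="is_s")
      .reset_index(drop=True)
)
print(deduped)
```

Any value that happens to start with "s" is treated as preferred here, so the check would need tightening if other such values can occur in Col.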

1 Comment

Thanks! Really nice suggestion. Please see the edit, as I think I need something more specific.

You could create a list of tuples containing each 'Id' value that occurs more than once and a list of the corresponding values in 'Col'. Those values could then be passed in to your function to determine which to remove.

import pandas as pd

ids = [1,2,3,3,4,4,5,6]
cols = ['a', 'a', 'p', 'sp', 'n', 'sn', 'b', 'c']

df = pd.DataFrame({'Id':ids, 'Col':cols})

counts = df['Id'].value_counts()
values = [x for x in counts.index if counts[x]>1]
result = []
for e in values:
    vals = df[df['Id'] == e].Col.value_counts().index.values
    result.append((e, vals))

This gives you

for n in result:
    print(n)

(4, array(['n', 'sn'], dtype=object))
(3, array(['sp', 'p'], dtype=object))
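
If the goal is to go straight from the per-Id value lists to a cleaned frame, one option is to let groupby build the lists and apply the selection function to each. The following is only a sketch under assumptions: `give_me_best` is a hypothetical rule (prefer a value starting with "s", otherwise take the first one) standing in for whatever logic the asker's function uses.

```python
import pandas as pd

def give_me_best(param_list):
    # Hypothetical selection rule: prefer a value starting with "s",
    # otherwise fall back to the first value in the list.
    for param in param_list:
        if param.startswith("s"):
            return param
    return param_list[0]

df = pd.DataFrame({
    "Id": [1, 2, 3, 3, 4, 4, 5, 6],
    "Col": ["a", "a", "p", "sp", "n", "sn", "b", "c"],
})

# One best value per Id, computed from the list of that Id's Col values
best_per_id = df.groupby("Id")["Col"].agg(lambda col: give_me_best(col.tolist()))

# Keep rows whose Col equals the best value for their Id;
# drop_duplicates() collapses identical (Id, Col) rows into one
cleaned = (
    df[df["Col"] == df["Id"].map(best_per_id)]
    .drop_duplicates()
    .reset_index(drop=True)
)
print(cleaned)
```

This works for any number of duplicates per Id, since the function receives the full list of values for each group.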

Hopefully that helps.

2 Comments

Perfect! Exactly what I needed :) Now I can run my logic and clean up the data. Thanks a lot :)
Lastly, what would be an efficient way to delete the rows now? I guess a double loop is the most straightforward.
