I have a pandas data frame:

>>> df_freq = pd.DataFrame([["Z11", "Z11", "X11"], ["Y11","",""], ["Z11","Z11",""]], columns=list('ABC'))

>>> df_freq
    A   B   C
0   Z11 Z11 X11
1   Y11     
2   Z11 Z11 

I want to make sure each row contains unique values only; removed values can be replaced with zero or an empty string. The frame should therefore become:

    A   B   C
0   Z11 0   X11
1   Y11     
2   Z11 0   

My data frame is large, with hundreds of columns and thousands of rows. The goal is to count the unique values in the data frame, which I do by converting it to a matrix and applying

>>> np.unique(mat.astype(str), return_counts=True)

But in some rows the same value occurs more than once, and I want to remove those repeats before applying np.unique(), keeping only the unique values in each row.
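To illustrate why the row-wise repeats matter for the count, here is a sketch of the counting step from the question applied to the sample frame as-is: Z11 is counted four times even though it appears in only two rows.

```python
import numpy as np
import pandas as pd

df_freq = pd.DataFrame([["Z11", "Z11", "X11"], ["Y11", "", ""], ["Z11", "Z11", ""]],
                       columns=list('ABC'))
vals, counts = np.unique(df_freq.values.astype(str), return_counts=True)
print(dict(zip(vals, counts)))  # 'Z11' is counted 4 times, not 2
```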

3 Answers


Use a combination of astype(bool) and duplicated:

mask = df_freq.apply(pd.Series.duplicated, 1) & df_freq.astype(bool)

df_freq.mask(mask, 0)

     A  B    C
0  Z11  0  X11
1  Y11        
2  Z11  0     
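Putting this together with the counting step from the question (a sketch): the `duplicated` term flags cells that repeat an earlier value in the same row, and the `astype(bool)` term keeps empty strings (which are falsy) out of the mask.

```python
import numpy as np
import pandas as pd

df_freq = pd.DataFrame([["Z11", "Z11", "X11"], ["Y11", "", ""], ["Z11", "Z11", ""]],
                       columns=list('ABC'))
# Flag within-row repeats, but only where the cell is non-empty.
mask = df_freq.apply(pd.Series.duplicated, axis=1) & df_freq.astype(bool)
deduped = df_freq.mask(mask, 0)
vals, counts = np.unique(deduped.values.astype(str), return_counts=True)
print(dict(zip(vals, counts)))  # 'Z11' now counted twice, once per row
```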

4 Comments

Could you convert it to an in-situ change? I'd like to time it.
Hey piR, what's the reason for df_freq.astype(bool)? The code seems to work fine without it.
@moondra OP had blanks '' in the dataframe, and multiple ones. I didn't want to put zeros where the second, third, etc. blanks showed up. I know that '' evaluates as False, so I use df_freq.astype(bool) to create a mask of the cells that are not blank. That, in conjunction with the duplicate mask, allows me to place the zeros accurately. Otherwise, I'd end up with an extra zero in df_freq.loc[1, 'C'].
Ah! You're right. I didn't even catch that extra 0 at df_freq.loc[1, 'C']
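A minimal check of the difference the comments describe, assuming the same sample frame: without the `astype(bool)` term, the second blank in row 1 is itself flagged as a duplicate and replaced.

```python
import pandas as pd

df_freq = pd.DataFrame([["Z11", "Z11", "X11"], ["Y11", "", ""], ["Z11", "Z11", ""]],
                       columns=list('ABC'))
# Duplicate mask alone: the repeated '' in row 1 counts as a duplicate,
# leaving a stray 0 at df_freq.loc[1, 'C'].
without_bool = df_freq.mask(df_freq.apply(pd.Series.duplicated, axis=1), 0)
print(without_bool.loc[1, 'C'])  # 0, not ''
```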

Here's a vectorized NumPy approach -

def reset_rowwise_dups(df):
    n = df.shape[0]
    row_idx = np.arange(n)[:,None]

    a = df.values
    idx = np.argsort(a,1)
    sorted_a = a[row_idx, idx]
    idx_reversed = idx.argsort(1)
    sorted_a_dupmask = sorted_a[:,1:] == sorted_a[:,:-1]
    dup_mask = np.column_stack((np.zeros(n,dtype=bool), sorted_a_dupmask))
    final_mask = dup_mask[row_idx, idx_reversed] & (a != '' )
    a[final_mask] = 0
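The inverse-argsort step is the subtle part: sorting each row groups equal values so duplicates can be found by comparing neighbors, and `idx.argsort(1)` is the inverse permutation that maps the resulting mask back to the original column order. A minimal check of that identity on a single row:

```python
import numpy as np

x = np.array(['Z11', 'X11', 'Z11'])
idx = np.argsort(x)              # order that sorts x
inv = idx.argsort()              # inverse permutation of idx
assert (x[idx][inv] == x).all()  # sorting then un-sorting restores x
```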

Sample run -

In [80]: df_freq
Out[80]: 
     A    B    C    D
0  Z11  Z11  X11  Z11
1  Y11            Y11
2  Z11  Z11       X11

In [81]: reset_rowwise_dups(df_freq)

In [82]: df_freq
Out[82]: 
     A  B    C    D
0  Z11  0  X11    0
1  Y11            0
2  Z11  0       X11

Runtime test

# Proposed earlier in this post
def reset_rowwise_dups(df):
    n = df.shape[0]
    row_idx = np.arange(n)[:,None]

    a = df.values
    idx = np.argsort(a,1)
    sorted_a = a[row_idx, idx]
    idx_reversed = idx.argsort(1)
    sorted_a_dupmask = sorted_a[:,1:] == sorted_a[:,:-1]
    dup_mask = np.column_stack((np.zeros(n,dtype=bool), sorted_a_dupmask))
    final_mask = dup_mask[row_idx, idx_reversed] & (a != '' )
    a[final_mask] = 0

# @piRSquared's soln using pandas apply
def apply_based(df):
    mask = df.apply(pd.Series.duplicated, 1) & df.astype(bool)
    return df.mask(mask, 0)

Timings -

In [151]: df_freq = pd.DataFrame([["Z11", "Z11", "X11", "Z11"], \
     ...:  ["Y11","","", "Y11"],["Z11","Z11","","X11"]], columns=list('ABCD'))

In [152]: df_freq
Out[152]: 
     A    B    C    D
0  Z11  Z11  X11  Z11
1  Y11            Y11
2  Z11  Z11       X11

In [153]: df = pd.concat([df_freq]*10000,axis=0)

In [154]: df.index = range(df.shape[0])

In [155]: %timeit apply_based(df)
1 loops, best of 3: 3.35 s per loop

In [156]: %timeit reset_rowwise_dups(df)
100 loops, best of 3: 12.7 ms per loop

1 Comment

Sorry was busy :-)
def replaceDuplicateData(nestedList):
    for row in range(len(nestedList)):
        uniqueDataRow = []
        for col in range(len(nestedList[row])):
            if nestedList[row][col] not in uniqueDataRow:
                uniqueDataRow.append(nestedList[row][col])
            else:
                nestedList[row][col] = 0
    return nestedList

nestedList = [["Z11", "Z11", "X11"], ["Y11","",""], ["Z11","Z11",""]]
print (replaceDuplicateData(nestedList))

You can use the function above to remove row-wise duplicates from your matrix before counting.
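A minor refinement (a sketch, same replacement convention as above): membership tests on a list are O(n) per cell, so for frames with hundreds of columns a per-row `set` is faster.

```python
def replace_duplicate_data(nested_list):
    for row in nested_list:
        seen = set()                 # O(1) membership test per cell
        for col, value in enumerate(row):
            if value in seen:
                row[col] = 0         # zero out repeats within the row
            else:
                seen.add(value)
    return nested_list

print(replace_duplicate_data([["Z11", "Z11", "X11"], ["Y11", "", ""]]))
# [['Z11', 0, 'X11'], ['Y11', '', 0]]
```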

