I have a pandas data frame:

>>> df_freq = pd.DataFrame([["Z11", "Z11", "X11"], ["Y11","",""], ["Z11","Z11",""]], columns=list('ABC'))

>>> df_freq
    A   B   C
0   Z11 Z11 X11
1   Y11     
2   Z11 Z11 

I want to make sure each row contains unique values only; removed values can be replaced with zero or an empty string. The frame should therefore become:

    A   B   C
0   Z11 0   X11
1   Y11     
2   Z11 0   

My data frame is large, with hundreds of columns and thousands of rows. The goal is to count the unique values in the data frame, which I do by converting it to a matrix and applying

>>> np.unique(mat.astype(str), return_counts=True)

But in some rows the same value occurs more than once, and I want to remove those repeats before applying np.unique(), keeping only the unique values in each row.
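To illustrate why the row-wise repeats matter for the count, here is a sketch of the counting step from the question applied to the sample frame as-is: Z11 is counted four times even though it appears in only two rows.

```python
import numpy as np
import pandas as pd

df_freq = pd.DataFrame([["Z11", "Z11", "X11"], ["Y11", "", ""], ["Z11", "Z11", ""]],
                       columns=list('ABC'))
vals, counts = np.unique(df_freq.values.astype(str), return_counts=True)
print(dict(zip(vals, counts)))  # 'Z11' is counted 4 times, not 2
```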

3 Answers


Use a combination of astype(bool) and duplicated:

mask = df_freq.apply(pd.Series.duplicated, 1) & df_freq.astype(bool)

df_freq.mask(mask, 0)

     A  B    C
0  Z11  0  X11
1  Y11        
2  Z11  0     
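Putting this together with the counting step from the question (a sketch): the `duplicated` term flags cells that repeat an earlier value in the same row, and the `astype(bool)` term keeps empty strings (which are falsy) out of the mask.

```python
import numpy as np
import pandas as pd

df_freq = pd.DataFrame([["Z11", "Z11", "X11"], ["Y11", "", ""], ["Z11", "Z11", ""]],
                       columns=list('ABC'))
# Flag within-row repeats, but only where the cell is non-empty.
mask = df_freq.apply(pd.Series.duplicated, axis=1) & df_freq.astype(bool)
deduped = df_freq.mask(mask, 0)
vals, counts = np.unique(deduped.values.astype(str), return_counts=True)
print(dict(zip(vals, counts)))  # 'Z11' now counted twice, once per row
```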

4 Comments

Could you convert it to an in-situ change? I'd like to time it.
Hey piR, what's the reason for df_freq.astype(bool)? The code seems to work fine without it.
@moondra OP had blanks '' in the dataframe, and multiple ones. I didn't want to put zeros where the second, third, etc. blanks showed up. I know that '' evaluates as False, so I use df_freq.astype(bool) to create a mask of the cells that are not blank. That, in conjunction with the duplicate mask, allows me to place the zeros accurately. Otherwise, I'd end up with an extra zero in df_freq.loc[1, 'C'].
Ah! You're right. I didn't even catch that extra 0 at df_freq.loc[1, 'C']
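A minimal check of the difference the comments describe, assuming the same sample frame: without the `astype(bool)` term, the second blank in row 1 is itself flagged as a duplicate and replaced.

```python
import pandas as pd

df_freq = pd.DataFrame([["Z11", "Z11", "X11"], ["Y11", "", ""], ["Z11", "Z11", ""]],
                       columns=list('ABC'))
# Duplicate mask alone: the repeated '' in row 1 counts as a duplicate,
# leaving a stray 0 at df_freq.loc[1, 'C'].
without_bool = df_freq.mask(df_freq.apply(pd.Series.duplicated, axis=1), 0)
print(without_bool.loc[1, 'C'])  # 0, not ''
```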

Here's a vectorized NumPy approach -

def reset_rowwise_dups(df):
    n = df.shape[0]
    row_idx = np.arange(n)[:,None]

    a = df.values
    idx = np.argsort(a,1)
    sorted_a = a[row_idx, idx]
    idx_reversed = idx.argsort(1)
    sorted_a_dupmask = sorted_a[:,1:] == sorted_a[:,:-1]
    dup_mask = np.column_stack((np.zeros(n,dtype=bool), sorted_a_dupmask))
    final_mask = dup_mask[row_idx, idx_reversed] & (a != '' )
    a[final_mask] = 0
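The inverse-argsort step is the subtle part: sorting each row groups equal values so duplicates can be found by comparing neighbors, and `idx.argsort(1)` is the inverse permutation that maps the resulting mask back to the original column order. A minimal check of that identity on a single row:

```python
import numpy as np

x = np.array(['Z11', 'X11', 'Z11'])
idx = np.argsort(x)              # order that sorts x
inv = idx.argsort()              # inverse permutation of idx
assert (x[idx][inv] == x).all()  # sorting then un-sorting restores x
```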

Sample run -

In [80]: df_freq
Out[80]: 
     A    B    C    D
0  Z11  Z11  X11  Z11
1  Y11            Y11
2  Z11  Z11       X11

In [81]: reset_rowwise_dups(df_freq)

In [82]: df_freq
Out[82]: 
     A  B    C    D
0  Z11  0  X11    0
1  Y11            0
2  Z11  0       X11

Runtime test

# Proposed earlier in this post
def reset_rowwise_dups(df):
    n = df.shape[0]
    row_idx = np.arange(n)[:,None]

    a = df.values
    idx = np.argsort(a,1)
    sorted_a = a[row_idx, idx]
    idx_reversed = idx.argsort(1)
    sorted_a_dupmask = sorted_a[:,1:] == sorted_a[:,:-1]
    dup_mask = np.column_stack((np.zeros(n,dtype=bool), sorted_a_dupmask))
    final_mask = dup_mask[row_idx, idx_reversed] & (a != '' )
    a[final_mask] = 0

# @piRSquared's soln using pandas apply
def apply_based(df):
    mask = df.apply(pd.Series.duplicated, 1) & df.astype(bool)
    return df.mask(mask, 0)

Timings -

In [151]: df_freq = pd.DataFrame([["Z11", "Z11", "X11", "Z11"], \
     ...:  ["Y11","","", "Y11"],["Z11","Z11","","X11"]], columns=list('ABCD'))

In [152]: df_freq
Out[152]: 
     A    B    C    D
0  Z11  Z11  X11  Z11
1  Y11            Y11
2  Z11  Z11       X11

In [153]: df = pd.concat([df_freq]*10000,axis=0)

In [154]: df.index = range(df.shape[0])

In [155]: %timeit apply_based(df)
1 loops, best of 3: 3.35 s per loop

In [156]: %timeit reset_rowwise_dups(df)
100 loops, best of 3: 12.7 ms per loop

1 Comment

Sorry was busy :-)
def replaceDuplicateData(nestedList):
    for row in range(len(nestedList)):
        uniqueDataRow = []
        for col in range(len(nestedList[row])):
            if nestedList[row][col] not in uniqueDataRow:
                uniqueDataRow.append(nestedList[row][col])
            else:
                nestedList[row][col] = 0
    return nestedList

nestedList = [["Z11", "Z11", "X11"], ["Y11","",""], ["Z11","Z11",""]]
print (replaceDuplicateData(nestedList))

You can use the function above to remove row-wise duplicates from your matrix before counting.
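A minor refinement (a sketch, same replacement convention as above): membership tests on a list are O(n) per cell, so for frames with hundreds of columns a per-row `set` is faster.

```python
def replace_duplicate_data(nested_list):
    for row in nested_list:
        seen = set()                 # O(1) membership test per cell
        for col, value in enumerate(row):
            if value in seen:
                row[col] = 0         # zero out repeats within the row
            else:
                seen.add(value)
    return nested_list

print(replace_duplicate_data([["Z11", "Z11", "X11"], ["Y11", "", ""]]))
# [['Z11', 0, 'X11'], ['Y11', '', 0]]
```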

