Pandas duplicate rows based on list items

Question

I have a pd.Series of list items. I define two locations to be duplicates if they have one or more list items in common. This definition should be transitive, meaning that if locations A and B are duplicates, and locations B and C are duplicates, then locations A and C are duplicates.

Examples:


In [117]: df
Out[117]: 
        A  dupe_group_ix
0  [A, B]              0
1  [D, X]              0
2     [B]              0
3  [D, A]              0
4     [A]              0

All rows are duplicates. Note that rows 0 and 1 are duplicates because rows 0 and 3 are duplicates, as are rows 1 and 3.
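To make the transitivity concrete, here is a minimal sketch in plain Python that uses the rows from the example above. The helper name `share_item` is made up for illustration. It shows that rows 0 and 1 have no item in common, yet both overlap with row 3.

```python
# Rows of column A from the first example
rows = [["A", "B"], ["D", "X"], ["B"], ["D", "A"], ["A"]]

def share_item(a, b):
    """Return True if the two lists have at least one item in common."""
    return bool(set(a) & set(b))

print(share_item(rows[0], rows[1]))  # False: no direct overlap
print(share_item(rows[0], rows[3]))  # True: both contain "A"
print(share_item(rows[1], rows[3]))  # True: both contain "D"
```

So a pairwise check alone is not enough. The grouping has to follow chains of overlapping rows.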


In [125]: df
Out[125]: 
        A  dupe_group_ix
0  [A, B]              0
1  [D, X]              1
2     [B]              0
3  [K, D]              1
4     [A]              0

In this example, there are two separate groups of duplicates.

So does that mean all rows are duplicates? A is in rows 0 and 4, B in rows 0 and 2, and D in rows 1 and 3.
– jezrael, Commented Apr 9, 2020 at 6:46

Allen Qin · Accepted Answer · 2020-04-09 08:16:59Z

1

You can use a helper function to map the group id:

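import numpy as np

grp = {'_': -1}  # item -> group id; the sentinel keeps max() defined on the first call

def map_grp_id(x):
    # Highest group id already assigned to any item in this row (-1 if none)
    grp_id = np.max([grp.get(e, -1) for e in x])
    if grp_id < 0:
        # No item seen before: open a new group and register every item in it
        grp_id = max(grp.values()) + 1
        grp.update({e: grp_id for e in x})
    return grp_id

df['dupe_group_ix'] = df.A.apply(map_grp_id)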

        A  dupe_group_ix
0  [A, B]              0
1  [D, X]              1
2     [B]              0
3  [D, K]              1
4     [A]              0
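One caveat: a single pass with a dictionary does not merge groups that only become connected by a later row. On the first example from the question, row 3 ([D, A]) links the group of D with the group of A, but the helper returns 0, 1, 0, 1, 0 rather than all zeros. A union-find (disjoint-set) pass is one way to handle that case. The following is a minimal sketch, not part of the original answer. The function name `dupe_groups` is made up for illustration, and it assumes every entry is a list (no NaNs).

```python
def dupe_groups(lists):
    """Label each list so that lists connected through shared items get the same id."""
    parent = list(range(len(lists)))  # union-find forest over row positions

    def find(i):
        # Follow parent links to the root, shortening the path on the way
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_row = {}  # item -> position of the first row that contained it
    for i, items in enumerate(lists):
        for item in items:
            j = first_row.setdefault(item, i)
            ri, rj = find(i), find(j)
            if ri != rj:
                # Merge the two groups; keep the smaller position as the root
                parent[max(ri, rj)] = min(ri, rj)

    # Renumber the roots as 0, 1, 2, ... in order of first appearance
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(lists))]

print(dupe_groups([["A", "B"], ["D", "X"], ["B"], ["D", "A"], ["A"]]))  # [0, 0, 0, 0, 0]
print(dupe_groups([["A", "B"], ["D", "X"], ["B"], ["K", "D"], ["A"]]))  # [0, 1, 0, 1, 0]
```

With a DataFrame, the result can be assigned with df['dupe_group_ix'] = dupe_groups(df['A'].tolist()). In this sketch an empty list ends up in a group of its own.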

answered Apr 9, 2020 at 8:16

Allen Qin


1 Comment

tsorn Over a year ago

Brilliant solution! Unless apply is faster than map, df['A'].map(map_grp_id, na_action='ignore') has the advantage of allowing NaNs. I would also suggest starting out with:

df.loc[df['A'].map(len, na_action='ignore').eq(0), 'A'] = np.nan; df['dupe_group_ix'] = np.nan; df['dupe_group_ix'] = df['dupe_group_ix'].astype(pd.Int64Dtype())

tsorn · Accepted Answer · 2020-05-07 12:17:52Z

0

An improved version of @Allen's answer that is much faster (150 ms vs. 1 min 46 s) and allows NaNs and empty lists.

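import pandas as pd

grp = {}  # item -> group id

def map_grp_id(x):
    # Reuse the id of the first item in this row that has been seen before
    for e in x:
        grp_id = grp.get(e, None)
        if grp_id is not None:
            break
    else:
        # No item seen before: start a new group (ids are unique, not consecutive)
        grp_id = len(grp)
    grp.update({e: grp_id for e in x})
    return grp_id

# Treat empty lists as missing so they are skipped by na_action='ignore'
df.loc[df['A'].map(len, na_action='ignore').eq(0), 'A'] = pd.NA
# Missing rows stay missing; cast to the nullable integer dtype afterwards
df['dupe_group_ix'] = df['A'].map(map_grp_id, na_action='ignore').astype(pd.Int64Dtype())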

edited May 7, 2020 at 12:17

answered Apr 9, 2020 at 10:55

tsorn

