
I have a pd.Series of list items. I define two locations to be duplicates if they have one or more list items in common. This definition should be transitive, meaning that if locations A and B are duplicates, and locations B and C are duplicates, then locations A and C are duplicates.

Examples:


In [117]: df
Out[117]: 
        A  dupe_group_ix
0  [A, B]              0
1  [D, X]              0
2     [B]              0
3  [D, A]              0
4     [A]              0

All rows are duplicates. Note that rows 0 and 1 are duplicates because rows 0 and 3 are duplicates, as are rows 1 and 3.


In [125]: df
Out[125]: 
        A  dupe_group_ix
0  [A, B]              0
1  [D, X]              1
2     [B]              0
3  [K, D]              1
4     [A]              0

In this example, there are two separate groups of duplicates.
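
For anyone who wants to run the answers against these examples, the two frames can be rebuilt roughly like this. The names `df_all` and `df_two` are just placeholders for this sketch; the question calls both frames `df`.

```python
import pandas as pd

# First example: every row ends up in one group through shared items.
df_all = pd.DataFrame({"A": [["A", "B"], ["D", "X"], ["B"], ["D", "A"], ["A"]]})

# Second example: two separate groups ({A, B} and {D, X, K}).
df_two = pd.DataFrame({"A": [["A", "B"], ["D", "X"], ["B"], ["K", "D"], ["A"]]})
```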

  • So it means all rows are duplicated? Because A is in 0,4 index, B in 0,2, D in 1, 3 ? Commented Apr 9, 2020 at 6:46

2 Answers


You can use a helper function to map the group id:

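import numpy as np

grp = {'_': -1}  # item -> group id; the dummy entry keeps max() defined on the first call

def map_grp_id(x):
    # Highest group id already assigned to any item in this row (-1 if none).
    grp_id = np.max([grp.get(e, -1) for e in x])
    if grp_id < 0:
        # No item seen before: open a new group and register every item in it.
        # Note: items are only registered here, when a new group is opened.
        grp_id = max(grp.values()) + 1
        grp.update({e: grp_id for e in x})
    return grp_id

df['dupe_group_ix'] = df.A.apply(map_grp_id)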

    A       dupe_group_ix
0   [A, B]              0
1   [D, X]              1
2   [B]                 0
3   [D, K]              1
4   [A]                 0
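
One caveat: on the first example in the question, the dictionary approach returns 0, 1, 0, 1, 0 rather than all zeros, because row 3 (`[D, A]`) links two groups that were already opened and nothing merges them afterwards. If full transitivity is needed, a union-find (disjoint-set) pass over the items is one way to get it. This is a minimal sketch, not the answerer's method: the function name `dupe_groups` is made up here, and it assumes every row is a non-empty list (no NaNs).

```python
import pandas as pd


def dupe_groups(series):
    """Give rows the same id if they share a list item directly or through a chain."""
    parent = {}  # union-find forest over list items

    def find(item):
        # Walk up to the root, halving the path along the way.
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Pass 1: merge all items that appear together in a row.
    for items in series:
        for item in items:
            union(items[0], item)

    # Pass 2: label each row by the root of its first item,
    # numbering roots in order of first appearance.
    roots = [find(items[0]) for items in series]
    labels = {}
    return pd.Series(
        [labels.setdefault(root, len(labels)) for root in roots],
        index=series.index,
    )
```

It would be called as `df['dupe_group_ix'] = dupe_groups(df['A'])`. Two passes are needed because a later row can merge groups that earlier rows were already assigned to.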

1 Comment

Brilliant solution! Unless there's a speed improvement to using apply vs map, then map(map_grp_id, na_action='ignore') allows for nans. Also, I would suggest starting out with df.loc[df['A'].map(len, na_action='ignore').eq(0), 'A'] = np.nan; df['dupe_group_ix'] = np.nan; df['dupe_group_ix'] = df['dupe_group_ix'].astype(pd.Int64Dtype())

An improved version of @Allen's answer that is much faster (150 ms vs. 1 min 46 s) and also allows NaNs and empty lists.

        grp = {}

        def map_grp_id(x):
            for e in x:
                grp_id = grp.get(e, None)
                if grp_id is not None:
                    break
            else:
                grp_id = len(grp)
            grp.update({e: grp_id for e in x})
            return grp_id

        df.loc[df['A'].map(len, na_action='ignore').eq(0), 'A'] = pd.NA
        df['dupe_group_ix'] = pd.NA
        df['dupe_group_ix'] = df['dupe_group_ix'].astype(pd.Int64Dtype())
        df['dupe_group_ix'] = df['A'].map(map_grp_id, na_action='ignore')
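
Note that a new group's id here is `len(grp)`, the number of items seen so far, so the ids are unique but can skip numbers: for the second example in the question the result is 0, 2, 0, 2, 0. If consecutive labels matter, `pd.factorize` can renumber them afterwards. A small sketch, with `raw` and `consecutive` as placeholder names:

```python
import pandas as pd

# Ids the mapping above gives for the question's second example.
raw = pd.Series([0, 2, 0, 2, 0], dtype="Int64")

# factorize numbers the distinct values 0, 1, 2, ... in order of first appearance;
# missing values get the code -1.
codes, uniques = pd.factorize(raw)
consecutive = pd.Series(codes, index=raw.index)
```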
