Trying to see how hard or easy this is to do with Pandas.

Let's say one has two columns with data such as:

Cat1  Cat2
A        1
A        2
A        3
B        1
B        2
C        1
C        2
C        3
D        4

As you can see, A and C share the same three elements: 1, 2, 3. B, however, has only two elements, 1 and 2. D has only one element: 4.

How would one programmatically get to this same result? The idea is to have each group returned somehow: [A, C] with [1, 2, 3], then [B] with [1, 2], and [D] with [4].

I know a program can be written to do this, so I am trying to figure out whether pandas has something built in that does it without having to build it from scratch.

Thanks!

2 Answers

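You can use groupby twice to achieve this: the first pass collects the Cat2 values of each Cat1 into a tuple, and the second pass groups the Cat1 labels that ended up with the same tuple.

# Collect the distinct Cat2 values of each Cat1 as a sorted tuple
df = df.groupby('Cat1')['Cat2'].apply(lambda x: tuple(sorted(set(x)))).reset_index()
# Group the Cat1 labels that share the same tuple
df = df.groupby('Cat2')['Cat1'].apply(lambda x: tuple(sorted(set(x)))).reset_index()

I'm using tuple because pandas needs the group keys to be hashable in order to do a groupby, and sorted because the iteration order of a set is not guaranteed, so equal sets should be normalized to the same tuple. The code above doesn't distinguish between (1, 2, 3) and (1, 1, 2, 3). If you want to make this distinction, drop the set call and keep only sorted.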

The resulting output:

        Cat2    Cat1
0     (1, 2)    (B,)
1  (1, 2, 3)  (A, C)
2       (4,)    (D,)
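
For anyone who wants to run this end to end, here is a minimal, self-contained sketch of the same two-step groupby on the sample data from the question. The variable names (per_cat1, result, groups) are illustrative only, and sorted(set(x)) is used rather than set(x) alone so that the tuples do not depend on set iteration order.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Cat1": list("AAABBCCCD"),
    "Cat2": [1, 2, 3, 1, 2, 1, 2, 3, 4],
})

# Step 1: one hashable, order-independent tuple of Cat2 values per Cat1.
per_cat1 = (
    df.groupby("Cat1")["Cat2"]
    .apply(lambda x: tuple(sorted(set(x))))
    .reset_index()
)

# Step 2: group the Cat1 labels that share the same tuple.
result = (
    per_cat1.groupby("Cat2")["Cat1"]
    .apply(lambda x: tuple(sorted(set(x))))
    .reset_index()
)

# Plain-dict view of the result: {tuple of Cat2 values: tuple of Cat1 labels}.
groups = dict(zip(result["Cat2"], result["Cat1"]))
print(groups)
```

The final dict is just a convenience for looking up which Cat1 labels share a given set of Cat2 values; the result DataFrame holds the same information.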

You could also:

df = df.set_index('Cat1', append=True).unstack().loc[:, 'Cat2']
df = pd.Series({col: tuple(values.dropna()) for col, values in df.items()})
df = df.groupby(df.values).apply(lambda x: list(x.index))

to get

                   Cat1
(1.0, 2.0)          [B]
(1.0, 2.0, 3.0)  [A, C]
(4.0,)              [D]
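
The keys in that output are floats because unstack fills the missing Cat1/row combinations with NaN, which upcasts the integer column to float64. Below is a minimal sketch, assuming the sample data from the question, that casts the values back to int after dropping the NaNs. Note that it swaps the final groupby for a plain dict inversion to keep the example simple; the names wide, per_cat1 and groups are illustrative only.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Cat1": list("AAABBCCCD"),
    "Cat2": [1, 2, 3, 1, 2, 1, 2, 3, 4],
})

# One column per Cat1 label; cells with no value become NaN (hence float64).
wide = df.set_index("Cat1", append=True).unstack().loc[:, "Cat2"]

# Drop the NaNs and cast back to int, giving one sorted tuple per Cat1 label.
per_cat1 = {
    col: tuple(sorted({int(v) for v in values.dropna()}))
    for col, values in wide.items()
}

# Invert the mapping: tuple of Cat2 values -> list of Cat1 labels.
groups = {}
for cat1, key in per_cat1.items():
    groups.setdefault(key, []).append(cat1)

print(groups)
```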
