
I am working with several tables which have many-to-many relationships. What is the most efficient way to transform this data to ensure that the category column is unique and that all of the corresponding units are combined into a single row?

category    unit
A01         97337
A01         97333
A01         97334
A01         97343
A01         26223
A01         26226
A01         22722
A01         93397
A01         97332
A01         97342
A01         97369
A01         97734
A01         97332
P76         97343
P76         26223
P76         27399
P76         27277
P76         27234
P76         27297
P76         27292
P76         22723
P76         93622
P76         27343
P76         27234
P98         97337

Into this:

category    category_units
 A01        97337, 97333, 97334, 97343, 26223, 26226, 22722, 93397, 97332, 97342, 97369, 97734, 97332
 P76        97343, 26223, 27399, 27277, 27234, 27297, 27292, 22723, 93622, 27343, 27234
 P98        97337

One row per category (which serves as a primary key), with all of the corresponding units concatenated into a single column, separated by commas.

I would be joining this data back to another fact table, and eventually the end user would filter category_units with a 'contains' condition on some value, so it would pull up all rows associated with that value.
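
For context, here is a minimal sketch of that downstream step, assuming a hypothetical fact table named fact and the final column named category_units (neither name comes from the question). Using a word-boundary regex in str.contains avoids matching one unit inside a longer one:

import pandas as pd

# Hypothetical tables; the names and values are illustrative assumptions.
fact = pd.DataFrame({'category': ['A01', 'P76', 'P98'],
                     'amount': [100, 200, 300]})
lookup = pd.DataFrame({'category': ['A01', 'P76', 'P98'],
                       'category_units': ['97337, 97333', '97343, 26223', '97337']})

merged = fact.merge(lookup, on='category', how='left')

# "Contains" filter: \b keeps 97343 from also matching e.g. 197343.
result = merged[merged['category_units'].str.contains(r'\b97343\b', na=False)]
print(result)
#   category  amount category_units
# 1      P76     200   97343, 26223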

1 Answer

You can use groupby with apply and join; if the unit column is numeric, it is necessary to cast it to string first:

df1 = (df.groupby('category')['unit']
         .apply(lambda x: ', '.join(x.astype(str)))
         .reset_index())
print(df1)
  category                                               unit
0      A01  97337, 97333, 97334, 97343, 26223, 26226, 2272...
1      P76  97343, 26223, 27399, 27277, 27234, 27297, 2729...
2      P98                                              97337
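
If you also want the column named category_units, as in the desired output, reset_index accepts a name argument (a small addition on top of the answer above):

df1 = (df.groupby('category')['unit']
         .apply(lambda x: ', '.join(x.astype(str)))
         .reset_index(name='category_units'))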

Another solution with casting first:

df['unit'] = df['unit'].astype(str)
df1 = df.groupby('category')['unit'].apply(', '.join).reset_index()
print(df1)
  category                                               unit
0      A01  97337, 97333, 97334, 97343, 26223, 26226, 2272...
1      P76  97343, 26223, 27399, 27277, 27234, 27297, 2729...
2      P98                                              97337
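
Since the values are already strings at this point, agg works as well and is slightly terser; the result is identical, it is just an alternative spelling:

df1 = df.groupby('category')['unit'].agg(', '.join).reset_index()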

3 Comments

Cool, works great. Due to the nature of some of my joins (a lot of intermediate tables), my end result has some duplicates within a row: for example, category = A01, units = 97337, 26223, 97337. Is there a way to cleanly remove the duplicates at the row level? I was thinking of using .str.split(), but then I didn't know how to retain only the unique values per row.
You can use set or unique, like df1 = df.groupby('category')['unit'].apply(lambda x: ', '.join(x.unique().astype(str))) or df1 = df.groupby('category')['unit'].apply(lambda x: ', '.join(set(x.astype(str)))) (a runnable sketch follows the comments).
This all worked perfectly. I actually had to apply it to several many-to-many tables and the results are exactly what I wanted. Thanks!
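
To make the deduplication suggestion from the comments concrete, here is a small self-contained sketch of the unique() variant; note that unique() preserves first-seen order, while set() does not guarantee any order:

import pandas as pd

df = pd.DataFrame({'category': ['A01', 'A01', 'A01'],
                   'unit': [97337, 26223, 97337]})

# unique() drops the repeated 97337 while keeping first-seen order.
df1 = (df.groupby('category')['unit']
         .apply(lambda x: ', '.join(x.unique().astype(str)))
         .reset_index(name='category_units'))
print(df1)
#   category category_units
# 0      A01   97337, 26223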
