
I am working with several tables which have many-to-many relationships. What is the most efficient way to transform this data to ensure that the category column is unique and that all of the corresponding units are combined into a single row?

category    unit
A01         97337
A01         97333
A01         97334
A01         97343
A01         26223
A01         26226
A01         22722
A01         93397
A01         97332
A01         97342
A01         97369
A01         97734
A01         97332
P76         97343
P76         26223
P76         27399
P76         27277
P76         27234
P76         27297
P76         27292
P76         22723
P76         93622
P76         27343
P76         27234
P98         97337

Into this:

category    category_units
 A01        97337, 97333, 97334, 97343, 26223, 26226, 22722, 93397, 97332, 97342, 97369, 97734, 97332
 P76        97343, 26223, 27399, 27277, 27234, 27297, 27292, 22723, 93622, 27343, 27234
 P98        97337

One row per category (which serves as a primary key), with all of the corresponding units concatenated into a single column, separated by commas.

I would be joining this data back to another fact table, and eventually the end user would filter category_units with a 'contains' condition on some value, so it would pull up all rows associated with that value.
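
For context, here is a minimal sketch of that downstream step, assuming a hypothetical fact table named fact and the final column named category_units (neither name comes from the question). Using a word-boundary regex in str.contains avoids matching one unit inside a longer one:

import pandas as pd

# Hypothetical tables; the names and values are illustrative assumptions.
fact = pd.DataFrame({'category': ['A01', 'P76', 'P98'],
                     'amount': [100, 200, 300]})
lookup = pd.DataFrame({'category': ['A01', 'P76', 'P98'],
                       'category_units': ['97337, 97333', '97343, 26223', '97337']})

merged = fact.merge(lookup, on='category', how='left')

# "Contains" filter: \b keeps 97343 from also matching e.g. 197343.
result = merged[merged['category_units'].str.contains(r'\b97343\b', na=False)]
print(result)
#   category  amount category_units
# 1      P76     200   97343, 26223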

1 Answer

You can use groupby with apply and join; if the unit column is numeric, it is necessary to cast it to string first:

df1 = (df.groupby('category')['unit']
         .apply(lambda x: ', '.join(x.astype(str)))
         .reset_index())
print(df1)
  category                                               unit
0      A01  97337, 97333, 97334, 97343, 26223, 26226, 2272...
1      P76  97343, 26223, 27399, 27277, 27234, 27297, 2729...
2      P98                                              97337
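
If you also want the column named category_units, as in the desired output, reset_index accepts a name argument (a small addition on top of the answer above):

df1 = (df.groupby('category')['unit']
         .apply(lambda x: ', '.join(x.astype(str)))
         .reset_index(name='category_units'))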

Another solution with casting first:

df['unit'] = df['unit'].astype(str)
df1 = df.groupby('category')['unit'].apply(', '.join).reset_index()
print(df1)
  category                                               unit
0      A01  97337, 97333, 97334, 97343, 26223, 26226, 2272...
1      P76  97343, 26223, 27399, 27277, 27234, 27297, 2729...
2      P98                                              97337
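
Since the values are already strings at this point, agg works as well and is slightly terser; the result is identical, it is just an alternative spelling:

df1 = df.groupby('category')['unit'].agg(', '.join).reset_index()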

3 Comments

Cool, works great. Due to the nature of some of my joins (a lot of intermediate tables), my end result has some duplicates within a row: for example, category = A01, units = 97337, 26223, 97337. Is there a way to cleanly remove the duplicates at the row level? I was thinking of using .str.split(), but then I didn't know how to retain only the unique values per row.
You can use set or unique, like df1 = df.groupby('category')['unit'].apply(lambda x: ', '.join(x.unique().astype(str))) or df1 = df.groupby('category')['unit'].apply(lambda x: ', '.join(set(x.astype(str)))) (a runnable sketch follows the comments).
This all worked perfectly. I actually had to apply it to several many-to-many tables and the results are exactly what I wanted. Thanks!
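
To make the deduplication suggestion from the comments concrete, here is a small self-contained sketch of the unique() variant; note that unique() preserves first-seen order, while set() does not guarantee any order:

import pandas as pd

df = pd.DataFrame({'category': ['A01', 'A01', 'A01'],
                   'unit': [97337, 26223, 97337]})

# unique() drops the repeated 97337 while keeping first-seen order.
df1 = (df.groupby('category')['unit']
         .apply(lambda x: ', '.join(x.unique().astype(str)))
         .reset_index(name='category_units'))
print(df1)
#   category category_units
# 0      A01   97337, 26223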
