Join rows and concatenate attribute values in a csv file with pandas

Question

I have a csv file structured like this:

As you can see, many rows are repeated: they represent the same entity and differ only in the 'category' attribute. I would like to merge those rows and combine all of their categories into a single value.

For example, the 'category' value for Walmart should be: "Retail, Dowjones, SuperMarketChains".

Edit:

I would like the output table to be structured like this:

Edit 2:

What worked for me was:

df4.groupby(["ID azienda","Name","Company code", "Marketcap", "Share price", "Earnings", "Revenue", "Shares", "Employees"]
)['Category'].agg(list).reset_index()
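
A minimal sketch of the same idea on made-up data, assuming shortened column names (`ID`, `Name`, `Marketcap`) in place of the real ones. It joins the categories into one string, as asked for Walmart above, instead of collecting them in a list. `dropna=False` is an addition: by default `groupby` silently drops rows whose key columns contain NaN, which is easy to hit when grouping on many columns.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the real file has more columns.
df4 = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "Name": ["Walmart", "Walmart", "Walmart", "Amazon", "Amazon"],
    "Marketcap": [400.0, 400.0, 400.0, np.nan, np.nan],
    "Category": ["Retail", "Dowjones", "SuperMarketChains", "Ecommerce", "Internet"],
})

key_cols = ["ID", "Name", "Marketcap"]  # every column except Category

# dropna=False keeps groups whose key columns contain NaN (pandas >= 1.1).
result = (
    df4.groupby(key_cols, dropna=False)["Category"]
       .agg(", ".join)
       .reset_index()
)
print(result)
```

Without `dropna=False`, the Amazon rows in this sample would disappear from the result because their `Marketcap` is missing.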

Please show an example of the output table you want. You might look into pivot.
– mitoRibo, Commented Jan 30, 2023 at 19:27
df.groupby(grp_by_cols)['Category'].agg(list).reset_index()? Here grp_by_cols is a list of column names: ['ID', 'Name', 'Company code', . . . ]. Or you can group by the ID column, transform, and drop duplicates.
– It_is_Chris, Commented Jan 30, 2023 at 19:30
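
The second suggestion in the comment above (group by the ID column, transform, drop duplicates) could look roughly like this. It is a sketch on made-up data with shortened column names, not the commenter's own code.

```python
import pandas as pd

# Hypothetical sample data with shortened column names.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "Name": ["Walmart", "Walmart", "Walmart", "Amazon", "Amazon"],
    "Category": ["Retail", "Dowjones", "SuperMarketChains", "Ecommerce", "Internet"],
})

# transform returns a Series aligned with the original rows, so every row
# of a group receives the joined string for that group.
df["Category"] = df.groupby("ID")["Category"].transform(lambda s: ", ".join(s))

# All rows of one ID are now identical; keep the first of each.
out = df.drop_duplicates("ID").reset_index(drop=True)
print(out)
```

Because only the ID is used as the grouping key, the other columns are carried over unchanged from the first row of each entity.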

geekay · Accepted Answer · 2023-01-30 20:09:43Z

Quick and Dirty

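# Join the categories of each Name into one string; NaN values are dropped
# first, since ','.join raises "expected str instance, float found" on them
df2 = df.groupby("Name")['Category'].apply(lambda s: ', '.join(s.dropna().astype(str)))

# Write the joined value back to every row, then keep one row per Name
df['Category'] = df['Name'].map(df2)
df = df.drop_duplicates('Name')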

If you prefer the categories to be stored as a list in the column, change the df2 assignment to:

df2=df.groupby("Name")['Category'].apply(list)

edited Jan 30, 2023 at 20:09

answered Jan 30, 2023 at 19:49

3 Comments

Rodolfo Over a year ago

I went with the second option. What does "name" in parentheses stand for? I'm now getting an error: "sequence item 0: expected str instance, float found"

geekay Over a year ago

It refers to the column "Name". If you want to group by "Company code" instead, change "Name" to "Company code". I have also edited the incomplete code.

geekay Over a year ago

The last line removes the duplicate rows after the categories have been concatenated, so there will be only one Amazon row and one Walmart row, each with multiple categories.
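
The error reported in the first comment is what `str.join` raises when one of the items is not a string. A likely cause, assumed here because the csv itself is not shown, is an empty `Category` cell, which pandas reads as the float NaN. A sketch on made-up data that drops such values before joining:

```python
import numpy as np
import pandas as pd

# Plain Python reproduces the message: NaN is a float, not a str.
try:
    ",".join(["Retail", np.nan])
except TypeError as err:
    print(err)  # sequence item 1: expected str instance, float found

# Hypothetical sample data with one missing category.
df = pd.DataFrame({
    "Name": ["Walmart", "Walmart", "Amazon", "Amazon"],
    "Category": ["Retail", np.nan, "Ecommerce", "Internet"],
})

# Drop missing values (and cast the rest to str) before joining.
joined = df.groupby("Name")["Category"].apply(
    lambda s: ", ".join(s.dropna().astype(str))
)
print(joined)
```

If missing categories should stay visible in the result, `s.fillna("unknown")` in place of `s.dropna()` is one alternative.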

mitoRibo · Accepted Answer · 2023-01-30 19:31:13Z

Not sure if you want a new table or just a list of the categories. Below is how you could make a table with the hashes, if those are important:

import pandas as pd
df = pd.DataFrame({
    'Name':['W','W','W','A','A','A'],
    'Category':['Retail','Dow','Chain','Ecom','Internet','Dow'],
    'Hash':[1,2,3,4,5,6],
})

# print(df)
#   Name  Category  Hash
# 0    W    Retail     1
# 1    W       Dow     2
# 2    W     Chain     3
# 3    A      Ecom     4
# 4    A  Internet     5
# 5    A       Dow     6

#Make a new df which has one row per company and one column per category, values are hashes
piv_df = df.pivot(
    index = 'Name',
    columns = 'Category',
    values = 'Hash',
)

# print(piv_df)
# Category  Chain  Dow  Ecom  Internet  Retail
# Name                                        
# A           NaN  6.0   4.0       5.0     NaN
# W           3.0  2.0   NaN       NaN     1.0

answered Jan 30, 2023 at 19:31

1 Comment

Rodolfo Over a year ago

I would like a new table with the same columns as the initial one (same schema), just without the repetition of the same entity because of different categories
