I have a csv file structured like this:

[Image: screenshot of the CSV file]

As you can see, many rows are repeated: they represent the same entity and differ only in the 'category' attribute. I would like to merge those rows and combine all of their categories into a single value.

For example, the 'category' attribute for Walmart should be: "Retail, Dowjones, SuperMarketChains".

Edit:

I would like the output table to be structured like this:

[Image: screenshot of the desired output table]

Edit 2:

What worked for me was:

df4.groupby(
    ["ID azienda", "Name", "Company code", "Marketcap", "Share price",
     "Earnings", "Revenue", "Shares", "Employees"]
)['Category'].agg(list).reset_index()
  • Please show an example of the output table you want. You might look into pivot. Commented Jan 30, 2023 at 19:27
  • df.groupby(grp_by_cols)['Category'].agg(list).reset_index()? Here grp_by_cols is a list of column names: ['ID', 'Name', 'Company code', . . . ]. Or you can group by the ID column, transform, and drop duplicates. Commented Jan 30, 2023 at 19:30
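
The "group by the ID column, transform, and drop duplicates" idea from the comment above could look roughly like the sketch below. The sample data and the short column name "ID" are assumptions made for illustration; the real file uses its own column names.

```python
import pandas as pd

# Hypothetical sample data; the values are invented for this sketch.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "Name": ["Walmart", "Walmart", "Walmart", "Amazon", "Amazon"],
    "Category": ["Retail", "Dowjones", "SuperMarketChains", "Ecommerce", "Internet"],
})

# transform broadcasts each group's joined string back onto every row of that group
df["Category"] = df.groupby("ID")["Category"].transform(lambda s: ", ".join(s))

# The rows of one company are now identical, so keep only the first of each
result = df.drop_duplicates("ID").reset_index(drop=True)
print(result)
```

Unlike agg, transform keeps the original row count, which is why the drop_duplicates step is needed afterwards.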

2 Answers

Quick and Dirty

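# Join all categories of each company into one comma-separated string
df2 = df.groupby("Name")['Category'].apply(', '.join)

# Look up the joined string for every row, then keep one row per company
df['Category'] = df['Name'].map(df2)
df = df.drop_duplicates('Name')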

If you prefer the categories to be stored as a list in the 'Category' column, change the first line to:

df2=df.groupby("Name")['Category'].apply(list)

3 Comments

I went with the second option. What does "name" in parentheses stand for? I'm now getting an error: "sequence item 0: expected str instance, float found"
"Name" is the column name. If you want to use "Company code" instead, change "Name" to "Company code". I also edited the incomplete code.
The last line removes the duplicate rows after the categories have been concatenated, so there will be only one Amazon row and one Walmart row, each with multiple categories.
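
The error quoted in the first comment is what str.join raises when it meets a non-string item. A likely cause, though it cannot be confirmed from the thread, is a missing value (NaN, which is a float) in the Category column. A minimal sketch of one way around it, using made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing category (an assumption, not the asker's file)
df = pd.DataFrame({
    "Name": ["Walmart", "Walmart", "Amazon", "Amazon"],
    "Category": ["Retail", np.nan, "Ecommerce", "Internet"],
})

# Drop missing values inside each group and cast the rest to str before joining
joined = df.groupby("Name")["Category"].agg(
    lambda s: ", ".join(s.dropna().astype(str))
)
print(joined)
```

A group whose categories are all missing ends up with an empty string here; whether that is the desired result depends on the data.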
Not sure if you want a new table or just a list of the categories. Below is how you could make a table with the hashes, if those are important:

import pandas as pd
df = pd.DataFrame({
    'Name':['W','W','W','A','A','A'],
    'Category':['Retail','Dow','Chain','Ecom','Internet','Dow'],
    'Hash':[1,2,3,4,5,6],
})

# print(df)
#   Name  Category  Hash
# 0    W    Retail     1
# 1    W       Dow     2
# 2    W     Chain     3
# 3    A      Ecom     4
# 4    A  Internet     5
# 5    A       Dow     6

# Make a new df with one row per company and one column per category; the values are the hashes
piv_df = df.pivot(
    index = 'Name',
    columns = 'Category',
    values = 'Hash',
)

# print(piv_df)
# Category  Chain  Dow  Ecom  Internet  Retail
# Name                                        
# A           NaN  6.0   4.0       5.0     NaN
# W           3.0  2.0   NaN       NaN     1.0

1 Comment

I would like a new table with the same columns as the initial one (same schema), just without the repetition of the same entity because of different categories
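
For the same-schema result requested in this comment, one option is to group on every column except Category and join the categories. The sketch below uses a shortened, invented set of columns and values. The dropna=False argument is there because groupby otherwise silently drops rows whose key columns contain NaN, which is easy to hit when grouping on many columns.

```python
import pandas as pd

# Hypothetical, shortened version of the table; all values are placeholders
df = pd.DataFrame({
    "Name": ["Walmart", "Walmart", "Walmart", "Amazon", "Amazon"],
    "Company code": ["WMT", "WMT", "WMT", "AMZN", "AMZN"],
    "Employees": [100, 100, 100, 200, 200],
    "Category": ["Retail", "Dowjones", "SuperMarketChains", "Ecommerce", "Internet"],
})

# Every column except Category identifies the entity
key_cols = [c for c in df.columns if c != "Category"]

result = (
    df.groupby(key_cols, sort=False, dropna=False)["Category"]
      .agg(", ".join)           # one comma-separated string per entity
      .reset_index()
)
result = result[list(df.columns)]  # restore the original column order
print(result)
```

sort=False keeps the companies in their order of first appearance instead of sorting them by the key columns.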
