How to remove duplicates in a data frame using Python

Question

The data frame is:

Product    Price  Weight  Range   Count
   A        40      20      1-3     20
   A        40      20      4-7     23
   B        20      73      1-3     54
   B        20      73      4-7     43
   B        20      73      8-15    34
   B        20      73      >=16    12
   C        10      20      4-7     22

Each product has a price and a weight. Range is the number of consecutive days on which the product was sold, and Count is the number of products sold in that range.
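For anyone who wants to reproduce the answers, the table above can be rebuilt as a DataFrame. This is a minimal sketch; the variable name df matches the answers, and the dtypes (integers for Price, Weight and Count, plain strings for Range) are an assumption, since the question does not state them.

```python
import pandas as pd

# Sample data copied from the table in the question.
# Assumption: Range is stored as plain strings, the numeric columns as integers.
df = pd.DataFrame({
    "Product": ["A", "A", "B", "B", "B", "B", "C"],
    "Price":   [40, 40, 20, 20, 20, 20, 10],
    "Weight":  [20, 20, 73, 73, 73, 73, 20],
    "Range":   ["1-3", "4-7", "1-3", "4-7", "8-15", ">=16", "4-7"],
    "Count":   [20, 23, 54, 43, 34, 12, 22],
})

print(df)
```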

Expected Output

Product    Price  Weight  Range   Count
   A        40      20      1-3     20
                            4-7     23
   B        20      73      1-3     54
                            4-7     43
                            8-15    34
                            >=16    12
   C        10      20      4-7     22

or

Product    Price  Weight  1-3   4-7   8-15  >=16
   A        40      20     20    23    NaN   NaN
   B        20      73     54    43    34    12
   C        10      20      0    22    NaN   NaN

Did you try my solution at all? It's almost the same as the answer you just accepted.
– cs95, Commented Jun 6, 2018 at 6:18

cs95 · Answer · 2018-06-06 05:13:56Z

3

Producing the second output makes more sense than the first. Use set_index, followed by unstack:

(df.set_index(['Product', 'Price', 'Weight', 'Range'])
  .Count
  .unstack(fill_value=0)
  .reset_index()
)

Range Product  Price  Weight  1-3  4-7  8-15  >=16
0           A     40      20   20   23     0     0
1           B     20      73   54   43    34    12
2           C     10      20    0   22     0     0
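A self-contained sketch of the same set_index/unstack idea, for readers who want to run it. The sample frame is rebuilt from the question; the names keys and wide and the explicit duplicate check are additions for illustration, not part of the answer. The check is there because unstack raises a ValueError when the same key combination appears more than once.

```python
import pandas as pd

# Sample data rebuilt from the question (dtypes are an assumption).
df = pd.DataFrame({
    "Product": ["A", "A", "B", "B", "B", "B", "C"],
    "Price":   [40, 40, 20, 20, 20, 20, 10],
    "Weight":  [20, 20, 73, 73, 73, 73, 20],
    "Range":   ["1-3", "4-7", "1-3", "4-7", "8-15", ">=16", "4-7"],
    "Count":   [20, 23, 54, 43, 34, 12, 22],
})

keys = ["Product", "Price", "Weight", "Range"]

# unstack cannot reshape an index with repeated entries, so fail early
# with a clearer message if any key combination is duplicated.
if df.duplicated(subset=keys).any():
    raise ValueError("duplicate key combinations; aggregate before unstacking")

wide = (
    df.set_index(keys)["Count"]
      .unstack(fill_value=0)   # Range values become columns, gaps become 0
      .reset_index()
)

# Optional: drop the leftover "Range" label on the column axis.
wide.columns.name = None

print(wide)
```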

answered Jun 6, 2018 at 5:13

cs95


jezrael · Accepted Answer · 2018-06-06 06:17:00Z

2

In my opinion, the first output is not recommended if you need to process the DataFrame later.

The second output is much better. If the real data contains duplicates, it is necessary to aggregate the values, e.g. by sum:

# convert categoricals to strings
df['Range'] = df['Range'].astype(str)

df = (df.groupby(['Product', 'Price', 'Weight', 'Range'])['Count']
        .sum()
        .unstack(fill_value=0)
        .reset_index())
print(df)
Range Product  Price  Weight  1-3  4-7  8-15  >=16
0           A     40      20   20   23     0     0
1           B     20      73   54   43    34    12
2           C     10      20    0   22     0     0
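To show why the aggregation step matters, here is a small sketch with a deliberately duplicated key. The data is hypothetical (the question's sample has no duplicates), and Range is made categorical on the assumption that this mirrors the asker's real data. Whether the CategoricalIndex error quoted in the comments reproduces may depend on the pandas version, so the sketch simply converts to strings first, as the answer does.

```python
import pandas as pd

# Hypothetical data: the first two rows share the same
# (Product, Price, Weight, Range) key, so they must be aggregated.
df = pd.DataFrame({
    "Product": ["A", "A", "A", "C"],
    "Price":   [40, 40, 40, 10],
    "Weight":  [20, 20, 20, 20],
    "Range":   ["1-3", "1-3", "4-7", "4-7"],
    "Count":   [20, 5, 23, 22],
})

# Assumption: in the real data Range is a categorical column.
df["Range"] = pd.Categorical(
    df["Range"], categories=["1-3", "4-7", "8-15", ">=16"], ordered=True
)

# Convert categoricals to strings before reshaping, as in the answer.
df["Range"] = df["Range"].astype(str)

wide = (
    df.groupby(["Product", "Price", "Weight", "Range"])["Count"]
      .sum()                   # duplicated keys are summed: 20 + 5
      .unstack(fill_value=0)
      .reset_index()
)

print(wide)
```

Only the ranges that actually occur in the data become columns here, because the unused categories are dropped by the string conversion.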

answered Jun 6, 2018 at 6:17

jezrael


5 Comments

cs95 Over a year ago

Something tells me this OP is biased. Two nearly identical solutions, one posted an hour earlier. I bet set index is faster than groupby too, but it seems they don't care. Oh well, you are a lucky man. Enjoy

san Over a year ago

It was giving an error: cannot insert an item into a CategoricalIndex that is not already an existing category

san Over a year ago

I just saw this code: df['Range'] = df['Range'].astype(str), and it worked perfectly.

cs95 Over a year ago

@san Thanks for not telling me about it. I wouldn't know there was an error if you didn't tell me, because I can't read minds.

san Over a year ago

Sorry mate, won't happen again. No offense.

Mohamed Thasin ah · Answer · 2018-06-06 05:31:45Z

1

Try this:

mask=df.duplicated(subset=['Product'])
df.loc[mask,['Product','Price','Weight']]=''

Output:

  Product Price Weight Range  Count
0       A    40     20   1-3     20
1                        4-7     23
2       B    20     73   1-3     54
3                        4-7     43
4                       8-15     34
5                       >=16     12
6       C    10     20   4-7     22
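A hedged variant of the blanking approach that leaves the original frame untouched. Writing '' into integer columns changes their dtype, and recent pandas versions may warn or refuse, so this sketch first casts the three columns to strings on a copy. The names cols and shown are hypothetical, and the mask checks all three columns rather than Product alone, which is a small deviation from the answer.

```python
import pandas as pd

# Sample data rebuilt from the question (dtypes are an assumption).
df = pd.DataFrame({
    "Product": ["A", "A", "B", "B", "B", "B", "C"],
    "Price":   [40, 40, 20, 20, 20, 20, 10],
    "Weight":  [20, 20, 73, 73, 73, 73, 20],
    "Range":   ["1-3", "4-7", "1-3", "4-7", "8-15", ">=16", "4-7"],
    "Count":   [20, 23, 54, 43, 34, 12, 22],
})

cols = ["Product", "Price", "Weight"]

# Work on a display-only copy with the three columns cast to strings,
# so blanks can be stored without mixing integers and text.
shown = df.astype({c: str for c in cols})

# True for every row that repeats an earlier (Product, Price, Weight) combination.
mask = shown.duplicated(subset=cols)
shown.loc[mask, cols] = ""

print(shown)
```

The blanked frame is only suitable for display; keep df for any further computation.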

For the second output, use pivot_table:

pd.pivot_table(df,index=['Product','Price','Weight'],columns='Range',values='Count').reset_index()

Output:

Range Product  Price  Weight   1-3   4-7  8-15  >=16
0           A     40      20  20.0  23.0   NaN   NaN
1           B     20      73  54.0  43.0  34.0  12.0
2           C     10      20   NaN  22.0   NaN   NaN
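By default pivot_table aggregates repeated keys with the mean and leaves missing combinations as NaN. If sums and zeros are preferred, the aggfunc and fill_value parameters can be set explicitly. This is a sketch on the question's sample data; the name wide is hypothetical.

```python
import pandas as pd

# Sample data rebuilt from the question (dtypes are an assumption).
df = pd.DataFrame({
    "Product": ["A", "A", "B", "B", "B", "B", "C"],
    "Price":   [40, 40, 20, 20, 20, 20, 10],
    "Weight":  [20, 20, 73, 73, 73, 73, 20],
    "Range":   ["1-3", "4-7", "1-3", "4-7", "8-15", ">=16", "4-7"],
    "Count":   [20, 23, 54, 43, 34, 12, 22],
})

wide = pd.pivot_table(
    df,
    index=["Product", "Price", "Weight"],
    columns="Range",
    values="Count",
    aggfunc="sum",   # sum repeated keys instead of averaging them
    fill_value=0,    # show 0 instead of NaN for missing ranges
).reset_index()

print(wide)
```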

edited Jun 6, 2018 at 5:31

answered Jun 6, 2018 at 5:21

Mohamed Thasin ah
