
I am looking to index the following pandas dataframe, shown here with sample values. The dataframe has a lot of duplicates.

ID      AccountName
83      CHRISTIAN UNIVERSITY
83      CHRISTIAN UNIVERSITY
83      CHRISTIAN UNIVERSITY
83      CHRISTIAN UNIVERSITY
104     UNIVERSITY
104     UNIVERSITY
1740    ELECTRIC CORPORATIO
1740    ELECTRIC CORPORATIO
1740    ELECTRIC CORPORATIO
1740    ELECTRIC CORPORATIO
...

The resulting dataframe should be the following.

  ID        index   AccountName
  83            1   CHRISTIAN UNIVERSITY
  83            1   CHRISTIAN UNIVERSITY
  83            1   CHRISTIAN UNIVERSITY
  83            1   CHRISTIAN UNIVERSITY
 104            2   UNIVERSITY
 104            2   UNIVERSITY
1740            3   ELECTRIC CORPORATIO
1740            3   ELECTRIC CORPORATIO
1740            3   ELECTRIC CORPORATIO
1740            3   ELECTRIC CORPORATIO
...

Does anyone have a fast and efficient way of doing this?

  • Does the order of the groups matter? Or just that duplicates all have the same group? Commented Jun 19, 2018 at 14:39
  • If you don't care about the order, you could use df.groupby('AccountName').ngroup() + 1 Commented Jun 19, 2018 at 14:42
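
The `ngroup` suggestion from the comment above can be sketched like this (a minimal sketch using the sample data from the question, grouping on `ID` rather than `AccountName` since the IDs are what identify the duplicates):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "ID": [83, 83, 83, 83, 104, 104, 1740, 1740, 1740, 1740],
    "AccountName": ["CHRISTIAN UNIVERSITY"] * 4
                   + ["UNIVERSITY"] * 2
                   + ["ELECTRIC CORPORATIO"] * 4,
})

# ngroup() numbers the groups 0, 1, 2, ... in the sort order of the
# group key, so add 1 to start counting at 1
df["index"] = df.groupby("ID").ngroup() + 1
```

Note that `groupby` sorts the keys by default, so like the accepted answer this numbers groups by ID order rather than by order of appearance.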

1 Answer


Assuming that you want an increasing index for each new ID, I'd do:

In [43]: df["number"] = df.ID.rank(method='dense').astype(int)

In [44]: df
Out[44]: 
     ID           AccountName  number
0    83  CHRISTIAN UNIVERSITY       1
1    83  CHRISTIAN UNIVERSITY       1
2    83  CHRISTIAN UNIVERSITY       1
3    83  CHRISTIAN UNIVERSITY       1
4   104            UNIVERSITY       2
5   104            UNIVERSITY       2
6  1740   ELECTRIC CORPORATIO       3
7  1740   ELECTRIC CORPORATIO       3
8  1740   ELECTRIC CORPORATIO       3
9  1740   ELECTRIC CORPORATIO       3

which gives the lowest ID the number 1, the second lowest 2, and so on, independent of the order in which they actually appear in the frame (e.g. if you put ELECTRIC CORPORATIO second, it still gets #3, because 1740 is the third-lowest ID).

There are other ways if you can be guaranteed that your clusters are contiguous, e.g.

(~df["ID"].duplicated()).cumsum()

but that's much less reliable in general than mapping a unique ID to a unique number, IMHO.
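
A quick illustration of that caveat (a minimal sketch; the frame variables are my own): the `cumsum` trick only increments on the first row of each contiguous run, so if an ID reappears later it silently inherits the previous run's number, whereas the dense rank keeps mapping each unique ID to the same number.

```python
import pandas as pd

# Contiguous clusters: the cumsum of "not duplicated" works fine
df = pd.DataFrame({"ID": [83, 83, 104, 1740, 1740]})
df["number"] = (~df["ID"].duplicated()).cumsum()
# -> 1, 1, 2, 3, 3

# Non-contiguous clusters: 83 reappears after 104
df2 = pd.DataFrame({"ID": [83, 83, 104, 83]})
# duplicated() marks the trailing 83 as a duplicate, so its cumsum
# value is stuck at whatever the previous row had (here 2, not 1),
# colliding with 104's number
df2["number"] = (~df2["ID"].duplicated()).cumsum()
# the dense rank still maps 83 -> 1 everywhere
df2["rank"] = df2["ID"].rank(method="dense").astype(int)
```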

Also, I've used "number" here as the column name rather than "index", because that causes confusion between the frame's index and your column named "index".

