
I am looking to index the following pandas dataframe, shown here with sample values. The dataframe has a lot of duplicates.

ID      AccountName
83      CHRISTIAN UNIVERSITY
83      CHRISTIAN UNIVERSITY
83      CHRISTIAN UNIVERSITY
83      CHRISTIAN UNIVERSITY
104     UNIVERSITY
104     UNIVERSITY
1740    ELECTRIC CORPORATIO
1740    ELECTRIC CORPORATIO
1740    ELECTRIC CORPORATIO
1740    ELECTRIC CORPORATIO
...

The resulting dataframe should be the following.

  ID        index   AccountName
  83            1   CHRISTIAN UNIVERSITY
  83            1   CHRISTIAN UNIVERSITY
  83            1   CHRISTIAN UNIVERSITY
  83            1   CHRISTIAN UNIVERSITY
 104            2   UNIVERSITY
 104            2   UNIVERSITY
1740            3   ELECTRIC CORPORATIO
1740            3   ELECTRIC CORPORATIO
1740            3   ELECTRIC CORPORATIO
1740            3   ELECTRIC CORPORATIO
...

Does anyone have a fast and efficient way of doing this?

  • Does the order of the groups matter? Or just that duplicates all have the same group? Commented Jun 19, 2018 at 14:39
  • If you don't care about the order, you could use df.groupby('AccountName').ngroup() + 1 Commented Jun 19, 2018 at 14:42
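
The `ngroup` suggestion from the comment above can be sketched like this (a minimal sketch using the sample data from the question, grouping on `ID` rather than `AccountName` since the IDs are what identify the duplicates):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "ID": [83, 83, 83, 83, 104, 104, 1740, 1740, 1740, 1740],
    "AccountName": ["CHRISTIAN UNIVERSITY"] * 4
                   + ["UNIVERSITY"] * 2
                   + ["ELECTRIC CORPORATIO"] * 4,
})

# ngroup() numbers the groups 0, 1, 2, ... in the sort order of the
# group key, so add 1 to start counting at 1
df["index"] = df.groupby("ID").ngroup() + 1
```

Note that `groupby` sorts the keys by default, so like the accepted answer this numbers groups by ID order rather than by order of appearance.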

1 Answer


Assuming that you want an increasing index for each new ID, I'd do:

In [43]: df["number"] = df.ID.rank(method='dense').astype(int)

In [44]: df
Out[44]: 
     ID           AccountName  number
0    83  CHRISTIAN UNIVERSITY       1
1    83  CHRISTIAN UNIVERSITY       1
2    83  CHRISTIAN UNIVERSITY       1
3    83  CHRISTIAN UNIVERSITY       1
4   104            UNIVERSITY       2
5   104            UNIVERSITY       2
6  1740   ELECTRIC CORPORATIO       3
7  1740   ELECTRIC CORPORATIO       3
8  1740   ELECTRIC CORPORATIO       3
9  1740   ELECTRIC CORPORATIO       3

which gives the lowest ID the number 1, the second lowest 2, and so on, independent of the order in which they actually appear in the frame (e.g. if you put ELECTRIC CORPORATIO second, it still gets #3, because 1740 is the third-lowest ID).

There are other ways if you can be guaranteed that your clusters are contiguous, e.g.

(~df["ID"].duplicated()).cumsum()

but that's much less reliable in general than mapping a unique ID to a unique number, IMHO.
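
A quick illustration of that caveat (a minimal sketch; the frame variables are my own): the `cumsum` trick only increments on the first row of each contiguous run, so if an ID reappears later it silently inherits the previous run's number, whereas the dense rank keeps mapping each unique ID to the same number.

```python
import pandas as pd

# Contiguous clusters: the cumsum of "not duplicated" works fine
df = pd.DataFrame({"ID": [83, 83, 104, 1740, 1740]})
df["number"] = (~df["ID"].duplicated()).cumsum()
# -> 1, 1, 2, 3, 3

# Non-contiguous clusters: 83 reappears after 104
df2 = pd.DataFrame({"ID": [83, 83, 104, 83]})
# duplicated() marks the trailing 83 as a duplicate, so its cumsum
# value is stuck at whatever the previous row had (here 2, not 1),
# colliding with 104's number
df2["number"] = (~df2["ID"].duplicated()).cumsum()
# the dense rank still maps 83 -> 1 everywhere
df2["rank"] = df2["ID"].rank(method="dense").astype(int)
```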

Also, I've used "number" here as the column name rather than "index", because that causes confusion between the frame's index and your column named "index".

