
I am using Pandas to read a file in this format:

import pandas

fp = pandas.read_table("Measurements.txt", sep=",")
fp.head()

"Aaron", 3, 5, 7  
"Aaron", 3, 6, 9  
"Aaron", 3, 6, 10 
"Brave", 4, 6, 0 
"Brave", 3, 6, 1

I want to replace each name with a unique ID so output looks like:

"1", 3, 5, 7 
"1", 3, 6, 9 
"1", 3, 6, 10 
"2", 4, 6, 0 
"2", 3, 6, 1

How can I do that?

Thanks!

5 Answers


I would make use of the categorical dtype:

In [97]: x['ID'] = x.name.astype('category').cat.rename_categories(range(1, x.name.nunique()+1))

In [98]: x
Out[98]:
    name  v1  v2  v3 ID
0  Aaron   3   5   7  1
1  Aaron   3   6   9  1
2  Aaron   3   6  10  1
3  Brave   4   6   0  2
4  Brave   3   6   1  2

If you need string IDs instead of numeric ones, you can use:

x.name.astype('category').cat.rename_categories([str(x) for x in range(1,x.name.nunique()+1)])

Or, as @MedAli mentioned in his answer, you can use the factorize() method. Demo:

In [141]: x['cat'] = pd.Categorical((pd.factorize(x.name)[0] + 1).astype(str))

In [142]: x
Out[142]:
    name  v1  v2  v3 ID cat
0  Aaron   3   5   7  1   1
1  Aaron   3   6   9  1   1
2  Aaron   3   6  10  1   1
3  Brave   4   6   0  2   2
4  Brave   3   6   1  2   2

In [143]: x.dtypes
Out[143]:
name      object
v1         int64
v2         int64
v3         int64
ID      category
cat     category
dtype: object

In [144]: x['cat'].cat.categories
Out[144]: Index(['1', '2'], dtype='object')

Or, with the categories as integers:

In [154]: x['cat'] = pd.Categorical((pd.factorize(x.name)[0] + 1))

In [155]: x
Out[155]:
    name  v1  v2  v3 ID cat
0  Aaron   3   5   7  1   1
1  Aaron   3   6   9  1   1
2  Aaron   3   6  10  1   1
3  Brave   4   6   0  2   2
4  Brave   3   6   1  2   2

In [156]: x['cat'].cat.categories
Out[156]: Int64Index([1, 2], dtype='int64')

Explanation:

In [99]: x.name.astype('category')
Out[99]:
0    Aaron
1    Aaron
2    Aaron
3    Brave
4    Brave
Name: name, dtype: category
Categories (2, object): [Aaron, Brave]

In [100]: x.name.astype('category').cat.categories
Out[100]: Index(['Aaron', 'Brave'], dtype='object')

In [101]: x.name.astype('category').cat.rename_categories([1,2])
Out[101]:
0    1
1    1
2    1
3    2
4    2
dtype: category
Categories (2, int64): [1, 2]

Explanation for the factorize() method:

In [157]: (pd.factorize(x.name)[0] + 1)
Out[157]: array([1, 1, 1, 2, 2])

In [158]: pd.Categorical((pd.factorize(x.name)[0] + 1))
Out[158]:
[1, 1, 1, 2, 2]
Categories (2, int64): [1, 2]

1 Comment

Thank you - experimenting with your suggestions now! I have hacked together a function which does the job for now, but your code seems like a more elegant solution.

You can do that via a simple dictionary mapping. Say, for instance, your data looks like this:

col1, col2, col3, col4
"Aaron", 3, 5, 7  
"Aaron", 3, 6, 9  
"Aaron", 3, 6, 10 
"Brave", 4, 6, 0 
"Brave", 3, 6, 1

then simply do

myDict = {"Aaron":"1", "Brave":"2"}
fp["col1"] = fp["col1"].map(myDict)

If you don't want to construct a dictionary by hand, use pandas.factorize, which takes care of encoding the column for you, starting from 0. You can find an example of how to use it here.
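A minimal sketch of the factorize approach, assuming the name column is called col1 (the question does not show column names) and 1-based IDs are wanted:

```python
import pandas as pd

# sample frame standing in for the parsed file
fp = pd.DataFrame({"col1": ["Aaron", "Aaron", "Aaron", "Brave", "Brave"],
                   "col2": [3, 3, 3, 4, 3]})

# factorize returns (codes, uniques); codes start at 0, so add 1
fp["col1"] = pd.factorize(fp["col1"])[0] + 1
print(fp["col1"].tolist())  # [1, 1, 1, 2, 2]
```

IDs are assigned in order of first appearance, so every "Aaron" row gets 1 and every "Brave" row gets 2.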

1 Comment

Thank you! Number of names is pretty large, so I am experimenting with factorize now. I have hacked together a function which does the job for now, but pandas.factorize seems like a more elegant solution.

Why not use a hash of the name?

import hashlib

df["col0"] = df["col0"].apply(lambda x: hashlib.sha256(x.encode("utf-8")).hexdigest())

This way you do not need to care which names occur, i.e. you do not need to know them up front to build a mapping dictionary.
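If the consumer needs numeric IDs (as the asker mentions below), the hex digest can be reduced to an integer. A sketch, with a hypothetical name_to_id helper; note that truncating to 8 decimal digits, done here only for readability, increases the collision risk:

```python
import hashlib

import pandas as pd

def name_to_id(name: str) -> int:
    # SHA-256 is stable across runs, unlike the built-in hash()
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    return int(digest, 16) % 10**8  # truncate to 8 decimal digits

df = pd.DataFrame({"col0": ["Aaron", "Brave"]})
df["col0"] = df["col0"].apply(name_to_id)
```

The same name always maps to the same number, in any process, with no lookup table to maintain.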

1 Comment

Thank you - I considered that, but the library I am using requires simple numeric values.

It looks like Replace all occurrences of a string in a pandas dataframe might hold your answer. According to the documentation, pandas.read_table creates a DataFrame, and a DataFrame has a replace method.

fp.replace({'Aaron': '1'}, regex=True)

Although you probably don't need the regex=True part, since it's a direct full-string replacement.
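For more than a handful of names, the replacement mapping can be built from the unique values instead of spelled out one by one. A sketch, assuming the name column is called col1:

```python
import pandas as pd

fp = pd.DataFrame({"col1": ["Aaron", "Aaron", "Brave"],
                   "col2": [3, 3, 4]})

# one string ID per distinct name, numbered in order of first appearance
mapping = {name: str(i) for i, name in enumerate(fp["col1"].unique(), start=1)}
fp["col1"] = fp["col1"].replace(mapping)
print(fp["col1"].tolist())  # ['1', '1', '2']
```

This keeps the replace-based approach but scales to the >1000 names the asker mentions.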

1 Comment

Thank you - I have been doing this for fewer names, but in this case number of names is >1000 so one-by-one replacement is not an option. I have hacked together a function which does the job for now, but pandas.factorize seems like a more elegant solution.

This works, with caveats:

df = pd.DataFrame({"string_column": ["string1", "string2"]})
df["hash"] = [hash(i) for i in df["string_column"]]
df
Out[1]: 
  string_column                 hash
0       string1 -2164478207308662971
1       string2 -3208847000100121065

And the caveat: the hash is not 100% guaranteed to be unique; there is a small chance that two different strings could have the same hash. However, the hash space is 64 bits wide, so collisions are very unlikely for a modest number of names. Note also that Python's built-in hash() for strings is randomized per interpreter run, so these IDs are not stable across runs unless PYTHONHASHSEED is fixed.

The more robust way is to determine all unique values in the column, assign an incrementing number to each, and then map every value in the column to its ID. Done by iterating row by row this is slow; the one-liner above is faster.
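The unique-then-map approach described above can be sketched without a row-by-row loop, using the sample column names from this answer:

```python
import pandas as pd

df = pd.DataFrame({"string_column": ["string1", "string2", "string1"]})

# assign an incrementing number to each unique value, then map the column
ids = {val: i for i, val in enumerate(df["string_column"].unique(), start=1)}
df["id"] = df["string_column"].map(ids)
print(df["id"].tolist())  # [1, 2, 1]
```

Series.map does the per-row lookup in vectorized fashion, so this avoids the slowness of an explicit Python loop while still guaranteeing unique IDs.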

