
I am using Pandas to read a file in this format:

import pandas

fp = pandas.read_table("Measurements.txt", sep=",")
fp.head()

"Aaron", 3, 5, 7  
"Aaron", 3, 6, 9  
"Aaron", 3, 6, 10 
"Brave", 4, 6, 0 
"Brave", 3, 6, 1

I want to replace each name with a unique ID so output looks like:

"1", 3, 5, 7 
"1", 3, 6, 9 
"1", 3, 6, 10 
"2", 4, 6, 0 
"2", 3, 6, 1

How can I do that?

Thanks!

5 Answers


I would make use of the categorical dtype:

In [97]: x['ID'] = x.name.astype('category').cat.rename_categories(range(1, x.name.nunique()+1))

In [98]: x
Out[98]:
    name  v1  v2  v3 ID
0  Aaron   3   5   7  1
1  Aaron   3   6   9  1
2  Aaron   3   6  10  1
3  Brave   4   6   0  2
4  Brave   3   6   1  2

If you need string IDs instead of numeric ones, you can use:

x.name.astype('category').cat.rename_categories([str(x) for x in range(1,x.name.nunique()+1)])

Or, as @MedAli mentioned in his answer, you can use the factorize() method. Demo:

In [141]: x['cat'] = pd.Categorical((pd.factorize(x.name)[0] + 1).astype(str))

In [142]: x
Out[142]:
    name  v1  v2  v3 ID cat
0  Aaron   3   5   7  1   1
1  Aaron   3   6   9  1   1
2  Aaron   3   6  10  1   1
3  Brave   4   6   0  2   2
4  Brave   3   6   1  2   2

In [143]: x.dtypes
Out[143]:
name      object
v1         int64
v2         int64
v3         int64
ID      category
cat     category
dtype: object

In [144]: x['cat'].cat.categories
Out[144]: Index(['1', '2'], dtype='object')

Or, with the categories as integers:

In [154]: x['cat'] = pd.Categorical((pd.factorize(x.name)[0] + 1))

In [155]: x
Out[155]:
    name  v1  v2  v3 ID cat
0  Aaron   3   5   7  1   1
1  Aaron   3   6   9  1   1
2  Aaron   3   6  10  1   1
3  Brave   4   6   0  2   2
4  Brave   3   6   1  2   2

In [156]: x['cat'].cat.categories
Out[156]: Int64Index([1, 2], dtype='int64')

Explanation:

In [99]: x.name.astype('category')
Out[99]:
0    Aaron
1    Aaron
2    Aaron
3    Brave
4    Brave
Name: name, dtype: category
Categories (2, object): [Aaron, Brave]

In [100]: x.name.astype('category').cat.categories
Out[100]: Index(['Aaron', 'Brave'], dtype='object')

In [101]: x.name.astype('category').cat.rename_categories([1,2])
Out[101]:
0    1
1    1
2    1
3    2
4    2
dtype: category
Categories (2, int64): [1, 2]

Explanation for the factorize() method:

In [157]: (pd.factorize(x.name)[0] + 1)
Out[157]: array([1, 1, 1, 2, 2])

In [158]: pd.Categorical((pd.factorize(x.name)[0] + 1))
Out[158]:
[1, 1, 1, 2, 2]
Categories (2, int64): [1, 2]

1 Comment

Thank you - experimenting with your suggestions now! I have hacked together a function which does the job for now, but your code seems like a more elegant solution.

You can do that via a simple dictionary mapping. Say, for instance, your data looks like this:

col1, col2, col3, col4
"Aaron", 3, 5, 7  
"Aaron", 3, 6, 9  
"Aaron", 3, 6, 10 
"Brave", 4, 6, 0 
"Brave", 3, 6, 1

then simply do

myDict = {"Aaron":"1", "Brave":"2"}
fp["col1"] = fp["col1"].map(myDict)

If you don't want to construct a dictionary by hand, use pandas.factorize, which takes care of encoding the column for you, starting from 0. You can find an example of how to use it here.
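A minimal sketch of the factorize approach, assuming the name column is called col1 (the question does not show column names) and 1-based IDs are wanted:

```python
import pandas as pd

# sample frame standing in for the parsed file
fp = pd.DataFrame({"col1": ["Aaron", "Aaron", "Aaron", "Brave", "Brave"],
                   "col2": [3, 3, 3, 4, 3]})

# factorize returns (codes, uniques); codes start at 0, so add 1
fp["col1"] = pd.factorize(fp["col1"])[0] + 1
print(fp["col1"].tolist())  # [1, 1, 1, 2, 2]
```

IDs are assigned in order of first appearance, so every "Aaron" row gets 1 and every "Brave" row gets 2.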

1 Comment

Thank you! Number of names is pretty large, so I am experimenting with factorize now. I have hacked together a function which does the job for now, but pandas.factorize seems like a more elegant solution.

Why not use a hash of the name?

import hashlib

df["col0"] = df["col0"].apply(lambda x: hashlib.sha256(x.encode("utf-8")).hexdigest())

This way you do not need to care which names occur, i.e. you do not need to know them up front to build a mapping dictionary.
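If the consumer needs numeric IDs (as the asker mentions below), the hex digest can be reduced to an integer. A sketch, with a hypothetical name_to_id helper; note that truncating to 8 decimal digits, done here only for readability, increases the collision risk:

```python
import hashlib

import pandas as pd

def name_to_id(name: str) -> int:
    # SHA-256 is stable across runs, unlike the built-in hash()
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    return int(digest, 16) % 10**8  # truncate to 8 decimal digits

df = pd.DataFrame({"col0": ["Aaron", "Brave"]})
df["col0"] = df["col0"].apply(name_to_id)
```

The same name always maps to the same number, in any process, with no lookup table to maintain.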

1 Comment

Thank you - I considered that, but the library I am using requires simple numeric values.

It looks like Replace all occurrences of a string in a pandas dataframe might hold your answer. According to the documentation, pandas.read_table creates a DataFrame, and a DataFrame has a replace method.

fp.replace({'Aaron': '1'}, regex=True)

Although you probably don't need the regex=True part, since it's a direct full-string replacement.
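For more than a handful of names, the replacement mapping can be built from the unique values instead of spelled out one by one. A sketch, assuming the name column is called col1:

```python
import pandas as pd

fp = pd.DataFrame({"col1": ["Aaron", "Aaron", "Brave"],
                   "col2": [3, 3, 4]})

# one string ID per distinct name, numbered in order of first appearance
mapping = {name: str(i) for i, name in enumerate(fp["col1"].unique(), start=1)}
fp["col1"] = fp["col1"].replace(mapping)
print(fp["col1"].tolist())  # ['1', '1', '2']
```

This keeps the replace-based approach but scales to the >1000 names the asker mentions.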

1 Comment

Thank you - I have been doing this for fewer names, but in this case number of names is >1000 so one-by-one replacement is not an option. I have hacked together a function which does the job for now, but pandas.factorize seems like a more elegant solution.

This works, with caveats:

df = pd.DataFrame({"string_column": ["string1", "string2"]})
df["hash"] = [hash(i) for i in df["string_column"]]
df
Out[1]: 
  string_column                 hash
0       string1 -2164478207308662971
1       string2 -3208847000100121065

And the caveat: the hash is not 100% guaranteed to be unique; there is a small chance that two different strings could have the same hash. However, the hash space is 64 bits wide, so collisions are very unlikely for a modest number of names. Note also that Python's built-in hash() for strings is randomized per interpreter run, so these IDs are not stable across runs unless PYTHONHASHSEED is fixed.

The more robust way is to determine all unique values in the column, assign an incrementing number to each, and then map every value in the column to its ID. Done by iterating row by row this is slow; the one-liner above is faster.
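The unique-then-map approach described above can be sketched without a row-by-row loop, using the sample column names from this answer:

```python
import pandas as pd

df = pd.DataFrame({"string_column": ["string1", "string2", "string1"]})

# assign an incrementing number to each unique value, then map the column
ids = {val: i for i, val in enumerate(df["string_column"].unique(), start=1)}
df["id"] = df["string_column"].map(ids)
print(df["id"].tolist())  # [1, 2, 1]
```

Series.map does the per-row lookup in vectorized fashion, so this avoids the slowness of an explicit Python loop while still guaranteeing unique IDs.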

