11

If I have two columns as below:

Origin  Destination  
China   USA  
China   Turkey  
USA     China  
USA     Turkey  
USA     Russia  
Russia  China  

How would I perform label encoding so that each country gets the same label in the Origin column as in the Destination column, i.e.:

Origin  Destination  
0   1  
0   3  
1   0  
1   3  
1   2  
2   0  

If I encode each column separately, the algorithm will treat China in the Origin column as different from China in the Destination column, which is not the case.
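
To make the problem concrete, here is a minimal sketch. It assumes the table above is loaded into a pandas DataFrame called `df` (the name and the construction are mine, for illustration only). Encoding each column on its own numbers the countries by first appearance within that column, so China ends up with a different code in each one:

```python
import pandas as pd

# Sample data from the question; the variable name `df` is an assumption.
df = pd.DataFrame({
    "Origin": ["China", "China", "USA", "USA", "USA", "Russia"],
    "Destination": ["USA", "Turkey", "China", "Turkey", "Russia", "China"],
})

# Encode each column independently of the other.
origin_codes, origin_labels = pd.factorize(df["Origin"])
dest_codes, dest_labels = pd.factorize(df["Destination"])

# Look up the code that "China" received in each column.
china_in_origin = list(origin_labels).index("China")
china_in_dest = list(dest_labels).index("China")
print(china_in_origin, china_in_dest)
```

The two printed codes differ, which is exactly the mismatch I want to avoid.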

5 Answers

8

stack

df.stack().pipe(lambda s: pd.Series(pd.factorize(s.values)[0], s.index)).unstack()

   Origin  Destination
0       0            1
1       0            2
2       1            0
3       1            2
4       1            3
5       3            0

factorize with reshape

pd.DataFrame(
    pd.factorize(df.values.ravel())[0].reshape(df.shape),
    df.index, df.columns
)

   Origin  Destination
0       0            1
1       0            2
2       1            0
3       1            2
4       1            3
5       3            0
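
If the original country names are needed again later, it may help to keep the second value that `pd.factorize` returns. The following is a sketch under the assumption that `df` holds the sample data from the question; the names `labels`, `encoded` and `decoded` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Origin": ["China", "China", "USA", "USA", "USA", "Russia"],
    "Destination": ["USA", "Turkey", "China", "Turkey", "Russia", "China"],
})

# factorize returns the integer codes and the unique labels, numbered in
# order of first appearance when the frame is read row by row.
codes, labels = pd.factorize(df.to_numpy().ravel())

encoded = pd.DataFrame(codes.reshape(df.shape), index=df.index, columns=df.columns)

# Indexing the label array with the codes restores the original strings.
decoded = pd.DataFrame(labels[encoded.to_numpy()], index=df.index, columns=df.columns)
print(decoded)
```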

np.unique and reshape

pd.DataFrame(
    np.unique(df.values.ravel(), return_inverse=True)[1].reshape(df.shape),
    df.index, df.columns
)

   Origin  Destination
0       0            3
1       0            2
2       3            0
3       3            2
4       3            1
5       1            0
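
The numbers differ from the factorize output because `np.unique` numbers the countries in sorted order, while `pd.factorize` numbers them in order of first appearance. A small sketch of the difference, assuming the same sample `df`; passing `sort=True` to `pd.factorize` should reproduce the `np.unique` numbering:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Origin": ["China", "China", "USA", "USA", "USA", "Russia"],
    "Destination": ["USA", "Turkey", "China", "Turkey", "Russia", "China"],
})
flat = df.to_numpy().ravel()

# np.unique: labels come back sorted, codes index into that sorted array.
labels_sorted, codes_sorted = np.unique(flat, return_inverse=True)

# pd.factorize: labels come back in order of first appearance by default.
codes_seen, labels_seen = pd.factorize(flat)

# With sort=True, factorize numbers the labels in sorted order instead.
codes_fsorted, labels_fsorted = pd.factorize(flat, sort=True)

print(list(labels_sorted))
print(list(labels_seen))
```

Either numbering is a consistent encoding across both columns; only the integer assigned to each country changes.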

Disgusting Option

I couldn't stop trying stuff... sorry!

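import itertools

# The default arguments y and c are created once and shared across calls.
# Note: applymap is deprecated since pandas 2.1 in favour of DataFrame.map.
df.applymap(
    lambda x, y={}, c=itertools.count():
        y.get(x) if x in y else y.setdefault(x, next(c))
)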

   Origin  Destination
0       0            1
1       0            3
2       1            0
3       1            3
4       1            2
5       2            0
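
The lambda works because its default arguments act as hidden state. A more explicit sketch of the same idea follows; `make_encoder` and `encode` are hypothetical helper names of mine, not part of pandas, and `df` is assumed to hold the sample data.

```python
import itertools
import pandas as pd

df = pd.DataFrame({
    "Origin": ["China", "China", "USA", "USA", "USA", "Russia"],
    "Destination": ["USA", "Turkey", "China", "Turkey", "Russia", "China"],
})

def make_encoder():
    """Return a function that gives each new value the next unused integer."""
    seen = {}
    counter = itertools.count()

    def encode(value):
        if value not in seen:
            seen[value] = next(counter)
        return seen[value]

    return encode

encode = make_encoder()

# Series.map applies `encode` element by element, and DataFrame.apply walks
# the columns left to right, so Origin is numbered before Destination.
encoded = df.apply(lambda col: col.map(encode))
print(encoded)
```

Because the numbering proceeds column by column here, Russia is seen before Turkey, which is why the codes differ from the row-by-row factorize versions.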

As pointed out by cᴏʟᴅsᴘᴇᴇᴅ

You can shorten the factorize version by assigning back to the dataframe:

df[:] = pd.factorize(df.values.ravel())[0].reshape(df.shape)

2 Comments

You can shorten factorize: df[:] = pd.factorize(df.values.ravel())[0].reshape(df.shape)
Very true! But that goes against my general tendency to avoid overwriting the dataframe (-:
7

pandas Method

You could create a dictionary of {country: value} pairs and map the dataframe to that:

country_map = {country:i for i, country in enumerate(df.stack().unique())}

df['Origin'] = df['Origin'].map(country_map)    
df['Destination'] = df['Destination'].map(country_map)

>>> df
   Origin  Destination
0       0            1
1       0            2
2       1            0
3       1            2
4       1            3
5       3            0
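
One thing to keep in mind with a dictionary is that `Series.map` returns NaN for any value without an entry. A short sketch, assuming the sample `df`; the country "Brazil" and the names `inverse_map` and `new_routes` are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Origin": ["China", "China", "USA", "USA", "USA", "Russia"],
    "Destination": ["USA", "Turkey", "China", "Turkey", "Russia", "China"],
})

country_map = {country: i for i, country in enumerate(df.stack().unique())}
# Reverse lookup, to get the country names back from the integers.
inverse_map = {i: country for country, i in country_map.items()}

encoded_origin = df["Origin"].map(country_map)
decoded_origin = encoded_origin.map(inverse_map)

# "Brazil" never appeared in df, so it has no entry and maps to NaN.
new_routes = pd.Series(["China", "Brazil"])
encoded_new = new_routes.map(country_map)
print(encoded_new)
```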

sklearn method

Since you tagged sklearn, you could use LabelEncoder():

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(df.stack().unique())

df['Origin'] = le.transform(df['Origin'])
df['Destination'] = le.transform(df['Destination'])

>>> df
   Origin  Destination
0       0            3
1       0            2
2       3            0
3       3            2
4       3            1
5       1            0

To get the original labels back:

>>> le.inverse_transform(df['Origin'])
# array(['China', 'China', 'USA', 'USA', 'USA', 'Russia'], dtype=object)

5

You can use replace:

df.replace(dict(zip(np.unique(df.values),list(range(len(np.unique(df.values)))))))
   Origin  Destination
0       0            3
1       0            2
2       3            0
3       3            2
4       3            1
5       1            0

A more succinct version, suggested by Pir:

df.replace((lambda u: dict(zip(u, range(u.size))))(np.unique(df)))

And

df.replace(dict(zip(np.unique(df), itertools.count())))

3

Edit: I just found out about the return_inverse option of np.unique, so there is no need to search and substitute:

df.values[:] = np.unique(df, return_inverse=True)[1].reshape(-1,2)

You could use np.searchsorted, which is vectorized over the values being looked up (np.unique already returns a sorted array, so no extra sort is needed):

df.values[:] = np.searchsorted(np.unique(df), df)

Or you could create an array of one-hot encodings and recover the indices with argmax. Probably not a great idea if there are many countries, since the intermediate array has one entry per cell per country.

df.values[:] = (df.values[...,None] == np.unique(df)).argmax(-1)
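
Writing through `df.values[:]` relies on `.values` handing back a writable view of the frame's data, which, as far as I know, pandas does not guarantee in every version. A sketch that avoids the in-place write and checks that the three approaches agree, assuming the sample `df` (the `via_*` names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Origin": ["China", "China", "USA", "USA", "USA", "Russia"],
    "Destination": ["USA", "Turkey", "China", "Turkey", "Russia", "China"],
})

values = df.to_numpy()
labels = np.unique(values)  # sorted unique countries

# 1. Inverse indices straight from np.unique.
via_inverse = np.unique(values, return_inverse=True)[1].reshape(values.shape)
# 2. Binary search of every cell in the sorted label array.
via_search = np.searchsorted(labels, values)
# 3. One-hot comparison against every label, then argmax over the label axis.
via_onehot = (values[..., None] == labels).argmax(-1)

# Build a new integer frame instead of overwriting df.
encoded = pd.DataFrame(via_inverse, index=df.index, columns=df.columns)
print(encoded)
```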

0

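Using LabelEncoder from sklearn, you can also try:

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(df.values.flatten())

# Use transform, not fit_transform: fit_transform would refit the encoder
# on each column separately and give inconsistent labels across columns.
df = df.apply(le.transform)
print(df)

Result:

   Origin  Destination
0       0            3
1       0            2
2       3            0
3       3            2
4       3            1
5       1            0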

If the dataframe has more columns and you only want to encode some of them, you can try:

from sklearn import preprocessing
le = preprocessing.LabelEncoder()

# columns to select for encoding
selected_col = ['Origin','Destination']
le.fit(df[selected_col].values.flatten())

# transform (not fit_transform) keeps one shared mapping for all selected columns
df[selected_col] = df[selected_col].apply(le.transform)
print(df)
