Label encoding across multiple columns with same attributes in scikit-learn

Question

If I have two columns as below:

Origin  Destination  
China   USA  
China   Turkey  
USA     China  
USA     Turkey  
USA     Russia  
Russia  China

How would I perform label encoding while ensuring that each country gets the same label in the Origin column as in the Destination column, i.e.:

Origin  Destination  
0   1  
0   3  
1   0  
1   3  
1   2  
2   0

If I encode each column separately, the algorithm will treat China in the first column as different from China in the second column, which is not the case.

piRSquared · Accepted Answer · 2018-05-10 02:52:28Z

8

`stack`

df.stack().pipe(lambda s: pd.Series(pd.factorize(s.values)[0], s.index)).unstack()

   Origin  Destination
0       0            1
1       0            2
2       1            0
3       1            2
4       1            3
5       3            0

`factorize` with `reshape`

pd.DataFrame(
    pd.factorize(df.values.ravel())[0].reshape(df.shape),
    df.index, df.columns
)

   Origin  Destination
0       0            1
1       0            2
2       1            0
3       1            2
4       1            3
5       3            0
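
As a minimal sketch of the same idea, assuming the six-row frame from the question, the second value returned by `pd.factorize` (the unique labels) can be kept so the codes can be turned back into country names later. The names `codes`, `uniques`, `encoded` and `decoded` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Origin": ["China", "China", "USA", "USA", "USA", "Russia"],
    "Destination": ["USA", "Turkey", "China", "Turkey", "Russia", "China"],
})

# factorize returns (codes, uniques); codes follow first appearance,
# scanning the frame row by row because ravel() is row-major
codes, uniques = pd.factorize(df.to_numpy().ravel())
encoded = pd.DataFrame(codes.reshape(df.shape), index=df.index, columns=df.columns)

# uniques[code] is the original label, so decoding is plain integer indexing
decoded = pd.DataFrame(uniques[encoded.to_numpy()], index=df.index, columns=df.columns)
```

Keeping `uniques` around plays the same role as `inverse_transform` does for a fitted `LabelEncoder`.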

`np.unique` and `reshape`

pd.DataFrame(
    np.unique(df.values.ravel(), return_inverse=True)[1].reshape(df.shape),
    df.index, df.columns
)

   Origin  Destination
0       0            3
1       0            2
2       3            0
3       3            2
4       3            1
5       1            0

Disgusting Option

I couldn't stop trying stuff... sorry!

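import itertools

# The mutable defaults y and c persist across calls, so every cell shares one mapping
df.applymap(
    lambda x, y={}, c=itertools.count():
        y.get(x) if x in y else y.setdefault(x, next(c))
)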

   Origin  Destination
0       0            1
1       0            3
2       1            0
3       1            3
4       1            2
5       2            0
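
For readers who find the mutable-default trick hard to follow, here is a hedged sketch of the same first-seen numbering written as an explicit closure. `make_first_seen_encoder` is a hypothetical helper, not part of pandas, and the sketch uses `Series.map` per column instead of `applymap` (which newer pandas versions deprecate in favour of `DataFrame.map`).

```python
import pandas as pd

df = pd.DataFrame({
    "Origin": ["China", "China", "USA", "USA", "USA", "Russia"],
    "Destination": ["USA", "Turkey", "China", "Turkey", "Russia", "China"],
})

def make_first_seen_encoder():
    """Return (encode, seen): encode assigns 0, 1, 2, ... in order of first appearance."""
    seen = {}

    def encode(value):
        # len(seen) is the next unused code; setdefault stores it only for new values
        return seen.setdefault(value, len(seen))

    return encode, seen

encode, mapping = make_first_seen_encoder()

# Columns are visited left to right, so all of Origin is numbered before Destination
encoded = df.apply(lambda col: col.map(encode))
```

Because the columns are processed one after another, the numbering is column-major, which is why Russia receives a smaller code than Turkey here.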

As pointed out by cᴏʟᴅsᴘᴇᴇᴅ

You can shorten this by assigning back to dataframe

df[:] = pd.factorize(df.values.ravel())[0].reshape(df.shape)

edited May 10, 2018 at 2:52

answered May 10, 2018 at 1:56


2 Comments

cs95 Over a year ago

You can shorten factorize: df[:] = pd.factorize(df.values.ravel())[0].reshape(df.shape)

piRSquared Over a year ago

Very true! But that goes against my general tendency to avoid overwriting the dataframe (-:

sacuL · Accepted Answer · 2018-05-10 02:15:01Z

pandas Method

You could create a dictionary of {country: value} pairs and map the dataframe to that:

country_map = {country:i for i, country in enumerate(df.stack().unique())}

df['Origin'] = df['Origin'].map(country_map)    
df['Destination'] = df['Destination'].map(country_map)

>>> df
   Origin  Destination
0       0            1
1       0            2
2       1            0
3       1            2
4       1            3
5       3            0

sklearn method

Since you tagged sklearn, you could use LabelEncoder():

from sklearn.preprocessing import LabelEncoder
le= LabelEncoder()
le.fit(df.stack().unique())

df['Origin'] = le.transform(df['Origin'])
df['Destination'] = le.transform(df['Destination'])

>>> df
   Origin  Destination
0       0            3
1       0            2
2       3            0
3       3            2
4       3            1
5       1            0

To get the original labels back:

>>> le.inverse_transform(df['Origin'])
# array(['China', 'China', 'USA', 'USA', 'USA', 'Russia'], dtype=object)
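
To see which code each country received, the fitted encoder's `classes_` attribute can be inspected. This is a small sketch assuming the six-row frame from the question; `code_of` is just an illustrative name.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Origin": ["China", "China", "USA", "USA", "USA", "Russia"],
    "Destination": ["USA", "Turkey", "China", "Turkey", "Russia", "China"],
})

# Fit once on the values of both columns so they share a single vocabulary
le = LabelEncoder().fit(df.to_numpy().ravel())

# classes_ holds the sorted labels; a label's position in it is its code
code_of = {label: code for code, label in enumerate(le.classes_)}

usa_china = le.transform(["USA", "China"])
```

Note that `transform` raises a `ValueError` for a label that was not present when the encoder was fitted, so the encoder should be fitted on every value it will later see.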

BENY · Accepted Answer · 2018-05-10 02:31:49Z

5

You can use replace

df.replace(dict(zip(np.unique(df.values),list(range(len(np.unique(df.values)))))))
   Origin  Destination
0       0            3
1       0            2
2       3            0
3       3            2
4       3            1
5       1            0

Succinct and nice answer from Pir

df.replace((lambda u: dict(zip(u, range(u.size))))(np.unique(df)))

And

df.replace(dict(zip(np.unique(df), itertools.count())))

edited May 10, 2018 at 2:31

answered May 10, 2018 at 2:08


2 Comments

piRSquared Over a year ago

df.replace((lambda u: dict(zip(u, range(u.size))))(np.unique(df)))

piRSquared Over a year ago

Even better df.replace(dict(zip(np.unique(df), itertools.count())))

hilberts_drinking_problem · Accepted Answer · 2018-05-10 02:42:44Z

3

Edit: just found out about return_inverse option to np.unique. No need to search and substitute!

df.values[:] = np.unique(df, return_inverse=True)[1].reshape(-1,2)

You could leverage the vectorized version of np.searchsorted (np.unique already returns its values sorted, so no extra np.sort is needed) with

df.values[:] = np.searchsorted(np.unique(df), df)

Or you could create an array of one-hot encodings and recover indices with argmax. Probably not a great idea if there are many countries.

df.values[:] = (df.values[...,None] == np.unique(df)).argmax(-1)
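
Writing through `df.values[:]` relies on `.values` being a writable view of the frame, which may not hold for mixed-dtype frames or when pandas Copy-on-Write is enabled. A hedged alternative, assuming the six-row frame from the question, is to build a new frame from the `np.searchsorted` result instead:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Origin": ["China", "China", "USA", "USA", "USA", "Russia"],
    "Destination": ["USA", "Turkey", "China", "Turkey", "Russia", "China"],
})

values = df.to_numpy()

# np.unique flattens its input and returns the distinct labels in sorted order
labels = np.unique(values)

# searchsorted gives each value's position in the sorted labels, keeping the 2-D shape
codes = np.searchsorted(labels, values)

encoded = pd.DataFrame(codes, index=df.index, columns=df.columns)
```

The original `df` is left untouched, and `labels[codes]` recovers the country names if they are needed again.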

edited May 10, 2018 at 2:42

answered May 10, 2018 at 2:21


niraj · Accepted Answer · 2018-05-10 02:31:05Z

0

Using LabelEncoder from sklearn, fit the encoder once on the values of all columns, then apply transform to each column. (Applying fit_transform instead would refit the encoder on each column separately, so the same country could get different codes in different columns.)

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(df.values.flatten())

df = df.apply(le.transform)
print(df)

Result:

   Origin  Destination
0       0            3
1       0            2
2       3            0
3       3            2
4       3            1
5       1            0

If you have more columns and only want to encode selected columns of the dataframe, you can try:

from sklearn import preprocessing
le = preprocessing.LabelEncoder()

# columns to select for encoding
selected_col = ['Origin','Destination']
le.fit(df[selected_col].values.flatten())

df[selected_col] = df[selected_col].apply(le.transform)
print(df)
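
A quick check, assuming the six-row frame from the question, illustrates why it matters whether the encoder is refitted per column or fitted once and then only used to transform. The names `per_column` and `shared` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Origin": ["China", "China", "USA", "USA", "USA", "Russia"],
    "Destination": ["USA", "Turkey", "China", "Turkey", "Russia", "China"],
})
cols = ["Origin", "Destination"]

# Refitting on each column: every column gets its own sorted vocabulary,
# so USA is 2 in Origin (no Turkey there) but 3 in Destination
per_column = df[cols].apply(lambda col: LabelEncoder().fit_transform(col))

# Fitting once on all values, then only transforming: one shared vocabulary
le = LabelEncoder().fit(df[cols].to_numpy().ravel())
shared = df[cols].apply(le.transform)
```

In `per_column` the code for USA differs between the two columns, while in `shared` it is 3 in both, which is the behaviour the question asks for.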

edited May 10, 2018 at 2:31

answered May 10, 2018 at 2:25
