How to encode a dataframe in Python from strings to integers?

Question

I have a dataframe that consists of 2 columns - df[['c1','c2']] In those columns there are only 3 unique string values - a, b and c. I would like to convert those values into 3 numbers to perform data analysis. I think it should be a map or a dictionary, but I keep getting errors.

Shahab Rahnama · Accepted Answer · 2022-09-11 14:52:51Z


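You can use pandas.factorize. It returns a tuple of (integer codes, unique values), and the codes are assigned in order of first appearance, so element [0] is the encoded column.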

Example:

import pandas as pd

data = {
    'c1': ['a', 'b', 'c'],
    'c2': ['a', 'b', 'c']
}

df = pd.DataFrame(data)
df['c1'] = pd.factorize(df['c1'])[0]
df['c2'] = pd.factorize(df['c2'])[0]
df

Output:

   c1  c2
0   0   0
1   1   1
2   2   2
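Since the question mentions a map or a dictionary, here is a minimal sketch of that route. It assumes you want the same code for a given letter in both columns, which factorize does not guarantee because it numbers values per column in order of first appearance. The name `mapping` and the particular integers are illustrative choices, not something from the question.

```python
import pandas as pd

df = pd.DataFrame({"c1": ["a", "b", "c"], "c2": ["c", "a", "b"]})

# Hypothetical mapping; the integers chosen here are arbitrary.
mapping = {"a": 0, "b": 1, "c": 2}

# Apply the same dictionary to each column so the codes agree across columns.
encoded = df[["c1", "c2"]].apply(lambda col: col.map(mapping))
print(encoded)
```

Any value missing from the dictionary becomes NaN, which can be a useful way to spot unexpected strings.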


Ahmad Anis · Accepted Answer · 2022-09-11 14:56:03Z

You do not necessarily need to convert them to integers to perform data analysis. You only need to convert to a format that suits the kind of analysis you are doing. Take this example:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame(
    {
        "c1": ["a", "a", "a", "b", "c", "c", "c", "c"],
        "c2": ["a", "b", "a", "a", "a", "b", "b", "b"],
    }
)

You can do a distribution plot via value_counts

fig, ax = plt.subplots(1, 2, figsize=(10, 5))
df["c1"].value_counts().plot(kind="bar", ax=ax[0])
df["c2"].value_counts().plot(kind="bar", ax=ax[1])

plt.show()

Or you can do a frequency chart via pie as follows

fig, ax = plt.subplots(1, 2, figsize=(10, 5))
df["c1"].value_counts().plot(kind="pie", ax=ax[0])
df["c2"].value_counts().plot(kind="pie", ax=ax[1])
plt.show()

Or, if you are working with seaborn, it is even easier because countplot accepts the string columns directly, with no conversion involved at all.

fig, ax = plt.subplots(1, 2, figsize=(10, 5))
sns.countplot(x="c1", data=df, ax=ax[0])
sns.countplot(x="c2", data=df, ax=ax[1])
plt.show()

Or you can do a scatter plot like this

fig, ax = plt.subplots(1, 1, figsize=(10, 5))
sns.scatterplot(x="c2", y="c1", data=df, ax=ax)
plt.show()

That said, none of this makes your data ready for a machine learning model, so you'll need to use OneHotEncoder or LabelEncoder from sklearn to convert it to a numeric form.

You can do it with sklearn as follows.

For example with LabelEncoder,

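from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# fit_transform refits the encoder on each column; codes follow sorted label order
df["c1"] = le.fit_transform(df["c1"])
df["c2"] = le.fit_transform(df["c2"])
print(df)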

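This maps a, b and c to integers (LabelEncoder sorts the labels, so a becomes 0, b becomes 1 and c becomes 2), and the result will be

   c1  c2
0   0   0
1   0   1
2   0   0
3   1   0
4   2   0
5   2   1
6   2   1
7   2   1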
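OneHotEncoder is mentioned above but not shown, so here is a minimal sketch of it, assuming scikit-learn is installed and reusing the same eight-row example data. The variable names `enc` and `onehot` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(
    {
        "c1": ["a", "a", "a", "b", "c", "c", "c", "c"],
        "c2": ["a", "b", "a", "a", "a", "b", "b", "b"],
    }
)

enc = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
onehot = enc.fit_transform(df[["c1", "c2"]]).toarray()

# One array of categories per input column, in sorted order.
print(enc.categories_)
# One output column per (input column, category) pair.
print(onehot.shape)
```

Unlike integer codes, a one-hot encoding does not imply any ordering between a, b and c, which is usually what you want for nominal features.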
