I have a dataframe with two columns, df[['c1', 'c2']]. Those columns contain only three unique string values: a, b and c. I would like to convert those values into three numbers so that I can perform data analysis. I think it should be a map or a dictionary, but I keep getting errors.

2 Answers

You can use pandas.factorize, which encodes each distinct value as an integer.

Example:

import pandas as pd

data = {
    'c1': ['a', 'b', 'c'],
    'c2': ['a', 'b', 'c']
}

df = pd.DataFrame(data)
# factorize returns (codes, uniques); [0] keeps only the integer codes
df['c1'] = pd.factorize(df['c1'])[0]
df['c2'] = pd.factorize(df['c2'])[0]
df

Output:

    c1  c2
0   0   0
1   1   1
2   2   2
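
Since the question mentions a map or a dictionary, here is a minimal sketch of that route. factorize numbers values in order of first appearance, so factorizing each column on its own can give the same letter different codes in c1 and c2 when the rows are not in the same order. An explicit dictionary avoids that. The sample rows and the numbers in mapping are assumptions for illustration, not taken from the question.

```python
import pandas as pd

# Illustrative data: the two columns list a, b, c in different orders.
df = pd.DataFrame({
    "c1": ["a", "b", "c", "a"],
    "c2": ["c", "c", "a", "b"],
})

# Hypothetical mapping; pick whatever numbers suit the analysis.
mapping = {"a": 0, "b": 1, "c": 2}

# Series.map looks each value up in the dict. Values missing from the
# dict become NaN, so a gap in the mapping shows up as missing data.
encoded = df[["c1", "c2"]].apply(lambda col: col.map(mapping))
print(encoded)
```

Because both columns go through the same dictionary, a is 0 in c1 and in c2 regardless of row order.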

You do not necessarily need to convert the values to integers to perform data analysis. You only need a format that suits the kind of analysis you are doing. Take this example:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame(
    {
        "c1": ["a", "a", "a", "b", "c", "c", "c", "c"],
        "c2": ["a", "b", "a", "a", "a", "b", "b", "b"],
    }
)

You can plot the distribution of each column as a bar chart via value_counts:

fig, ax = plt.subplots(1, 2, figsize=(10, 5))
df["c1"].value_counts().plot(kind="bar", ax=ax[0])
df["c2"].value_counts().plot(kind="bar", ax=ax[1])

plt.show()

Or you can draw a pie chart of the frequencies as follows:

fig, ax = plt.subplots(1, 2, figsize=(10, 5))
df["c1"].value_counts().plot(kind="pie", ax=ax[0])
df["c2"].value_counts().plot(kind="pie", ax=ax[1])
plt.show()

Or, if you are working with seaborn, it is even easier: countplot counts the categories itself, so there is no conversion involved at all.

fig, ax = plt.subplots(1, 2, figsize=(10, 5))
sns.countplot(x="c1", data=df, ax=ax[0])
sns.countplot(x="c2", data=df, ax=ax[1])
plt.show()

Or you can do a scatter plot like this:

fig, ax = plt.subplots(1, 1, figsize=(10, 5))
sns.scatterplot(x="c2", y="c1", data=df, ax=ax)
plt.show()

That said, none of this makes your data ready for a machine learning model. For that you'll need OneHotEncoder or LabelEncoder from sklearn to convert it to a numeric form.

You can do it with sklearn as follows, for example with LabelEncoder:

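from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# fit_transform refits the encoder on each column separately
df["c1"] = le.fit_transform(df["c1"])
df["c2"] = le.fit_transform(df["c2"])
print(df)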

This maps each distinct value to an integer, numbering the values in sorted order (a is 0, b is 1, c is 2), and the result will be:

   c1  c2
0   0   0
1   0   1
2   0   0
3   1   0
4   2   0
5   2   1
6   2   1
7   2   1
