2

I have a data set with some string columns, and I'm developing a neural network using it. Because of the string values I can't train the network. What is the best way to convert these string values to a format a neural network can read?

This is the data set that I have:

type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,1,0
PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,0,1

I want to convert the type, nameOrig and nameDest fields to a format a neural network can read.

I have used the method below, but I don't know whether it's right or wrong.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()

test_set = pd.read_csv('cs.csv')
test_set['type'] = enc.fit_transform(test_set['type'])

I have gone through below questions. But most of them are not worked for me

How to convert string based data frame to numeric

converting non-numeric to numeric value using Panda libraries

converting non-numeric to numeric value using Panda libraries

  • "Most of them are not worked" - why not? What happened? What did you expect? Commented Jan 5, 2019 at 19:41
  • Following the third question in the links I added, I'm using the LabelEncoder. The others gave me some errors. Commented Jan 5, 2019 at 19:56
  • You should use the LabelEncoder. What's wrong with that? Commented Jan 5, 2019 at 20:02

3 Answers

2

In this case you can use the pandas category dtype to map strings to integer codes (see categorical data), so the scikit-learn LabelEncoder or OneHotEncoder is not needed for this mapping.

import pandas as pd

df = pd.read_csv('54055554.csv', header=0, dtype={
    'type': 'category',  # <--
    'amount': float,
    'nameOrig': str,
    'oldbalanceOrg': float,
    'newbalanceOrig': float,
    'nameDest': str,
    'oldbalanceDest': float,
    'newbalanceDest': float,
    'isFraud': bool,
    'isFlaggedFraud': bool
})

print(dict(enumerate(df['type'].cat.categories)))
# {0: 'PAYMENT', 1: 'TRANSFER'}

print(list(df['type'].cat.codes))
# [0, 0, 1]

The data from the CSV:

type, ...
PAYMENT, ...
PAYMENT, ...
TRANSFER, ...
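
As a minimal sketch of how the codes could be used, assuming the three sample rows from the question (trimmed to a few columns and read from an in-memory string instead of a file), the category codes can be stored as a new column. The column name type_code and the use of pd.get_dummies as a one-hot alternative are illustrative choices, not part of the answer above.

```python
import io
import pandas as pd

# Three sample rows copied from the question (only some columns kept for brevity).
csv_text = """type,amount,nameOrig,isFraud
PAYMENT,9839.64,C1231006815,1
PAYMENT,1864.28,C1666544295,0
TRANSFER,181.0,C1305486145,0
"""

df = pd.read_csv(io.StringIO(csv_text), dtype={'type': 'category'})

# Store the integer category codes in a new column (hypothetical name).
df['type_code'] = df['type'].cat.codes

# Alternatively, expand the column into one indicator column per category.
one_hot = pd.get_dummies(df['type'], prefix='type')
```

The integer codes impose an arbitrary ordering on the categories, so the indicator columns may be the safer input for a neural network when the categories have no natural order.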

2

Transformation

First you need to transform the three string columns using the LabelEncoder class.

Encoding Categorical Data

Here the type column is a categorical value. For this you can use the OneHotEncoder class available in sklearn.preprocessing.

Avoiding Dummy Variable Trap

Then you need to avoid the Dummy Variable Trap by removing any one of the columns that are used to represent type.

Code

Here is some sample code for your reference.

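import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

dataset = pd.read_csv('cs.csv')
X = dataset.iloc[:].values

# Label-encode the three string columns: type, nameOrig, nameDest
labelencoder = LabelEncoder()

X[:, 0] = labelencoder.fit_transform(X[:, 0])
X[:, 2] = labelencoder.fit_transform(X[:, 2])
X[:, 5] = labelencoder.fit_transform(X[:, 5])

# One-hot encode the type column (index 0); the other columns pass through.
# OneHotEncoder(categorical_features=...) was removed in scikit-learn 0.22,
# so a ColumnTransformer is used instead.
transformer = ColumnTransformer([('type', OneHotEncoder(), [0])],
                                remainder='passthrough', sparse_threshold=0)
X = transformer.fit_transform(X).astype(float)

# Avoiding the Dummy Variable Trap: drop the first one-hot column
X = X[:, 1:]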
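
The column removal can also be left to the encoder itself. The following is only a sketch, assuming scikit-learn 0.21 or later (where OneHotEncoder accepts a drop parameter and string input directly) and using just the type values from the three sample rows in the question.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# The 'type' values from the three sample rows in the question, as one column.
types = np.array([['PAYMENT'], ['PAYMENT'], ['TRANSFER']])

# drop='first' omits the indicator column of the first category,
# which avoids the dummy variable trap without manual slicing.
encoder = OneHotEncoder(drop='first')
encoded = encoder.fit_transform(types).toarray()
```

With two categories this leaves a single indicator column, which is 1 for TRANSFER and 0 for PAYMENT.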

2

You need to encode the string values as numeric ones. What I usually do in this case is create a table for each non-numeric feature, containing all the possible values of that feature. The index of a value in the corresponding feature's table is then used when training the model.

Example:

type_values = ['PAYMENT', 'TRANSFER']
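
One way the table lookup might look, as a rough sketch: the names type_to_index, rows and encoded are hypothetical, and the input values are the type column of the three sample rows from the question.

```python
type_values = ['PAYMENT', 'TRANSFER']

# Map each value to its position in the table.
type_to_index = {value: index for index, value in enumerate(type_values)}

# The 'type' column of the three sample rows from the question.
rows = ['PAYMENT', 'PAYMENT', 'TRANSFER']

# Replace each string with its index in the table.
encoded = [type_to_index[value] for value in rows]
```

A value that is missing from the table raises a KeyError here, so the table has to cover every value that can occur in the data.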

1 Comment

Surely using the standard LabelEncoder is preferred to any ad hoc solution.
