scikit-learn: could not convert string to float

Question

I have a CSV file that looks like this:

lemma,trained
iran seizes bitcoin mining machines power spike,-1
... (goes on for 1054 lines)

And my code looks like:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

df = pd.read_csv('lemma copy.csv')
X = df.iloc[:, 0].values
y = df.iloc[:, 1].values
print(y)

X_train, X_test, y_train, y_test =train_test_split(X,y,test_size= 0.25, random_state=0)

sc_X = StandardScaler() 

X_train = sc_X.fit_transform(X_train)

I am getting the error

Traceback (most recent call last):
  File "/home/arctesian/Scripts/School/EE/Algos/Qual/bayes/sklean.py", line 20, in <module>
    X_train = sc_X.fit_transform(X_train)
  File "/home/arctesian/.local/lib/python3.10/site-packages/sklearn/base.py", line 867, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/home/arctesian/.local/lib/python3.10/site-packages/sklearn/preprocessing/_data.py", line 809, in fit
    return self.partial_fit(X, y, sample_weight)
  File "/home/arctesian/.local/lib/python3.10/site-packages/sklearn/preprocessing/_data.py", line 844, in partial_fit
    X = self._validate_data(
  File "/home/arctesian/.local/lib/python3.10/site-packages/sklearn/base.py", line 577, in _validate_data
    X = check_array(X, input_name="X", **check_params)
  File "/home/arctesian/.local/lib/python3.10/site-packages/sklearn/utils/validation.py", line 856, in check_array
    array = np.asarray(array, order=order, dtype=dtype)
ValueError: could not convert string to float: 'twitter ios beta lays groundwork bitcoin tips'

Printing the data shows that the random split puts that line first, so the problem must be with how the data is encoded. How do I fix this?

Please edit your question and post the full text of any errors or tracebacks.
– MattDMo, Commented Jul 22, 2022 at 23:51
Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking.
– Community Bot, Commented Jul 23, 2022 at 6:18
I am simply asking how to convert this text to floats so it can be processed.
– Daniel Okita, Commented Jul 23, 2022 at 17:57

Joshua Megnauth · Accepted Answer · 2022-07-23 21:07:10Z

Sometimes searching for the right question on Stack Overflow (or the internet as a whole) is difficult. You're having trouble finding an answer because your question is really about NLP: your CSV contains lemmas, which are text, and StandardScaler only works on numeric data.

You'll have to preprocess your data in some way, such as by using word vectors. Word vectors come from a model trained on a large corpus of text so that each word can be represented by a vector of length N. I'm greatly simplifying this, of course.
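To make the idea concrete, here is a toy sketch in plain Python. The three-dimensional vectors in `toy_vectors` are made up for illustration (real word vectors come from a pretrained model and usually have hundreds of dimensions), and `sentence_vector` is a hypothetical helper, not a library function. It turns a sentence into numbers by averaging the vectors of the words it knows.

```python
# Toy illustration only: these values are invented, not taken from a real model.
toy_vectors = {
    "bitcoin": [0.9, 0.1, 0.0],
    "mining": [0.7, 0.3, 0.1],
    "power": [0.1, 0.8, 0.2],
}


def sentence_vector(sentence, vectors, dim=3):
    """Average the vectors of the known words in a sentence."""
    known = [vectors[word] for word in sentence.split() if word in vectors]
    if not known:
        # No known words: fall back to a zero vector of the same length.
        return [0.0] * dim
    # zip(*known) groups the values dimension by dimension.
    return [sum(column) / len(known) for column in zip(*known)]


vec = sentence_vector("iran seizes bitcoin mining machines power spike", toy_vectors)
print(vec)  # one fixed-length list of floats per sentence
```

Every sentence ends up as a list of floats of the same length, which is the kind of input a scaler or classifier can accept.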

Another strategy is the bag-of-words approach. A bag of words counts how often each word appears in each document of your corpus. You train your models on those counts rather than on the original strings. Here's a very small example using scikit-learn's CountVectorizer.

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I like cats", "meow", "Espeon is a cool Pokemon", "my friend has lotsof pet fish",
          "my pet cat wants to eat my friend's fish", "spams spam", "not spam",
          "someone please hire me for a job", "nlp is cool",
          "this corpus isn't actually large enough to use counter vectorizer well"]

count_vec = CountVectorizer(ngram_range=(1, 3), stop_words="english").fit(corpus)

corpus_cv = count_vec.transform(corpus)

I skipped steps to keep the code concise, but the above is the gist of using CountVectorizer.

Daniel Okita · Accepted Answer · 2022-07-25 00:35:13Z

I fixed it by using @Joshua Megnauth's method and getting rid of pandas. Here is what I did:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from coalas import csvReader as c
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# df = pd.read_csv('lemma copy.csv')
def vect(X):
    features = vectorizer.fit_transform(X)

    features_nd = features.toarray()
    return features_nd

def test():
    y_pred = classifier.predict(X_test)
    print(accuracy_score(y_test, y_pred))

if __name__ == "__main__":
    c.importCSV('lemma copy.csv')
    vectorizer = CountVectorizer(
        analyzer='word',
        lowercase=False,
    )
    X = c.lemma
    # y = c.Best
    y = c.trained
    features_nd = vect(X)
    X_train, X_test, y_train, y_test = train_test_split(
        features_nd, y, test_size=0.2, random_state=0)
    sc_X = StandardScaler()
    # print(X_train)
    X_train = sc_X.fit_transform(X_train)
    # Reuse the scaling learned on the training set; do not refit on the test set.
    X_test = sc_X.transform(X_test)

    classifier = GaussianNB()

    classifier.fit(X_train, y_train)
    test()
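
For anyone who wants a more compact variant, here is a sketch that wraps the vectorizer and the classifier in a scikit-learn pipeline, so the vocabulary is learned from the training split only. Note the assumptions: it swaps in MultinomialNB (which works directly on sparse counts) instead of the GaussianNB and StandardScaler used above, and the headlines and labels, apart from the first row quoted in the question, are made up to stand in for the real CSV.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for the 'lemma' and 'trained' columns.
texts = [
    "iran seizes bitcoin mining machines power spike",
    "bitcoin price crashes exchange hacked",
    "bitcoin rallies investors buy",
    "miners lose money power costs rise",
    "major bank adopts bitcoin payments",
    "regulators ban bitcoin trading",
    "bitcoin fund approved investors cheer",
    "payment app adds bitcoin support",
]
labels = [-1, -1, 1, -1, 1, -1, 1, 1]

# Split the raw strings first; vectorizing happens inside the pipeline.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0
)

# fit() learns the vocabulary and the classifier from the training text only;
# predict() reuses that same vocabulary for the test text.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(acc)
```

With only eight rows the accuracy itself means nothing; the point is just the order of operations (split, then fit the vectorizer on the training part).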
