TypeError: a bytes-like object is required, not 'str': even with the encode

Question

I'm just trying to print my script. I have this problem, I have researched and read many answers and even adding .encode ('utf-8) still does not work.

import pandas
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

n_components = 30
n_top_words = 10

def print_top_words(model, feature_names, n_top_words):]
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx 
        message += " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])

        return message

text = pandas.read_csv('fr_pretraitement.csv', encoding = 'utf-8')
text_clean = text['liste2']
text_raw = text['liste1']
text_clean_non_empty = text_clean.dropna()
not_commas = text_raw.str.replace(',', '')
text_raw_list = not_commas.values.tolist()
text_clean_list = text_clean_non_empty.values.tolist()

tf_vectorizer = CountVectorizer()
tf = tf_vectorizer.fit_transform(text_clean_list)
tf_feature_names = tf_vectorizer.get_feature_names()

lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                            learning_method='online',
                            learning_offset=50.,
                            random_state=0)

lda.fit(tf)

print('topics...')
print(print_top_words(lda, tf_feature_names, n_top_words))


document_topics = lda.fit_transform(tf)
topics = print_top_words(lda, tf_feature_names, n_top_words)
for i in range(len(topics)):
    print("Topic {}:".format(i))
    docs = np.argsort(document_topics[:, i])[::-1]
    for j in docs[:300]:
       cleans = " ".join(text_clean_list[j].encode('utf-8').split(",")[:2])    
       print(cleans.encode('utf-8') + ',' + " ".join(text_raw_list[j].encode('utf-8').split(",")[:2]))

My output:

Traceback (most recent call last):

File "script.py", line 62, in

cleans = " ".join(text_clean_list[j].encode('utf-8').split(",")[:2])

TypeError: a bytes-like object is required, not 'str'

Laurent H. · Accepted Answer · 2018-08-16 14:44:09Z

1

Let's look at the line in which the error raised:

cleans = " ".join(text_clean_list[j].encode('utf-8').split(",")[:2])

Let's go step by step:

text_clean_list[j] is of str type => no error until there
text_clean_list[j].encode('utf-8') is of bytes type => no error until there
text_clean_list[j].encode('utf-8').split(",") is wrong: the parameter "," passed to split() method is of str type, but it must have been of bytes type (because here split() is a method from a bytes object) => the error is raised, indicating that a bytes-like object is required, not 'str'.

Note: Replacing split(",") with split(b",") avoids the error (but it may not be the behavior you expect...)

edited Aug 16, 2018 at 14:44

answered Aug 16, 2018 at 14:39

Laurent H.

6,5761 gold badge21 silver badges40 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

NoorJafri · Accepted Answer · 2018-08-16 16:12:58Z

1

cleans = " ".join(text_clean_list[j].encode('utf-8').split(",")[:2])

You are encoding the string inside text_clean_list[j] into the bytes but what about the split(",")?

"," still is a str. Now you are trying to split byte like object using a string.

Example:

a = "this,that"
>>> a.encode('utf-8').split(',')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: a bytes-like object is required, not 'str'

Edit

Solution: 1- One solution could be that don't encode your string object right now, just split first and then encode later on. Like in my example:

a = "this, that"
c = a.split(",")
cleans = [x.encode('utf-8') for x in c]

2- Just use a simple encoding of "," itself.

cleans = a.encode("utf-8").split("b")

Both yields same answer. It would be better if you could just come up with input and output examples.

edited Aug 16, 2018 at 16:12

answered Aug 16, 2018 at 14:27

NoorJafri

1,85718 silver badges27 bronze badges

1 Comment

marin Over a year ago

I just can do this cleans = " ".join(text_clean_list[j].encode('utf-8').split(",").encode('utf-8')[:2]) for encoding ',' too. Still don't work...

Collectives™ on Stack Overflow

TypeError: a bytes-like object is required, not 'str': even with the encode

2 Answers 2

Comments

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related