
I'm just trying to print from my script. I have this problem; I have researched and read many answers, and even adding .encode('utf-8') still does not work.

import pandas
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

n_components = 30
n_top_words = 10

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx 
        message += " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])

        return message

text = pandas.read_csv('fr_pretraitement.csv', encoding = 'utf-8')
text_clean = text['liste2']
text_raw = text['liste1']
text_clean_non_empty = text_clean.dropna()
not_commas = text_raw.str.replace(',', '')
text_raw_list = not_commas.values.tolist()
text_clean_list = text_clean_non_empty.values.tolist()

tf_vectorizer = CountVectorizer()
tf = tf_vectorizer.fit_transform(text_clean_list)
tf_feature_names = tf_vectorizer.get_feature_names()

lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                            learning_method='online',
                            learning_offset=50.,
                            random_state=0)

lda.fit(tf)

print('topics...')
print(print_top_words(lda, tf_feature_names, n_top_words))


document_topics = lda.fit_transform(tf)
topics = print_top_words(lda, tf_feature_names, n_top_words)
for i in range(len(topics)):
    print("Topic {}:".format(i))
    docs = np.argsort(document_topics[:, i])[::-1]
    for j in docs[:300]:
       cleans = " ".join(text_clean_list[j].encode('utf-8').split(",")[:2])    
       print(cleans.encode('utf-8') + ',' + " ".join(text_raw_list[j].encode('utf-8').split(",")[:2]))

My output:

Traceback (most recent call last):
  File "script.py", line 62, in <module>
    cleans = " ".join(text_clean_list[j].encode('utf-8').split(",")[:2])
TypeError: a bytes-like object is required, not 'str'

2 Answers


Let's look at the line where the error is raised:

cleans = " ".join(text_clean_list[j].encode('utf-8').split(",")[:2])

Let's go step by step:

  • text_clean_list[j] is of str type => no error so far
  • text_clean_list[j].encode('utf-8') is of bytes type => no error so far
  • text_clean_list[j].encode('utf-8').split(",") is wrong: the separator "," passed to split() is a str, but it must be a bytes object (because here split() is called on a bytes object) => the error is raised: a bytes-like object is required, not 'str'.

Note: Replacing split(",") with split(b",") avoids the error (but it may not be the behavior you expect...)
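
To make that concrete, here is a minimal sketch of your failing line with bytes used consistently. Note that the " " separator passed to join() would hit the same str-vs-bytes mismatch, so it needs the b prefix too:

# both the split() separator and the join() separator must be bytes here
cleans = b" ".join(text_clean_list[j].encode('utf-8').split(b",")[:2])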


cleans = " ".join(text_clean_list[j].encode('utf-8').split(",")[:2])

You are encoding the string inside text_clean_list[j] into bytes, but what about the split(",")?

"," is still a str, so you are trying to split a bytes-like object using a string.

Example:

a = "this,that"
>>> a.encode('utf-8').split(',')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: a bytes-like object is required, not 'str'

Edit

Solution 1: don't encode your string right away; split first and then encode afterwards, as in this example:

a = "this, that"
c = a.split(",")
cleans = [x.encode('utf-8') for x in c]
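
In the interpreter, that produces:

>>> [x.encode('utf-8') for x in "this, that".split(",")]
[b'this', b' that']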

Solution 2: encode the "," separator itself as bytes:

cleans = a.encode("utf-8").split(b",")
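
And the interpreter output for this variant:

>>> "this, that".encode("utf-8").split(b",")
[b'this', b' that']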

Both yield the same result. It would be better if you could include example input and expected output in your question.
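
That said, since you are on Python 3, the simplest fix for your loop is probably to skip the encoding entirely: print() accepts str directly, so you can split and join the strings as they are. A minimal sketch of the loop body under that assumption (raws is just an illustrative local variable):

# assumes text_clean_list[j] and text_raw_list[j] are plain str values;
# in Python 3, print() handles str (Unicode) directly, so no encode() is needed
cleans = " ".join(text_clean_list[j].split(",")[:2])
raws = " ".join(text_raw_list[j].split(",")[:2])
print(cleans + "," + raws)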

1 Comment

I tried cleans = " ".join(text_clean_list[j].encode('utf-8').split(",").encode('utf-8')[:2]) to encode ',' too. It still doesn't work...
