The content of the file looks like the following, and the file is UTF-8 encoded:

cd232704-a46f-3d9d-97f6-67edb897d65f    b'this Friday, Gerda Scheuers will be excited \xe2\x80\x94 but she\xe2\x80\x99s most excited about the merchandise the movie will bring.'

Here is my code:

with open(file, 'r') as f_in:
    for line in f_in:
        tokens = line.split('\t')
        print(tokens[1])

I want to get the right answer - "this Friday, Gerda Scheuers will be excited - but she's most excited about the merchandise the movie will bring."

print(b'\xe2\x80\x94'.decode('utf-8'))  # decodes the UTF-8 bytes into a str ('—'), not ASCII

But I can't get the bytes back when reading from the file. If I open the file in binary mode, I need to decode each line in order to split it.

1 Answer

You can use ast.literal_eval to convert the text of the bytes literal into a bytes object:

>>> import ast
>>> ast.literal_eval(r"b'excited \xe2\x80\x94 but she\xe2\x80\x99s'")
b'excited \xe2\x80\x94 but she\xe2\x80\x99s'

Then decode it to get a str object:

>>> ast.literal_eval(r"b'excited \xe2\x80\x94 but she\xe2\x80\x99s'").decode('utf-8')
'excited — but she’s'

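Applied to the loop from the question:

import ast

with open(file, 'r', encoding='utf-8') as f_in:
    for line in f_in:
        tokens = line.rstrip('\n').split('\t')
        if len(tokens) < 2:
            continue  # skip lines without a second tab-separated field
        bytes_part = ast.literal_eval(tokens[1])  # "b'...'" text -> bytes object
        s = bytes_part.decode('utf-8')  # decode the bytes to get a str
        print(s)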
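
If the file may contain blank or malformed lines, the same idea can be wrapped in a small helper. This is only a minimal sketch: the name parse_line, the choice to return None for bad lines, and the shortened tab-separated sample line are assumptions for illustration, not part of the question's data format.

```python
import ast

def parse_line(line):
    """Return (record_id, text) for one tab-separated line, or None if it is malformed."""
    tokens = line.rstrip('\n').split('\t')
    if len(tokens) < 2:
        return None  # no second field to parse
    try:
        # literal_eval only evaluates Python literals, so it does not run arbitrary code
        value = ast.literal_eval(tokens[1])
    except (ValueError, SyntaxError):
        return None  # the field is not a valid Python literal
    if not isinstance(value, bytes):
        return None  # a literal, but not a bytes literal
    # decode() raises UnicodeDecodeError if the bytes are not valid UTF-8
    return tokens[0], value.decode('utf-8')

# Hypothetical sample line: an id, a tab, then the text of a bytes literal
sample = "cd232704-a46f-3d9d-97f6-67edb897d65f\tb'excited \\xe2\\x80\\x94 but she\\xe2\\x80\\x99s'\n"
print(parse_line(sample))
```

Returning None rather than raising keeps the reading loop simple, since the caller can skip bad lines with a single check. If silent skipping would hide data problems, logging the offending line is a reasonable alternative.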