
I am having some trouble understanding the correct way to handle Unicode strings in Python. I have read many questions about it, but it is still unclear what I should do to avoid problems when reading and writing files.

My goal is to read some huge (up to 7GB) files efficiently, line by line. I was doing it with a simple with open(filename) as f:, but I ended up with an error in ASCII decoding.

Then I read that the correct way of doing it would be to write:

import codecs

with codecs.open(filename, 'r', encoding='utf-8') as logfile:

However this ends up in:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x88 in position 13: invalid start byte

Frankly, I don't understand why this exception is raised.

I have found a working solution doing:

with open(filename) as f:
    for line in f:
        line = unicode(line, errors='ignore')

But this approach ended up being incredibly slow. Therefore my question is:

Is there a correct way of doing this, and what is the fastest way? Thanks

5 Comments

  • Are you 100% certain your file is UTF-8 encoded? Your error suggests your file is at least corrupted.
  • Don't use codecs.open(), by the way; use the newer and far more robust io.open() instead. You can specify an errors handler for that call.
  • @MartijnPieters No, I am not 100% sure it is UTF-8 encoded. What seems strange is that I am able to open it regularly without specifying UTF-8. If it were corrupted, shouldn't open(filename) raise an exception too? If not, and I am therefore forced to go with the unicode() approach, is there a way to make it faster?
  • No, decoding only takes place as you read data; opening the file won't test up front whether all the data in it can be decoded.
  • In the end, you are processing 7GB of data as rich Unicode objects in Python. Expect some slowness anyway.

1 Answer


Your data is probably not UTF-8 encoded. Figure out the correct encoding and use that instead; we can't tell you which codec is right, because we can't see your data. The byte 0x88 from your traceback is a UTF-8 continuation byte, which can never start a character: that is exactly what "invalid start byte" means, and it is a strong hint that the file uses some other encoding.
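One way to make an educated guess, assuming you can install the third-party chardet package (my suggestion, not something mentioned above), is to run a sample of the raw bytes through its detector:

import chardet  # third-party package, assumed installed: pip install chardet

# Detection only needs a sample, not the whole 7GB file.
with open(filename, 'rb') as f:
    sample = f.read(1024 * 1024)

print(chardet.detect(sample))
# e.g. {'confidence': 0.73, 'encoding': 'windows-1252'}

Treat the result as a hint, not a guarantee; confirm it against data you can read.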

If you must specify an error handler, you may as well do so when opening the file. Use the io.open() function: codecs is an older library with some known issues, while io (which underpins all I/O in Python 3 and was backported to Python 2) is far more robust and versatile.

The io.open() function takes an errors argument too:

import io

with io.open(filename, 'r', encoding='utf-8', errors='replace') as logfile:
    for line in logfile:
        pass  # process each decoded unicode line here

I picked replace as the error handler so that you at least get placeholder characters for anything that could not be decoded.
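For example, the offending byte from your traceback decodes like this with the replace handler, substituting the official replacement character U+FFFD instead of raising:

decoded = b'foo\x88bar'.decode('utf-8', 'replace')
print(repr(decoded))  # u'foo\ufffdbar' -- the \ufffd marks the undecodable byte

If you'd rather drop bad bytes silently, errors='ignore' works the same way, but you lose the markers showing where data went missing.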


2 Comments

Thank you. This solution works and is slightly faster than the unicode() approach. Still, it is way slower than the plain non-unicode approach, but maybe there is no way of achieving comparable speeds.
@ClonedOne: depending on how you are processing those lines and what codec was really used, you could probably just treat the data as binary and not decode it.
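
A minimal sketch of that binary idea, assuming the per-line processing only needs byte-level operations (the b'ERROR' marker below is a hypothetical placeholder for whatever you actually search for):

matches = 0
with open(filename, 'rb') as f:   # binary mode: lines stay raw bytes, no decode step
    for line in f:
        if b'ERROR' in line:      # hypothetical marker -- substitute your real test
            matches += 1
print(matches)

Because nothing is decoded, this runs at roughly the speed of the plain non-unicode loop; it is only an option if your processing never needs the text as characters.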
