
I am having some trouble understanding the correct way to handle Unicode strings in Python. I have read many questions about it, but it is still unclear what I should do to avoid problems when reading and writing files.

My goal is to read some huge (up to 7GB) files efficiently, line by line. I was doing it with a simple with open(filename) as f:, but I ended up with an error in ASCII decoding.

Then I read that the correct way of doing it would be to write:

import codecs

with codecs.open(filename, 'r', encoding='utf-8') as logfile:

However this ends up in:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x88 in position 13: invalid start byte

Frankly, I don't understand why this exception is raised.

I have found a working solution doing:

with open(filename) as f:
    for line in f:
        line = unicode(line, errors='ignore')

But this approach ended up being incredibly slow. Therefore my question is:

Is there a correct way of doing this, and what is the fastest way? Thanks

5 Comments

  • Are you 100% certain your file is UTF-8 encoded? Your error suggests your file is at least corrupted.
  • Don't use codecs.open(), by the way; use the newer and far more robust io.open() instead. You can specify an errors handler for that call.
  • @MartijnPieters No, I am not 100% sure it is UTF-8 encoded. What seems strange is that I am able to open it regularly without specifying UTF-8. If it were corrupted, shouldn't open(filename) raise an exception too? If not, and I am therefore forced to go with the unicode() approach, is there a way to make it faster?
  • No, decoding only takes place as you read data; opening the file won't test up front whether all the data in it can be decoded.
  • In the end, you are processing 7GB of data as rich Unicode objects in Python. Expect some slowness anyway.

1 Answer


Your data is probably not UTF-8 encoded. Figure out the correct encoding and use that instead; we can't tell you which codec is right, because we can't see your data. The byte 0x88 from your traceback is a UTF-8 continuation byte, which can never start a character: that is exactly what "invalid start byte" means, and it is a strong hint that the file uses some other encoding.
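One way to make an educated guess, assuming you can install the third-party chardet package (my suggestion, not something mentioned above), is to run a sample of the raw bytes through its detector:

import chardet  # third-party package, assumed installed: pip install chardet

# Detection only needs a sample, not the whole 7GB file.
with open(filename, 'rb') as f:
    sample = f.read(1024 * 1024)

print(chardet.detect(sample))
# e.g. {'confidence': 0.73, 'encoding': 'windows-1252'}

Treat the result as a hint, not a guarantee; confirm it against data you can read.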

If you must specify an error handler, you may as well do so when opening the file. Use the io.open() function: codecs is an older library with some known issues, while io (which underpins all I/O in Python 3 and was backported to Python 2) is far more robust and versatile.

The io.open() function takes an errors argument too:

import io

with io.open(filename, 'r', encoding='utf-8', errors='replace') as logfile:
    for line in logfile:
        pass  # process each decoded unicode line here

I picked replace as the error handler so that you at least get placeholder characters for anything that could not be decoded.
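For example, the offending byte from your traceback decodes like this with the replace handler, substituting the official replacement character U+FFFD instead of raising:

decoded = b'foo\x88bar'.decode('utf-8', 'replace')
print(repr(decoded))  # u'foo\ufffdbar' -- the \ufffd marks the undecodable byte

If you'd rather drop bad bytes silently, errors='ignore' works the same way, but you lose the markers showing where data went missing.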


2 Comments

Thank you. This solution works and is slightly faster than the unicode() approach. Still, it is way slower than the plain non-unicode approach, but maybe there is no way of achieving comparable speeds.
@ClonedOne: depending on how you are processing those lines and what codec was really used, you could probably just treat the data as binary and not decode it.
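
A minimal sketch of that binary idea, assuming the per-line processing only needs byte-level operations (the b'ERROR' marker below is a hypothetical placeholder for whatever you actually search for):

matches = 0
with open(filename, 'rb') as f:   # binary mode: lines stay raw bytes, no decode step
    for line in f:
        if b'ERROR' in line:      # hypothetical marker -- substitute your real test
            matches += 1
print(matches)

Because nothing is decoded, this runs at roughly the speed of the plain non-unicode loop; it is only an option if your processing never needs the text as characters.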
