I am having some trouble understanding the correct way to handle Unicode strings in Python. I have read many questions about it, but it is still unclear what I should do to avoid problems when reading and writing files.
My goal is to read some huge files (up to 7 GB) efficiently, line by line. I was doing it with a simple with open(filename) as f:, but I ended up with an ASCII decoding error.
Then I read that the correct way of doing it would be to write:
import codecs
with codecs.open(filename, 'r', encoding='utf-8') as logfile:
However, this ends up with:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x88 in position 13: invalid start byte
Frankly, I don't understand why this exception is raised.
I have found a working solution by doing:
with open(filename) as f:
    for line in f:
        # decode each byte string to unicode, dropping undecodable bytes
        line = unicode(line, errors='ignore')
But this approach ended up being incredibly slow. Therefore my question is:
Is there a correct way of doing this, and what is the fastest way? Thanks
A comment suggests not using codecs.open() and using the newer, far more robust io.open() instead; you can specify an errors handler for that call. Does open(filename) raise an exception too? If not, and I am therefore forced to go with the unicode() approach, is there a way to make it faster?
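For reference, a minimal sketch of what that suggestion might look like (the file name here is a placeholder, and errors='replace' is just one possible handler; 'ignore' would behave like the unicode() workaround above):

import io

filename = 'huge.log'  # placeholder path, not from the original question

line_count = 0
# errors='replace' substitutes undecodable bytes instead of raising
# UnicodeDecodeError; each line comes back as a unicode object.
with io.open(filename, 'r', encoding='utf-8', errors='replace') as logfile:
    for line in logfile:
        line_count += 1

print(line_count)

Since the error handling is applied by the text stream as it reads, this removes the explicit per-line unicode() call; whether 'replace' or 'ignore' is the right handler depends on whether the bad bytes matter for the processing.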