3

Assume I read some content from a socket in Python and have to decode it as UTF-8 on the fly.

I cannot afford to keep all the content in memory, so I must decode it as I receive it and save it to a file.

It can happen that I receive only some of the bytes of a character (the € sign, for example, is encoded in UTF-8 as three bytes, which Python shows as '\xe2\x82\xac').

Assume I have received only the first two bytes (\xe2\x82). If I try to decode them, I get a UnicodeDecodeError, as expected.

I could always try to decode the current content and check whether it raises an exception.

  • But how reliable is this approach?
  • How can I know or determine if I can decode the current content?
  • What is the correct way to do this?

Thanks

2 Answers

6

Guido's time machine strikes again.

>>> import codecs
>>> dec = codecs.getincrementaldecoder('utf-8')()
>>> dec.decode('foo\xe2\x82')
u'foo'
>>> dec.decode('\xac')
u'\u20ac'
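
For the socket-to-file case in the question, the same decoder can be wrapped in a small loop. This is only a minimal sketch: `decode_stream` is a hypothetical helper name, the list of chunks stands in for real socket reads, and an in-memory buffer stands in for the output file. Byte literals use the `b''` prefix so the sketch also runs on Python 3.

```python
import codecs
import io


def decode_stream(chunks, out, encoding='utf-8'):
    """Decode an iterable of byte chunks incrementally and write text to out.

    Hypothetical helper for illustration; not part of the codecs module.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        # Bytes of an incomplete character stay buffered inside the decoder
        # until the rest of the character arrives in a later chunk.
        text = decoder.decode(chunk)
        if text:
            out.write(text)
    # final=True raises UnicodeDecodeError if the stream ended mid-character.
    out.write(decoder.decode(b'', final=True))


# Simulated socket reads: the euro sign is split across two chunks.
chunks = [b'foo\xe2\x82', b'\xac bar']
out = io.StringIO()
decode_stream(chunks, out)
result = out.getvalue()
```

With a real connection, the chunks could come from something like `iter(lambda: sock.recv(4096), b'')`, assuming `sock` is a connected socket, and `out` could be a file opened with `io.open(path, 'w', encoding='utf-8')`.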

2 Comments

It works! Does this decoder keep state? Otherwise, how does it know about the bytes it has already received? What about memory consumption? Do I have to recreate the decoder periodically?
It probably stores the undecoded bytes somewhere. With UTF-8, that means it holds at most 3 bytes. Passing a true value as the second argument (final) to decode() finalizes the current decode operation, and reset() lets you recycle the decoder.
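
To make the point about the second argument and reset() concrete, here is a small sketch. The variable names are just for illustration, and the `b''` prefixes are there so it runs on Python 3 as well.

```python
import codecs

decoder = codecs.getincrementaldecoder('utf-8')()

# Two of the three bytes of the euro sign: nothing can be emitted yet,
# so the decoder returns an empty string and keeps the bytes internally.
partial = decoder.decode(b'\xe2\x82')

# final=True tells the decoder that no more input is coming, so the
# leftover incomplete sequence is reported as an error.
try:
    decoder.decode(b'', final=True)
    truncated = False
except UnicodeDecodeError:
    truncated = True

# reset() discards the buffered bytes so the same object can be reused.
decoder.reset()
fresh = decoder.decode(b'\xe2\x82\xac', final=True)
```

So there should be no need to recreate the decoder periodically: its state is only the few pending bytes, and reset() clears them.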
1

How about using a combination of functools.partial and codecs.iterdecode (as shown here)?

I created a file full of € symbols, and it seems to work as expected (although instead of reading from a file, as shown below, you would be reading from your socket):

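#!/usr/bin/env python

import codecs
import functools
import sys

with open('stack70.txt', 'rb') as euro_file:
    # Read one byte at a time until read() returns an empty string (EOF)
    f_iterator = iter(functools.partial(euro_file.read, 1), b'')
    # iterdecode() buffers incomplete sequences and yields only decoded text
    for item in codecs.iterdecode(f_iterator, 'utf-8'):
        print "sizeof item: %s, item: %s" % (sys.getsizeof(item), item)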

DISCLAIMER: I have little experience with codecs, so I'm not 100% sure this will do what you want, but (as far as I can tell), it does, right?

stack70.txt is the file full of "euro" symbols. The code above outputs:

sizeof item: 56, item: €
sizeof item: 56, item: €
sizeof item: 56, item: €
sizeof item: 56, item: €
sizeof item: 56, item: €

(done using python 2.7)
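
Reading one byte at a time is not required, as far as I can tell. The sketch below reads two bytes per call, so every three-byte character gets split across reads, and iterdecode still reassembles them. An in-memory `io.BytesIO` stands in for the file or socket here; that stand-in and the variable names are my own assumptions, and the `b''` sentinel keeps it working on Python 3 too.

```python
import codecs
import functools
import io

# In-memory stand-in for the file or socket: three euro signs, 9 bytes.
source = io.BytesIO(u'\u20ac\u20ac\u20ac'.encode('utf-8'))

# Read 2 bytes per call so each 3-byte character is split across reads.
reader = iter(functools.partial(source.read, 2), b'')

# iterdecode() yields text only once a complete character is available.
pieces = list(codecs.iterdecode(reader, 'utf-8'))
decoded = u''.join(pieces)
```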
