decode content while reading from socket in Python

Question

Assume I read some content from a socket in Python and have to decode it as UTF-8 on the fly.

I cannot afford to keep all the content in memory, so I must decode it as I receive it and save it to a file.

It can happen that I receive only some of the bytes of a character (the € sign, for example, is represented by three bytes, which Python shows as '\xe2\x82\xac').

Assume I have received only the first two bytes (\xe2\x82). If I try to decode them, I get a 'UnicodeDecodeError', as expected.

I could always try to decode the current content and check whether it throws an exception.

But how reliable is this approach?
How can I determine whether the current content can be decoded?
What is the correct way to do this?

Thanks

Ignacio Vazquez-Abrams · Accepted Answer · 2014-12-27 21:28:20Z

Guido's time machine strikes again.

>>> import codecs
>>> dec = codecs.getincrementaldecoder('utf-8')()
>>> dec.decode('foo\xe2\x82')
u'foo'
>>> dec.decode('\xac')
u'\u20ac'

2 Comments

user2624744 Over a year ago

It works! Does this decoder keep state? Otherwise, how does it know about the bytes it has already received? What about memory consumption? Do I have to recreate the decoder periodically?

Ignacio Vazquez-Abrams Over a year ago

It probably stores the undecoded bytes somewhere. With UTF-8 that means it will hold at most 3 bytes. Passing a true value as the second argument to decode() (final) finalizes the current decode operation, and reset() then lets you recycle the decoder.
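
A small sketch of the behaviour described in this comment, assuming Python 3 and CPython's standard UTF-8 incremental decoder. The variable names are illustrative; `getstate()`, `decode(..., final=True)` and `reset()` are methods of `codecs.IncrementalDecoder`.

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()

first = decoder.decode(b"foo\xe2\x82")  # the two trailing bytes are held back
pending, _flags = decoder.getstate()    # bytes currently buffered in the decoder
second = decoder.decode(b"\xac")        # completes the pending character

# A stream that ends mid-character is reported only when final=True.
decoder.decode(b"\xe2\x82")
try:
    decoder.decode(b"", final=True)
    truncated = False
except UnicodeDecodeError:
    truncated = True

decoder.reset()  # drop any buffered bytes so the decoder can be reused
after_reset = decoder.getstate()[0]

print(repr(first), repr(pending), repr(second), truncated, repr(after_reset))
```

Because the buffer never holds more than an incomplete character, there is no need to recreate the decoder periodically to control memory use.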

Savir · Accepted Answer · 2014-12-27 21:59:30Z

How about using a combination of functools.partial and codecs.iterdecode (as shown here)?

I created a file full of € symbols, and it seems to work as expected (although instead of reading from a file, as shown below, you would be reading from your socket):

#!/usr/bin/env python

import codecs
import functools
import sys

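with open('stack70.txt', 'rb') as euro_file:
    # Call euro_file.read(1) repeatedly until it returns '' (end of file).
    f_iterator = iter(functools.partial(euro_file.read, 1), '')
    # iterdecode yields a unicode string only once a full character has arrived.
    for item in codecs.iterdecode(f_iterator, 'utf-8'):
        print "sizeof item: %s, item: %s" % (sys.getsizeof(item), item)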

DISCLAIMER: I have little experience with codecs, so I'm not 100% sure this will do what you want, but (as far as I can tell), it does, right?

stack70.txt is the file full of "euro" symbols. The code above outputs:

sizeof item: 56, item: €
sizeof item: 56, item: €
sizeof item: 56, item: €
sizeof item: 56, item: €
sizeof item: 56, item: €

(done using python 2.7)
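
For readers on Python 3, the same idea might look like the sketch below. It is not part of the original answer: `io.BytesIO` stands in for the socket or binary file, and the sentinel passed to `iter()` has to be `b""` rather than `''`, because a binary read returns bytes.

```python
import codecs
import functools
import io

# BytesIO stands in for a socket or a file opened in binary mode.
source = io.BytesIO("€€€".encode("utf-8"))

# Read one byte at a time; b"" marks the end of the stream in Python 3.
byte_iter = iter(functools.partial(source.read, 1), b"")

# iterdecode skips empty results, so each item is a complete character here.
items = list(codecs.iterdecode(byte_iter, "utf-8"))
print(items)
```

Reading one byte at a time is only for demonstration; larger chunks work the same way, in which case each item may contain several characters.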
