Simple problem, having trouble working out a solution.

I'm trying to retrieve multibyte characters from a Postgres database encoded as UTF-8 and then return them, but I'm running into encoding issues.

Here's my DB:

   Name    |  Owner   | Encoding |   Collate   |    Ctype    |     Access privileges
-----------+----------+----------+-------------+-------------+---------------------------
 articles  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |

And the data within the table:

                         docid                         |     unigram
-------------------------------------------------------+-----------------
 en_2014-02-09_5eb67dc1927248d7926cdaf72559b57a7f9c017 | Haluk Bürümekçi

The 'unigram' has some multibyte characters. Here's my simplified Python:

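import traceback

import psycopg2

def test():
    # params holds the connection string (defined elsewhere)
    con = psycopg2.connect(params)
    cur = con.cursor()

    cur.execute("SELECT docid, unigram FROM test")

    row = cur.fetchone()

    try:
        # row[1] is a UTF-8 encoded byte string
        print unicode(row[1])
    except Exception:
        traceback.print_exc()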

This is resulting in:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 7: ordinal not in range(128)

I've tried a lot of different things I've seen, including:

row[1].decode(sys.getdefaultencoding()).encode('utf-8')
row[1].decode('utf-8')
row[1].encode('utf-8')
unicode(row[1])
str(row[1])

All of these, and other variations along the same lines, still result in the UnicodeDecodeError. Does anyone know what exactly I'm doing wrong?

1 Answer

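Use unicode(row[1], 'utf-8'). This constructs a unicode string by decoding the byte string in row[1] with the UTF-8 codec. Without an encoding argument, unicode() falls back to the default ASCII codec, which is why it fails on byte 0xc3 :)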
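
For illustration, here is a minimal sketch of that decoding step without a database. The byte string raw is only a stand-in for row[1], and to_text is a hypothetical helper name, not part of psycopg2. It uses .decode('utf-8'), which does the same decoding as unicode(value, 'utf-8') on a byte string, so the snippet should run on both Python 2 and Python 3.

```python
# Stand-in for row[1]: the UTF-8 bytes of the name from the question (assumed value).
raw = u"Haluk B\u00fcr\u00fcmek\u00e7i".encode("utf-8")


def to_text(value, encoding="utf-8"):
    """Decode a byte string to text; return already-decoded text unchanged.

    Hypothetical helper for illustration only.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


text = to_text(raw)

# Decoding the same bytes with the ASCII codec reproduces the failure
# reported in the question (byte 0xc3 at position 7).
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError as exc:
    ascii_failed = True
    failed_at = exc.start
```

The isinstance check just makes the helper safe to call on values that have already been decoded, which can help when some columns come back as text and others as bytes.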

1 Comment

Ah, perfect. I knew it had to be something simple. Thank you.
