Simple problem, having trouble working out a solution.
I'm trying to retrieve multibyte characters from a Postgres database encoded as UTF-8 and then return them, but I'm running into encoding issues.
Here's my DB:
   Name    |  Owner   | Encoding |   Collate   |    Ctype    | Access privileges
-----------+----------+----------+-------------+-------------+---------------------------
 articles  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
And the data within the table:
                         docid                         |     unigram
-------------------------------------------------------+-----------------
 en_2014-02-09_5eb67dc1927248d7926cdaf72559b57a7f9c017 | Haluk Bürümekçi
The 'unigram' column contains some multibyte characters. Here's my simplified Python:
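import traceback

import psycopg2

def test():
    # params holds the connection settings (omitted here)
    con = psycopg2.connect(params)
    cur = con.cursor()
    cur.execute("SELECT docid, unigram FROM test")
    row = cur.fetchone()
    try:
        # row[1] is the 'unigram' value
        print unicode(row[1])
    except Exception, E:
        traceback.print_exc()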
This is resulting in:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 7: ordinal not in range(128)
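For what it's worth, I think the same message can be reproduced without the database at all. This is only a sketch under an assumption I haven't confirmed: that the driver is handing back the value as a UTF-8 encoded byte string. The names `raw`, `message` and `decoded` are just for illustration.

```python
# Stand-in for row[1], ASSUMING the driver returns UTF-8 encoded bytes.
# "\xc3\xbc" is the UTF-8 encoding of the first u-umlaut in the name.
raw = b"Haluk B\xc3\xbcr\xc3\xbcmek\xc3\xa7i"

message = None
try:
    # Decoding with the ASCII codec fails on the first non-ASCII byte.
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)

# Decoding the same bytes as UTF-8 succeeds and yields text.
decoded = raw.decode("utf-8")

print(message)
print(len(raw), len(decoded))
```

The failing byte and position line up with my traceback, which is why I suspect an implicit ASCII decode somewhere.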
I've tried a lot of different things I've seen, including:
row[1].decode(sys.getdefaultencoding()).encode('utf-8')
row[1].decode('utf-8')
row[1].encode('utf-8')
unicode(row[1])
str(row[1])
All of these, and several similar variations, still raise the UnicodeDecodeError. Does anyone know what exactly I'm doing wrong?
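In case it helps anyone answering, here is a small diagnostic I can run on the fetched value to see whether it is a byte string or already text. It is a sketch only: `describe` is a hypothetical helper of mine, and `sample` is a stand-in for `row[1]` that assumes UTF-8 bytes.

```python
def describe(value):
    """Hypothetical helper: return the type name and escaped repr of a value."""
    return "%s %r" % (type(value).__name__, value)

# Stand-in for row[1], ASSUMING the driver returned UTF-8 encoded bytes.
sample = b"Haluk B\xc3\xbcr\xc3\xbcmek\xc3\xa7i"

# The repr shows the raw escape sequences, so it is safe to print
# even when the terminal encoding is ASCII.
report = describe(sample)
print(report)
```

If the report shows escape sequences such as `\xc3\xbc`, the value is still encoded bytes rather than decoded text.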