I'm working on a script that will ETL data from an Oracle database to PostgreSQL. I'm using jaydebeapi to connect to Oracle and psycopg2 for PSQL. I'm loading the data into PSQL by streaming it into the copy_from function -- this worked well for my ETL from a MySQL database. I'm having an issue with one string, but I'm sure there may be others. I have a function that evaluates every field in the result set from Oracle and cleans it up if it's a string. In the source database, Doña Ana is stored in the county table, but it comes across as Do\xf1a Ana, so when I try to load this into PSQL, it throws:
invalid byte sequence for encoding "UTF8": 0xf1 0x61 0x20 0x41
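For context on what that error means: the bytes in the message (0xf1 0x61 0x20 0x41) spell "ña A" in Latin-1, so the data reaching PostgreSQL looks Latin-1-encoded rather than UTF-8. Assuming that's the source encoding (a guess; older Oracle databases often use WE8ISO8859P1, i.e. Latin-1), a minimal sketch of the failure and the repair would be:

```python
# Hypothetical raw bytes as they might arrive from the driver, assuming the
# Oracle column is Latin-1 encoded (0xf1 is 'ñ' in Latin-1).
bad = b'Do\xf1a Ana'

# Decoding as UTF-8 fails the same way PostgreSQL's COPY does: 0xf1 starts a
# multi-byte UTF-8 sequence, but 0x61 ('a') is not a valid continuation byte.
try:
    bad.decode('utf-8')
except UnicodeDecodeError as e:
    print(e)

# Decoding as Latin-1 recovers the intended text, which can then be
# re-encoded as valid UTF-8 for the COPY stream.
text = bad.decode('latin-1')
print(text)                  # Doña Ana
print(text.encode('utf-8'))  # b'Do\xc3\xb1a Ana'
```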
import six
import unicodedata

def prepdata(value):
    # Round-trip the value through UTF-8 bytes to inspect it.
    encodedvalue = bytearray(value, 'utf-8')
    print(encodedvalue)
    decodedvalue = encodedvalue.decode('utf-8')
    print(decodedvalue)
    # Strip accents: decompose, then drop the non-ASCII combining marks.
    cleanedvalue = unicodedata.normalize(u'NFD', decodedvalue).encode('ASCII', 'ignore').decode('utf-8')
    print(cleanedvalue)
    return cleanedvalue
OUTPUT:
b'Do\\xf1a Ana'
Do\xf1a Ana
Do\xf1a Ana
It looks like when I try to encode Do\xf1a Ana it's just escaping the backslash rather than converting it.
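The b'Do\\xf1a Ana' output does suggest the value holds the four literal characters \, x, f, 1 instead of the single character ñ. Assuming that's what the driver is returning (an assumption; it may depend on jaydebeapi's type mapping), a sketch of checking for this and re-interpreting the escape via the unicode_escape codec would be:

```python
import codecs
import unicodedata

# Hypothetical value with a literal backslash-x-f-1 sequence, as the
# bytearray output b'Do\\xf1a Ana' suggests.
raw = 'Do\\xf1a Ana'
print(len(raw))  # 11 characters, not 8 -- the escape really is literal text

# Re-interpret the literal escape sequence as the character it names.
fixed = codecs.decode(raw, 'unicode_escape')
print(fixed)  # Doña Ana

# Now the accent-stripping normalization behaves as in the interpreter session.
ascii_only = unicodedata.normalize('NFKD', fixed).encode('ASCII', 'ignore').decode('utf-8')
print(ascii_only)  # Dona Ana
```

Note that unicode_escape is only safe here if the source text is otherwise Latin-1-compatible, since in Python 3 it implicitly encodes the str to Latin-1 before interpreting the escapes.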
When I try normalizing the string using the interpreter:
>>> x = 'Do\xf1a Ana'
>>> x
'Doña Ana'
>>> p = bytearray(x,'utf-8')
>>> p
bytearray(b'Do\xc3\xb1a Ana')
>>> a = p.decode('utf-8')
>>> a
'Doña Ana'
>>> normal = unicodedata.normalize('NFKD', a).encode('ASCII', 'ignore').decode('utf-8')
>>> normal
'Dona Ana'
Can anyone explain what's going on? Clearly the value coming from the database is not what it appears to be, even though it comes across as a str.