I'm working on a script that will ETL data from an Oracle database to PostgreSQL. I'm using jaydebeapi to connect to Oracle and psycopg2 for PSQL. I'm loading the data into PSQL by streaming it into the copy_from function -- this worked well for my ETL from a MySQL database. I'm having an issue with one string, but I'm sure there may be others. I have a function that evaluates every field in the result set from Oracle and cleans it up if it's a string. In the source database, Doña Ana is stored in the county table, but it comes across as Do\xf1a Ana, so when I try to load this into PSQL, it throws:
invalid byte sequence for encoding "UTF8": 0xf1 0x61 0x20 0x41
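For context on what that error means: the bytes in the message (0xf1 0x61 0x20 0x41) spell "ña A" in Latin-1, so the data reaching PostgreSQL looks Latin-1-encoded rather than UTF-8. Assuming that's the source encoding (a guess; older Oracle databases often use WE8ISO8859P1, i.e. Latin-1), a minimal sketch of the failure and the repair would be:

```python
# Hypothetical raw bytes as they might arrive from the driver, assuming the
# Oracle column is Latin-1 encoded (0xf1 is 'ñ' in Latin-1).
bad = b'Do\xf1a Ana'

# Decoding as UTF-8 fails the same way PostgreSQL's COPY does: 0xf1 starts a
# multi-byte UTF-8 sequence, but 0x61 ('a') is not a valid continuation byte.
try:
    bad.decode('utf-8')
except UnicodeDecodeError as e:
    print(e)

# Decoding as Latin-1 recovers the intended text, which can then be
# re-encoded as valid UTF-8 for the COPY stream.
text = bad.decode('latin-1')
print(text)                  # Doña Ana
print(text.encode('utf-8'))  # b'Do\xc3\xb1a Ana'
```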
import six
import unicodedata

def prepdata(value):
    # Round-trip the value through UTF-8 bytes to inspect it.
    encodedvalue = bytearray(value, 'utf-8')
    print(encodedvalue)
    decodedvalue = encodedvalue.decode('utf-8')
    print(decodedvalue)
    # Strip accents: decompose, then drop the non-ASCII combining marks.
    cleanedvalue = unicodedata.normalize(u'NFD', decodedvalue).encode('ASCII', 'ignore').decode('utf-8')
    print(cleanedvalue)
    return cleanedvalue
OUTPUT:
b'Do\\xf1a Ana'
Do\xf1a Ana
Do\xf1a Ana
It looks like when I try to encode Do\xf1a Ana it's just escaping the backslash rather than converting it.
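The b'Do\\xf1a Ana' output does suggest the value holds the four literal characters \, x, f, 1 instead of the single character ñ. Assuming that's what the driver is returning (an assumption; it may depend on jaydebeapi's type mapping), a sketch of checking for this and re-interpreting the escape via the unicode_escape codec would be:

```python
import codecs
import unicodedata

# Hypothetical value with a literal backslash-x-f-1 sequence, as the
# bytearray output b'Do\\xf1a Ana' suggests.
raw = 'Do\\xf1a Ana'
print(len(raw))  # 11 characters, not 8 -- the escape really is literal text

# Re-interpret the literal escape sequence as the character it names.
fixed = codecs.decode(raw, 'unicode_escape')
print(fixed)  # Doña Ana

# Now the accent-stripping normalization behaves as in the interpreter session.
ascii_only = unicodedata.normalize('NFKD', fixed).encode('ASCII', 'ignore').decode('utf-8')
print(ascii_only)  # Dona Ana
```

Note that unicode_escape is only safe here if the source text is otherwise Latin-1-compatible, since in Python 3 it implicitly encodes the str to Latin-1 before interpreting the escapes.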
When I try normalizing the string using the interpreter:
>>> x = 'Do\xf1a Ana'
>>> x
'Doña Ana'
>>> p = bytearray(x,'utf-8')
>>> p
bytearray(b'Do\xc3\xb1a Ana')
>>> a = p.decode('utf-8')
>>> a
'Doña Ana'
>>> normal = unicodedata.normalize('NFKD', a).encode('ASCII', 'ignore').decode('utf-8')
>>> normal
'Dona Ana'
Can anyone explain what's going on? Clearly the value coming from the database is not what it appears to be, even though it comes across as a str.