I'm working on a script that will ETL data from an Oracle database to PostgreSQL. I'm using jaydebeapi to connect to Oracle and psycopg2 for PostgreSQL. I load the data into PostgreSQL by streaming it into the copy_from function, which worked well for my earlier ETL from a MySQL database. I'm having an issue with one string, but I'm sure there may be others. I have a function that evaluates every field in the result set from Oracle and cleans it up if it's a string. In the source database, Doña Ana is stored in the county table, but it comes across as Do\xf1a Ana, so when I try to load it into PostgreSQL, it throws:

invalid byte sequence for encoding "UTF8": 0xf1 0x61 0x20 0x41

import six
import unicodedata

def prepdata(value):
    encodedvalue = bytearray(value, 'utf-8')
    print(encodedvalue)
    decodedvalue = encodedvalue.decode('utf-8')
    print(decodedvalue)
    cleanedvalue = unicodedata.normalize(u'NFD', decodedvalue).encode('ASCII', 'ignore').decode('utf-8')
    print(cleanedvalue)
    return cleanedvalue

OUTPUT:

b'Do\\xf1a Ana'                                                                  
Do\xf1a Ana
Do\xf1a Ana                                                                      

It looks like when I try to encode Do\xf1a Ana, it's just escaping the backslash rather than converting the sequence to ñ.
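A quick way to check that suspicion, as a minimal sketch: compare a string that holds a literal backslash followed by `xf1` with one that holds the actual character. The raw string `from_db` below is only a hypothetical stand-in for what the driver appears to return; I have not confirmed that this is exactly what jaydebeapi hands back.

```python
# Hypothetical stand-in for the database value: a raw string keeps the
# backslash, 'x', 'f' and '1' as four separate literal characters.
from_db = r'Do\xf1a Ana'

# The same text typed into the interpreter: Python turns \xf1 into ñ.
typed = 'Do\xf1a Ana'

# The lengths differ because the escape was never interpreted in from_db.
print(len(from_db), len(typed))

# Encoding the literal version only escapes the backslash in the repr,
# which matches the output printed by prepdata() above.
print(from_db.encode('utf-8'))

# Encoding the real character produces the two-byte UTF-8 form of ñ.
print(typed.encode('utf-8'))
```

If the lengths differ for the real value too, the data contains the escape sequence as text rather than the character itself.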

When I try normalizing the string using the interpreter:

>>> x = 'Do\xf1a Ana'
>>> x
'Doña Ana'
>>> p = bytearray(x,'utf-8')
>>> p
bytearray(b'Do\xc3\xb1a Ana')
>>> a = p.decode('utf-8')
>>> a
'Doña Ana'
>>> normal = unicodedata.normalize('NFKD', a).encode('ASCII', 'ignore').decode('utf-8')
>>> normal
'Dona Ana'

Can anyone explain what's going on? Obviously the value coming from the database has something going on with it even though it's coming across as a str.

  • So your code works just fine in the interpreter? At what line do you get the error? I ran all of your code and it worked in my interpreter. Commented Aug 1, 2019 at 17:48
  • Yes, it works fine in the interpreter, but when I run the script with actual data from the database, it fails on the 'Do\xf1a Ana' value. In this case it fails when it attempts to load the data into PostgreSQL; the database is encoded as UTF-8. I don't fully understand the encoding/decoding details, but I believe the database should accept the letter 'n' with a tilde. Commented Aug 1, 2019 at 18:08

1 Answer

I was able to get this to work by first encoding the string to bytes and then decoding it with the `unicode_escape` codec.

def prepdata(value):
    # Encode to bytes first; the unicode_escape codec decodes bytes, not str.
    encodedvalue = value.encode()
    # Turn literal escape sequences such as \xf1 into the characters they name.
    decodedvalue = encodedvalue.decode('unicode_escape')
    cleanedvalue = decodedvalue.replace("\r", " ")
    # A list of other things also happens below,
    # cleaning the string of things that may cause issues, like '\n'.
    return cleanedvalue
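One caveat worth noting, offered as a sketch rather than a definitive fix: `unicode_escape` treats every non-escape byte as Latin-1, so a value that already contains a real non-ASCII character gets mangled by the encode/decode round trip. If the data mixes both forms, a narrower alternative is to replace only the literal `\xNN` sequences. The helper name `decode_hex_escapes` is hypothetical, and the assumption that only two-digit `\x` escapes occur in the data is mine.

```python
import re

# Matches a literal backslash, an 'x', and exactly two hex digits.
HEX_ESCAPE = re.compile(r'\\x([0-9a-fA-F]{2})')

def decode_hex_escapes(value):
    # Replace each literal \xNN sequence with the character it names,
    # leaving every other character in the string untouched.
    return HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), value)

# Literal escape sequence: both approaches recover the ñ.
escaped = r'Do\xf1a Ana'
print(decode_hex_escapes(escaped))
print(escaped.encode().decode('unicode_escape'))

# Real character already present: the regex leaves it alone, while the
# unicode_escape round trip reads the UTF-8 bytes as Latin-1.
already_ok = 'Doña Ana'
print(decode_hex_escapes(already_ok))
print(already_ok.encode().decode('unicode_escape'))
```

The regex version does not handle other escapes such as `\n` or `\u00f1`; those would still need the extra cleaning steps mentioned in the comments of `prepdata`.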