Working with 'HUGEBLOB' datatype in python

Asked 6 years ago

Modified 6 years ago

Viewed 353 times

I have an Oracle database in which one column, STOREDFILEBLOB, has the BLOB data type. When I query it in TOAD, the data looks like this:

DOC    VERK    FILENAME       STOREDFILEBLOB
434    2343    sow2.rtf       (HUGEBLOB)
342    352     soodata.doc    (HUGEBLOB)
123    456     wan_tech.doc   (HUGEBLOB)

The BLOB holds the document named in the FILENAME column. I need to read each BLOB into Python in human-readable form; ideally it would be one long string containing the document's text. I will use that text for text classification with machine learning. I'm reading from the database with the code below. The problem is that once the data is in Python, the column is no longer a BLOB but an object.

import pandas as pd
import pyodbc

conn = pyodbc.connect(conn_str)  # conn_str: ODBC connection string, defined elsewhere

query = '''
select dockey,verkey,filename,storedfileblob
from supportdoc 
where upper(filename) like '%SOO%' or upper(filename) like '%SOW%'
fetch first 15 rows only;
'''

test = pd.read_sql(query,conn)

print(test)
DOC    VERK    FILENAME          STOREDFILEBLOB
434    2343    sow2.rtf          b'\xa0\xa0\xa0\xa0\xfc\x0e\x00\x00\xff{\\rtf1\...
342    352     soo_data.doc      b'\xa0\xa0\xa0\xa0\xd3&\x00\x00\xff\xd0\xcf\x1...
123    456     wan_tech_sow.doc  b'\xa0\xa0\xa0\xa0\x8a\x19\x00\x00\xff\xd0\xcf...  


test.dtypes

DOC                float64
VERK               float64
FILENAME           object
STOREDFILEBLOB     object
dtype: object
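A note on the output above: pandas reports `object` for any column that holds arbitrary Python values, and here each cell is a `bytes` value, so the file content is still there. The printed cells also suggest that the recognizable file signature (`{\rtf1` for RTF, `\xd0\xcf...` for legacy Word) starts 9 bytes in, after what looks like an application-specific prefix. The sketch below is one possible way to locate the real payload by searching for known signatures. The function name `strip_wrapper`, the 64-byte search window, and the completion of the truncated `\xd0\xcf\x1...` to the usual OLE2 signature `\xd0\xcf\x11\xe0` are assumptions, not something the data confirms.

```python
# Well-known "magic bytes" for the formats mentioned in the question.
SIGNATURES = {
    "rtf": b"{\\rtf",
    "ole2": b"\xd0\xcf\x11\xe0",    # legacy .doc/.xls/.ppt container
    "pdf": b"%PDF",
    "zip": b"PK\x03\x04",           # also used by .docx/.xlsx
    "png": b"\x89PNG\r\n\x1a\n",
}


def strip_wrapper(blob, search_window=64):
    """Return (kind, payload), where payload starts at the earliest known
    signature found in the first `search_window` bytes.
    Returns (None, blob) when no signature is found."""
    head = bytes(blob[:search_window])
    best = None  # (kind, position) of the earliest match so far
    for kind, magic in SIGNATURES.items():
        pos = head.find(magic)
        if pos != -1 and (best is None or pos < best[1]):
            best = (kind, pos)
    if best is None:
        return None, bytes(blob)
    kind, pos = best
    return kind, bytes(blob[pos:])


# Sample modeled on the first row printed above (content after the
# signature is made up for the demo).
sample = b"\xa0\xa0\xa0\xa0\xfc\x0e\x00\x00\xff{\\rtf1\\ansi Statement of work}"
kind, payload = strip_wrapper(sample)
print(kind, payload[:6])
```

If the prefix turns out to have a fixed length in every row, slicing it off directly would be simpler; the signature search only avoids hard-coding that length.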

I tried doing the conversion in the database using this SQL:

select DOC, VERK, FILENAME, UTL_RAW.CAST_TO_VARCHAR2(DBMS_LOB.substr(storedfileblob, 400,1))
from supportdoc 
where upper(filename) like '%SOO%' or upper(filename) like '%SOW%'

but I got this nonsensical output.

DOC     VERK    FILENAME                            UTL_RAW.CAST_TO_VARCHAR2(DBMS_LOB.SUBSTR(
6908    8761    SOW (9503581Q0003).rtf          ü
9535    8890    Dataequip SOW9706000Q0008.doc       j
9553    8891    9602001Q0002WritingSOW.doc      T
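One plausible reading of that output, offered as a guess rather than a diagnosis: `UTL_RAW.CAST_TO_VARCHAR2` reinterprets the raw bytes as characters without converting the file format, so binary header bytes show up as stray characters. The sketch below reproduces the effect in Python on bytes modeled on the first row printed earlier. It assumes a single-byte character set (Latin-1 here) and a client that stops displaying at the first NUL byte; both are assumptions, and the helper names are made up for illustration.

```python
# First bytes of a BLOB, modeled on the pandas output shown earlier.
sample = b"\xa0\xa0\xa0\xa0\xfc\x0e\x00\x00\xff{\\rtf1"


def as_text(raw, encoding="latin-1"):
    """Reinterpret raw bytes as characters, one byte per character."""
    return raw.decode(encoding)


def visible_until_nul(raw, encoding="latin-1"):
    """Approximate what a client might display: the text before the first
    NUL byte, with whitespace and non-printable characters dropped."""
    text = raw.split(b"\x00", 1)[0].decode(encoding)
    return "".join(ch for ch in text if ch.isprintable() and not ch.isspace())


print(repr(as_text(sample)))      # non-breaking spaces, control bytes, then {\rtf1
print(visible_until_nul(sample))  # ü
```

Under those assumptions the only visible character is `ü`, which resembles the first row of the output above. Either way, the bytes need a format-aware reader rather than a character cast.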

APPROACH 2:

I've decided to step back and first get the .pdf files decoded into human-readable form. I updated the query to return only .pdf files, and I switched from pyodbc to cx_Oracle since everyone else is using that.

New Code:

import codecs
import cx_Oracle

query = '''
select doc,verk,filename,storedfileblob
from supportdoc 
where (upper(filename) like '%SOO%' or upper(filename) like '%SOW%' or upper(filename) like '%PWS%')
and (substr(upper(filename),-3) like '%PDF%')
fetch first 15 rows only
'''

dsn = cx_Oracle.makedsn(host, port, sid)  
orcl = cx_Oracle.connect(username+'/'+password+'@'+dsn)
curs = orcl.cursor()
curs.execute(query)
rows = curs.fetchall()


for row in rows:
    filename = 'F:/Users/Acme'+'/contract_blob/'+str(row[0])+'_'+str(row[1])+'_dockey_verkey.pdf'
    f = codecs.open(filename, encoding='utf-16', mode='wb+')
    f.write(row[3].read())
    f.close()

The code above yields:

TypeError: utf_16_encode() argument 1 must be str, not bytes

I picked utf-16 for the encoding more or less at random. From what I read, PDFs tend to be UTF-16 (at least that's what I understood), but I'm just grasping at straws.
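On the TypeError itself: `row[3].read()` returns `bytes`, while `codecs.open(..., encoding='utf-16')` expects `str` and tries to encode whatever is written. A PDF is a binary format, so no text encoding should be applied when saving it. The sketch below writes the bytes unchanged using the built-in `open` in binary mode. The helper name `save_blob` is made up, and `io.BytesIO` stands in for the cx_Oracle LOB so the example runs without a database.

```python
import io
import os
import tempfile


def save_blob(path, blob):
    """Write BLOB content to disk byte for byte and return the byte count."""
    # A cx_Oracle LOB exposes .read(); bytes fetched another way do not.
    data = blob.read() if hasattr(blob, "read") else bytes(blob)
    # Plain binary mode: no encoding argument, so nothing tries to encode the bytes.
    with open(path, "wb") as f:
        f.write(data)
    return len(data)


# Demo with stand-in data; io.BytesIO plays the role of a LOB here.
fake_pdf = b"%PDF-1.4\n\xe2\xe3\xcf\xd3 binary payload\n"
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "434_2343_dockey_verkey.pdf")
    written = save_blob(target, io.BytesIO(fake_pdf))
    with open(target, "rb") as f:
        assert f.read() == fake_pdf
print(written, "bytes written")
```

In the loop from the question this would be called as `save_blob(filename, row[3])`. If the PDF BLOBs carry the same extra leading bytes seen in the .rtf and .doc rows earlier, the saved file may not open in a viewer until those bytes are removed. Getting text out of the saved PDF is a separate step that needs a PDF-parsing library.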

At the end of the day, my objective is the same: I need to retrieve the BLOBs from the database, decode them, and get the human-readable documents. I'm starting with the PDF documents. There are also .doc, .png, and .zip BLOBs. I'm hoping that once I perfect the method for the PDFs, it will be easier to tackle the .doc BLOBs. I'll probably ignore the .png and .zip files. Any help is appreciated.

edited Nov 14, 2019 at 16:10

asked Nov 13, 2019 at 22:36

wolf7687


This might help: stackoverflow.com/questions/51868112/…
– Paul, Nov 14, 2019 at 7:47

On the cx_Oracle side, review the LOB documentation: cx-oracle.readthedocs.io/en/latest/user_guide/lob_data.html. Either query mode will give you the binary data in Python. You will then need to use some Python utility to convert whatever format it is in (Word format?) into something readable.
– Christopher Jones, Nov 14, 2019 at 9:43
