
I have a database where one of the columns is of the BLOB data type ("STOREDFILEBLOB"). The data looks like this when I query the database using TOAD (Oracle):

DOC    VERK    FILENAME       STOREDFILEBLOB
434    2343    sow2.rtf       (HUGEBLOB)
342    352     soodata.doc    (HUGEBLOB)
123    456     wan_tech.doc   (HUGEBLOB)

The BLOB holds the document referenced in the 'FILENAME' column. I need to ingest the BLOB into Python in human-readable form; ideally it would be one long string containing the contents of the 'FILENAME' document. I will be using the text in the BLOB to do some text classification with machine learning. I'm using the following to read from the database. The problem is that once the data is brought into Python, the column is no longer a BLOB but an object.

import pyodbc
import pandas as pd

conn = pyodbc.connect(conn_str)

query = '''
select dockey,verkey,filename,storedfileblob
from supportdoc 
where upper(filename) like '%SOO%' or upper(filename) like '%SOW%'
fetch first 15 rows only;
'''

test = pd.read_sql(query,conn)

print(test)
DOC    VERK    FILENAME          STOREDFILEBLOB
434    2343    sow2.rtf          b'\xa0\xa0\xa0\xa0\xfc\x0e\x00\x00\xff{\\rtf1\...
342    352     soo_data.doc      b'\xa0\xa0\xa0\xa0\xd3&\x00\x00\xff\xd0\xcf\x1...
123    456     wan_tech_sow.doc  b'\xa0\xa0\xa0\xa0\x8a\x19\x00\x00\xff\xd0\xcf...  


test.dtypes

DOC                float64
VERK               float64
FILENAME           object
STOREDFILEBLOB     object
dtype: object
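
A quick check in Python (just a sketch against the test DataFrame above) suggests the object dtype simply means each cell holds a plain Python bytes value rather than anything pandas-specific:

first_blob = test['STOREDFILEBLOB'].iloc[0]
print(type(first_blob))   # expected to be <class 'bytes'>, based on the repr in the DataFrame above
print(len(first_blob))    # size of the stored file in bytes
print(first_blob[:20])    # leading bytes, e.g. b'\xa0\xa0\xa0\xa0...'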

I tried doing the conversion in the database using this SQL:

select DOC, VERK, FILENAME, UTL_RAW.CAST_TO_VARCHAR2(DBMS_LOB.substr(storedfileblob, 400,1))
from supportdoc 
where upper(filename) like '%SOO%' or upper(filename) like '%SOW%'

but I got this nonsensical output.

DOC     VERK    FILENAME                        UTL_RAW.CAST_TO_VARCHAR2(DBMS_LOB.SUBSTR(
6908    8761    SOW (9503581Q0003).rtf          ü
9535    8890    Dataequip SOW9706000Q0008.doc   j
9553    8891    9602001Q0002WritingSOW.doc      T
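
For what it's worth, hex-dumping the first bytes in Python (a quick sketch against the test DataFrame from the first query) suggests why that output looks like garbage: the BLOBs seem to start with a few non-printable header bytes (the \xa0\xa0\xa0\xa0... prefix) before the actual file content begins, so CAST_TO_VARCHAR2 is rendering bytes that were never text to begin with:

import binascii

# Hex-dump the first 16 bytes of each BLOB (assumes the `test` DataFrame
# produced by pd.read_sql above).
for fname, blob in zip(test['FILENAME'], test['STOREDFILEBLOB']):
    print(fname, binascii.hexlify(blob[:16]))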

APPROACH 2:

I've decided to step back and try to get the .pdf files decoded into human-readable form first. I updated the query to yield only .pdf files, and I'm now using cx_Oracle instead of pyodbc, since that's what everyone else seems to use.

New Code:

import cx_Oracle
import codecs

query = '''
select doc,verk,filename,storedfileblob
from supportdoc 
where (upper(filename) like '%SOO%' or upper(filename) like '%SOW%' or upper(filename) like '%PWS%')
and (substr(upper(filename),-3) like '%PDF%')
fetch first 15 rows only
'''

dsn = cx_Oracle.makedsn(host, port, sid)  
orcl = cx_Oracle.connect(username+'/'+password+'@'+dsn)
curs = orcl.cursor()
curs.execute(query)
rows = curs.fetchall()


for row in rows:
    filename = 'F:/Users/Acme'+'/contract_blob/'+str(row[0])+'_'+str(row[1])+'_dockey_verkey.pdf'
    f = codecs.open(filename, encoding='utf-16', mode='wb+')
    f.write(row[3].read())
    f.close()

The code above yields:

TypeError: utf_16_encode() argument 1 must be str, not bytes

I picked utf-16 for the encoding more or less at random. From what I read, PDFs tend to use utf-16 (at least that's how I understood it), but I'm just grasping at straws.
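
My best guess at the moment (just a sketch, not tested end-to-end, and it assumes row[3] is a cx_Oracle LOB whose read() returns bytes) is that the file should be written in plain binary mode rather than through codecs, since the error says the writer wants str while the BLOB is raw bytes:

# Binary-mode variant of the loop above (same rows from curs.fetchall()).
for row in rows:
    filename = 'F:/Users/Acme' + '/contract_blob/' + str(row[0]) + '_' + str(row[1]) + '_dockey_verkey.pdf'
    with open(filename, 'wb') as f:    # binary mode, no text encoding involved
        f.write(row[3].read())         # cx_Oracle LOB.read() returns bytes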

At the end of the day, my objective is the same: retrieve the BLOBs from the database, decode them, and get the human-readable documents. I'm starting with the PDF documents. There are also .doc, .png, and .zip BLOB files. I'm hoping that once I get the method working for the PDFs, it will be easier to tackle the .doc BLOBs. I'll probably ignore the .png and .zip files. Any help is appreciated.
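
For reference, the end state I'm after would look something like the sketch below. It assumes pdfminer.six is installed and that the BLOB bytes are a valid PDF (which the header bytes above make me doubt), and pdf_blob_to_text / blob_bytes are just names I made up for illustration:

import io
from pdfminer.high_level import extract_text

# Turn one PDF BLOB into a plain-text string for the classifier.
# blob_bytes would be one value from the STOREDFILEBLOB column.
def pdf_blob_to_text(blob_bytes):
    return extract_text(io.BytesIO(blob_bytes))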
