I have a database where one of the columns, STOREDFILEBLOB, has the BLOB data type. The data looks like this when I query the database using TOAD (Oracle):
DOC VERK FILENAME STOREDFILEBLOB
434 2343 sow2.rtf (HUGEBLOB)
342 352 soodata.doc (HUGEBLOB)
123 456 wan_tech.doc (HUGEBLOB)
The BLOB holds the document named in the FILENAME column. I need to get each BLOB into Python in human-readable form; ideally it would be one long string with the contents of the FILENAME document. I will be using the text from the BLOBs to do some text classification with machine learning. I'm using the following to read from the database. The problem is that once the data is brought into Python, the column is no longer a BLOB but an object.
import pandas as pd
import pyodbc

# conn_str is the ODBC connection string (not shown)
conn = pyodbc.connect(conn_str)
query = '''
select dockey,verkey,filename,storedfileblob
from supportdoc
where upper(filename) like '%SOO%' or upper(filename) like '%SOW%'
fetch first 15 rows only;
'''
test = pd.read_sql(query,conn)
print(test)
DOC VERK FILENAME STOREDFILEBLOB
434 2343 sow2.rtf b'\xa0\xa0\xa0\xa0\xfc\x0e\x00\x00\xff{\\rtf1\...
342 352 soo_data.doc b'\xa0\xa0\xa0\xa0\xd3&\x00\x00\xff\xd0\xcf\x1...
123 456 wan_tech_sow.doc b'\xa0\xa0\xa0\xa0\x8a\x19\x00\x00\xff\xd0\xcf...
test.dtypes
DOC float64
VERK float64
FILENAME object
STOREDFILEBLOB object
dtype: object
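A note on the output above: pandas uses the `object` dtype for any column that holds arbitrary Python objects, and the `b'...'` prefix in the printout shows that each cell is a `bytes` value, so the raw file content does seem to have arrived. In the first row the RTF marker `{\rtf1` is visible after what looks like a 9-byte prefix. Below is a minimal sketch of locating where the real file starts by searching for well-known file signatures. The helper name `find_payload`, the signature table, and the idea that the leading bytes are a wrapper that can be skipped are assumptions for illustration, not something confirmed about this database.

```python
# Well-known leading signatures ("magic bytes") for the file types mentioned.
SIGNATURES = {
    "rtf": b"{\\rtf",
    "ole2 (.doc)": b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1",
    "pdf": b"%PDF-",
    "png": b"\x89PNG\r\n\x1a\n",
    "zip": b"PK\x03\x04",
}

def find_payload(blob, search_window=64):
    """Return (kind, offset) of the earliest known signature near the start,
    or (None, -1) if none is found within search_window bytes."""
    head = blob[:search_window]
    best = (None, -1)
    for kind, magic in SIGNATURES.items():
        pos = head.find(magic)
        if pos != -1 and (best[1] == -1 or pos < best[1]):
            best = (kind, pos)
    return best

# First bytes of the sow2.rtf row as printed above (the rest is truncated there).
sample = b"\xa0\xa0\xa0\xa0\xfc\x0e\x00\x00\xff{\\rtf1"
kind, offset = find_payload(sample)
print(kind, offset)          # rtf 9
payload = sample[offset:]    # bytes from the signature onward
```

If the offset turns out to be the same for every row, slicing it off before saving or parsing may be enough; whether the prefix carries anything that matters (a length field, for example) would need to be checked against the application that wrote the BLOBs.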
I tried doing the conversion in the database using this SQL:
select DOC, VERK, FILENAME, UTL_RAW.CAST_TO_VARCHAR2(DBMS_LOB.substr(storedfileblob, 400,1))
from supportdoc
where upper(filename) like '%SOO%' or upper(filename) like '%SOW%'
but I got this nonsensical output:
DOC VERK FILENAME UTL_RAW.CAST_TO_VARCHAR2(DBMS_LOB.SUBSTR(
6908 8761 SOW (9503581Q0003).rtf ü
9535 8890 Dataequip SOW9706000Q0008.doc j
9553 8891 9602001Q0002WritingSOW.doc T
APPROACH 2:
I've decided to step back and first try to get the .pdf files decoded into human-readable form. I updated the query to return only .pdf files, and I'm now using cx_Oracle instead of pyodbc since everyone else is using that.
New Code:
query = '''
select doc,verk,filename,storedfileblob
from supportdoc
where (upper(filename) like '%SOO%' or upper(filename) like '%SOW%' or upper(filename) like '%PWS%')
and (substr(upper(filename),-3) like '%PDF%')
fetch first 15 rows only
'''
import codecs
import cx_Oracle

dsn = cx_Oracle.makedsn(host, port, sid)
orcl = cx_Oracle.connect(username+'/'+password+'@'+dsn)
curs = orcl.cursor()
curs.execute(query)
rows = curs.fetchall()
for row in rows:
    # Build the output path from the first two columns of the row
    filename = 'F:/Users/Acme'+'/contract_blob/'+str(row[0])+'_'+str(row[1])+'_dockey_verkey.pdf'
    f = codecs.open(filename, encoding='utf-16', mode='wb+')
    # row[3] is the LOB; read() returns bytes, and this write is where the error occurs
    f.write(row[3].read())
    f.close()
The code above yields:
TypeError: utf_16_encode() argument 1 must be str, not bytes
I picked utf-16 more or less at random for the encoding. I did some research, and PDFs tend to be utf-16 (at least that's what I understood from what I read). I'm just grasping at straws.
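For context on the error: `codecs.open` with an `encoding` expects `str` and encodes it on write, whereas `read()` on a BLOB returns `bytes`. A PDF is a binary file format, so no text encoding applies when saving it. Below is a minimal sketch of writing the bytes unchanged with the built-in `open` in `'wb'` mode. `FakeLob` is a hypothetical stand-in for the cx_Oracle LOB object so that the snippet runs without a database, and `save_blob`, the file name and the placeholder bytes are illustrative only.

```python
import os
import tempfile

class FakeLob:
    """Hypothetical stand-in for a cx_Oracle LOB: read() returns the BLOB as bytes."""
    def __init__(self, data):
        self._data = data

    def read(self):
        return self._data

def save_blob(lob, path):
    """Write the BLOB's bytes to path unchanged (binary mode, no encoding)."""
    data = lob.read()
    with open(path, "wb") as f:
        f.write(data)
    return len(data)

# Stand-in for one fetched row: (doc, verk, filename, storedfileblob).
row = (434, 2343, "example.pdf", FakeLob(b"%PDF-1.4 placeholder bytes"))
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, f"{row[0]}_{row[1]}_dockey_verkey.pdf")
written = save_blob(row[3], out_path)
```

Saving the bytes this way only reproduces the stored file. If the stored bytes carry a prefix before the real file, as the printout for the .rtf and .doc rows suggests, the saved file may still not open until that prefix is removed, and getting readable text out of a valid PDF is a separate step that needs a PDF parsing library.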
At the end of the day, my objective is the same: I need to retrieve the BLOBs from the database, decode them, and get the human-readable documents. I'm starting with the PDF documents. There are also .doc, .png, and .zip BLOB files. I'm hoping that once I have a working method for the .pdf files, it will be easier to tackle the .doc BLOBs. I'll probably ignore the .png and .zip files. Any help is appreciated.
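Since the plan is to handle .pdf first, then .doc, and to skip .png and .zip, one way to organize the work is to bucket the rows by file extension before any decoding. A small sketch follows; `bucket_by_extension` is an illustrative helper, and the last three file names in the sample list are made up for the example.

```python
import os
from collections import defaultdict

SKIP = {".png", ".zip"}   # types that will be ignored

def bucket_by_extension(filenames):
    """Group file names by lowercased extension, leaving out skipped types."""
    buckets = defaultdict(list)
    for name in filenames:
        ext = os.path.splitext(name)[1].lower()
        if ext not in SKIP:
            buckets[ext].append(name)
    return dict(buckets)

names = ["sow2.rtf", "soo_data.doc", "wan_tech_sow.doc",
         "scan.PNG", "bundle.zip", "pws.PDF"]
buckets = bucket_by_extension(names)
print(buckets)
```

Each bucket can then be passed to a format-specific extractor, starting with `.pdf`.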