Converting a PDF file (or any binary) to a string in python (not grab text out of pdf)

Question

I am using an API that only takes strings. It's intended to store things. I would like to read in a binary file, convert the binary data to a string, and store the string. Then I would like to retrieve the string, convert it back to binary, and save the file.

So what I am trying to do is (in Python):

PDF -> load into program as string -> store string -> retrieve string -> save as binary PDF file

For example, I have a PDF called PDFfile. I want to read it in:

datafile=open(PDFfile,'rb')
pdfdata=datafile.read()

When I read up on the .read function it says that it's supposed to result in a string. It does not, or if it does, it's taking the parts that define it as binary along with it. I have two lines of code later that print it out:

print(pdfdata[:20])
print(str(pdfdata[:20]))

The result is this:

b'%PDF-1.3\n%\xc4\xe5\xf2\xe5\xeb\xa7\xf3\xa0\xd0\xc4'
b'%PDF-1.3\n%\xc4\xe5\xf2\xe5\xeb\xa7\xf3\xa0\xd0\xc4'

Those look like 2 bytes types to me, but apparently, the second one is a string. When I do type(pdfdata) I get bytes.
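
A minimal sketch of why the two prints look identical: calling `str()` on a `bytes` object without an encoding argument returns its printable representation, `b'...'` wrapper and escape sequences included, rather than decoding the data. The `sample` value is just the 20 bytes copied from the output above.

```python
# The first 20 bytes of the PDF, copied from the printed output above.
sample = b'%PDF-1.3\n%\xc4\xe5\xf2\xe5\xeb\xa7\xf3\xa0\xd0\xc4'

as_text = str(sample)      # the repr of the bytes, as a str

print(type(sample))        # <class 'bytes'>
print(type(as_text))       # <class 'str'>
print(len(sample))         # 20
print(len(as_text))        # 54
print(as_text[:2])         # b'
```

So the second print really is a string, but it is a string that spells out `b'%PDF-1.3\n...'` character by character.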

I am struggling to try to get a clean string that represents the PDF file, that I can then convert back to a bytes format. The API fails if I send it without stringifying it.

str(pdfdata)

I have also tried playing around with encode and decode, but I get errors that encode/decode can't handle 0xc4, which is apparently in the binary file.
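
For illustration, a small sketch of that failure, again using the 20 sample bytes from the output above. Decoding as UTF-8 fails because 0xc4 followed by 0xe5 is not a valid UTF-8 sequence. Latin-1, by contrast, maps every byte value 0 to 255 onto one character, so it round-trips in Python. Whether a given storage API keeps all of those characters (control characters included) unchanged is an assumption that would need checking.

```python
sample = b'%PDF-1.3\n%\xc4\xe5\xf2\xe5\xeb\xa7\xf3\xa0\xd0\xc4'

# UTF-8 cannot decode arbitrary binary data.
try:
    sample.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)                 # False

# Latin-1 assigns one character to each possible byte value.
text = sample.decode('latin-1')
restored = text.encode('latin-1')
print(len(text))               # 20
print(restored == sample)      # True
```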

The final oddity:

When I store the str(pdfdata) and retrieve it into 'retdata', I print some bytes out of it and compare them to the original:

print(pdfdata[:20])
print(retdata[:20])

I get really different results:

b'%PDF-1.3\n%\xc4\xe5\xf2\xe5\xeb\xa7\xf3\xa0\xd0\xc4'
b'%PDF-1.3\n%\xc4\xe

But the data is there if I show 50 characters of retdata:

b'%PDF-1.3\n%\xc4\xe5\xf2\xe5\xeb\xa7\xf3\xa0\xd0\

Needless to say, when I retrieve the data and store it as a PDF, it's corrupted and doesn't work.
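
A sketch that reproduces the mismatch, assuming `retdata` is exactly what `str(pdfdata)` produced (here built from the 20 sample bytes shown earlier). Slicing the string counts characters of the printed representation, so each `\xc4` takes four characters instead of one byte, and encoding that text gives different bytes from the original.

```python
sample = b'%PDF-1.3\n%\xc4\xe5\xf2\xe5\xeb\xa7\xf3\xa0\xd0\xc4'
retdata = str(sample)          # stands in for the string returned by storage

print(sample[:20])             # all 20 bytes
print(retdata[:20])            # b'%PDF-1.3\n%\xc4\xe

# Encoding the text writes the characters b, ', %, P, ... and not the PDF bytes.
written = retdata.encode('utf-8')
print(written == sample)       # False
print(written[:2])             # b"b'"
```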

When I save the stringified PDF and the string version of the retrieved data, they are identical, so the storage and retrieval of a string is working fine.

So I think the corruption is happening when I convert to a string.

I know I'm getting loquacious, but you guys like to have all the info.

The storage API only takes strings, and it is not part of the problem: what I send to the API is exactly what I get back. So the issue is creating the string from a binary file. It seems to me a binary file is a series of bits. I should be able to take every 8 bits and create a character out of it, thus creating a string. Then, to convert back to binary, I should be able to take each character, convert it to a series of bits, and create a binary file.
– Shandor, Commented May 21, 2019 at 12:36
Oh! I freaking got it! The way I got it to work is: 1) load in the binary data with a binary file read, 2) encode the binary data with codecs.encode(data, 'base64'), 3) the result is type 'bytes', so it needs converting to a string with data.decode('utf-8'), 4) now it can be stored. Then to recover it you do the reverse. And this freaking worked with a PDF file! Is there a better way?
– Shandor, Commented May 21, 2019 at 13:31

Shandor · Accepted Answer · 2019-05-21 14:20:43Z

7

OK I got this to work. The key was to properly encode the binary data BEFORE it was turned into a string.

Step 1) Read in binary data

datafile=open(PDFfile,'rb')
pdfdatab=datafile.read()    #this is binary data
datafile.close()

Step 2) encode the data into a bytes array

import codecs
b64PDF = codecs.encode(pdfdatab, 'base64')

Step 3) convert bytes array into a string

Sb64PDF=b64PDF.decode('utf-8')

Now the string can be stored. To get it back, you just go through the reverse. Load the string data from storage into the string variable retdata.

# we have a string and want it to be bytes
bretdata=retdata.encode('utf-8')

# now let's get it back into the binary file format
bPDFout=codecs.decode(bretdata, 'base64')

# open a new file and write the decoded data into it
datafile=open(newPDFFile,'wb')
datafile.write(bPDFout)
datafile.close()
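
The same round trip can also be sketched with the standard `base64` module and `with` blocks, which close the files automatically. The helper names `file_to_text` and `text_to_file` are made up for this example. One difference worth knowing: `codecs.encode(data, 'base64')` inserts a newline after every 76 characters of output, while `base64.b64encode` returns a single unbroken line, which may be easier on a storage API that is picky about newlines.

```python
import base64

def file_to_text(path):
    """Read any binary file and return its contents as a Base64 string."""
    with open(path, 'rb') as f:
        raw = f.read()
    # b64encode returns bytes made only of ASCII characters.
    return base64.b64encode(raw).decode('ascii')

def text_to_file(text, path):
    """Decode a Base64 string and write the original bytes to path."""
    raw = base64.b64decode(text)
    with open(path, 'wb') as f:
        f.write(raw)
```

Usage would look like `stored = file_to_text(PDFfile)` before sending to storage, and `text_to_file(retdata, newPDFFile)` after retrieving it.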


1 Comment

Sampada Over a year ago

I'm having the same issue, but even this solution didn't work. The output PDF file couldn't be opened, and when I open it in Notepad it shows text that looks like some other language. Can you suggest some other workaround?
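
Without more detail it is hard to say what went wrong in a case like this. One way to narrow it down, sketched below, is to compare a hash of the original file with a hash of the restored file and to check that the restored file still begins with the `%PDF-` marker seen in the question's output. The helper names `sha256_of` and `looks_like_pdf` are made up for this example.

```python
import hashlib

def sha256_of(path):
    """Return the SHA-256 hex digest of a file's raw bytes."""
    with open(path, 'rb') as f:
        return hashlib.sha256(f.read()).hexdigest()

def looks_like_pdf(path):
    """Check that the file starts with the %PDF- header marker."""
    with open(path, 'rb') as f:
        return f.read(5) == b'%PDF-'
```

If `sha256_of(original)` and `sha256_of(restored)` differ, the bytes changed somewhere between encoding and decoding, for example because the stored string was altered or the output file was opened in text mode instead of `'wb'`.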
