I am using an API that only takes strings. It's intended to store things. I would like to be able to read in a binary file, convert the binary data to a string, and store the string. Then I would like to retrieve the string, convert it back to binary, and save the file.

So what I am trying to do is (in Python):

PDF -> load into program as string -> store string -> retrieve string -> save as binary PDF file

For example, I have a PDF called PDFfile. I want to read it in:

datafile=open(PDFfile,'rb')
pdfdata=datafile.read()

When I read up on the .read function, it says that it's supposed to return a string. It does not, or if it does, it's including the parts that mark it as binary. Later on I have two lines of code that print it out:

print(pdfdata[:20])
print(str(pdfdata[:20]))

The result is this:

b'%PDF-1.3\n%\xc4\xe5\xf2\xe5\xeb\xa7\xf3\xa0\xd0\xc4'
b'%PDF-1.3\n%\xc4\xe5\xf2\xe5\xeb\xa7\xf3\xa0\xd0\xc4'

Those look like 2 bytes types to me, but apparently, the second one is a string. When I do type(pdfdata) I get bytes.
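A minimal sketch of what seems to be happening here, using a short made-up byte string in place of the real PDF data: in Python 3, calling str() on a bytes object does not decode it. It returns the printable representation, including the leading b and the quotes, which is why both prints look the same.

```python
# Hypothetical stand-in for the first few bytes of a PDF file.
sample = b'%PDF-1.3\n%\xc4\xe5'

# No decoding happens here; the result is the same text that repr() gives.
as_text = str(sample)

print(type(sample))              # <class 'bytes'>
print(type(as_text))             # <class 'str'>
print(as_text == repr(sample))   # True
print(len(sample), len(as_text)) # the string is longer than the bytes
```

So the second value really is a str, but its content is the text of the bytes literal rather than the file data itself.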

I am struggling to try to get a clean string that represents the PDF file, that I can then convert back to a bytes format. The API fails if I send it without stringifying it.

str(pdfdata)

I have also tried playing around with encode and decode, but I get errors that encode/decode can't handle 0xc4, which is apparently in the binary file.
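The decode error can be reproduced with a few bytes. This is only a sketch with a made-up sample: a text codec such as UTF-8 expects the bytes to follow its own rules, and arbitrary binary data usually does not.

```python
# Hypothetical stand-in for the start of the PDF data.
sample = b'%PDF-1.3\n%\xc4\xe5'

try:
    sample.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError as err:
    # 0xc4 starts a two-byte UTF-8 sequence, but 0xe5 is not a valid second byte.
    decoded_ok = False
    print('decode failed:', err)
```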

The final oddity:

When I store the str(pdfdata) and retrieve it into 'retdata', I print some bytes out of it and compare them to the original:

print(pdfdata[:20])
print(retdata[:20])

I get really different results:

b'%PDF-1.3\n%\xc4\xe5\xf2\xe5\xeb\xa7\xf3\xa0\xd0\xc4'
b'%PDF-1.3\n%\xc4\xe

But the data is there if I show 50 characters of retdata:

b'%PDF-1.3\n%\xc4\xe5\xf2\xe5\xeb\xa7\xf3\xa0\xd0\

Needless to say, when I retrieve the data and store it as a PDF, it's corrupted and doesn't work.

When I save the stringified PDF and the string version of the retrieved data, they are identical, so the storage and retrieval of a string is working fine.

So I think the corruption is happening when I convert to a string.
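A small sketch of why the two slices differ, assuming retdata is exactly the str(pdfdata) text that was stored: slicing the bytes object counts bytes, while slicing the string counts characters of the printed form, where a single byte such as \xc4 takes four characters. The sample value is copied from the output above; ast.literal_eval is shown only as one possible way to turn such a string back into bytes.

```python
import ast

# The 20 bytes printed in the question, used as a stand-in for the whole file.
sample = b'%PDF-1.3\n%\xc4\xe5\xf2\xe5\xeb\xa7\xf3\xa0\xd0\xc4'
retdata = str(sample)  # what the question stores and gets back

print(sample[:20])     # the first 20 bytes
print(retdata[:20])    # the first 20 characters of the printed form

# Writing retdata to a file saves the text "b'%PDF-...'" instead of the data.
# If the stored string is exactly this printed form, it can be parsed back:
recovered = ast.literal_eval(retdata)
print(recovered == sample)  # True
```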

I know I'm getting loquacious, but you guys like to have all the info.

  • Because the storage API only takes strings. The storage API is not part of the problem; what I send to the API is exactly what I get back. So the issue is creating the string from a binary file. Seems to me a binary file is a series of bits. I should be able to take every 8 bits and create a character out of it, thus creating a string. Then to convert back to binary, I should be able to take each character, convert it to a series of bits, and create a binary file. Commented May 21, 2019 at 12:36
  • Oh! I freaking got it! The way I got it to work is: 1) load in binary data with a binary file read, 2) encode the binary data with codecs.encode(data, 'base64'), 3) the result is type 'bytes', so convert it to a string with data.decode('utf-8'), 4) now it can be stored. Then to recover, you do the reverse. And this freaking worked with a PDF file! Is there a better way? Commented May 21, 2019 at 13:31
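The "one character per 8 bits" idea from the first comment corresponds to the latin-1 codec, which maps each byte value 0-255 to the character with the same code point. A minimal sketch, with the caveat that the resulting string contains control characters and non-ASCII characters, and whether a given storage API returns those unchanged is an assumption that would need checking:

```python
# Every possible byte value, as a stand-in for arbitrary file contents.
raw = bytes(range(256))

text = raw.decode('latin-1')        # one character per byte
restored = text.encode('latin-1')   # back to the original bytes

print(len(text))        # 256
print(restored == raw)  # True
```

Base64 produces a longer string but uses only plain ASCII characters, which tends to be the safer choice for text-only storage.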

1 Answer

OK I got this to work. The key was to properly encode the binary data BEFORE it was turned into a string.

Step 1) Read in binary data

with open(PDFfile, 'rb') as datafile:
    pdfdatab = datafile.read()    # raw bytes read from the file

Step 2) Base64-encode the data (the result is still a bytes object)

import codecs
b64PDF = codecs.encode(pdfdatab, 'base64')

Step 3) convert bytes array into a string

Sb64PDF=b64PDF.decode('utf-8')

Now the string can be stored. To get it back, you just go through the reverse. Load the string data from storage into a string variable, retdata.

# retdata is a str; turn it back into Base64 bytes
bretdata = retdata.encode('utf-8')

# decode the Base64 bytes back into the original binary data
bPDFout = codecs.decode(bretdata, 'base64')

# open a new file and write the decoded data into it
with open(newPDFFile, 'wb') as datafile:
    datafile.write(bPDFout)
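On the "is there a better way?" question from the comments, one option is the standard base64 module, which does the same job a little more directly. This is a sketch rather than a definitive version; the helper names bytes_to_text and text_to_bytes are made up for illustration, and the sample bytes stand in for real file contents.

```python
import base64

def bytes_to_text(data: bytes) -> str:
    """Encode arbitrary bytes as an ASCII-only Base64 string."""
    return base64.b64encode(data).decode('ascii')

def text_to_bytes(text: str) -> bytes:
    """Turn a Base64 string back into the original bytes."""
    return base64.b64decode(text)

# Hypothetical stand-in for the bytes read from a PDF file.
sample = b'%PDF-1.3\n%\xc4\xe5\xf2\xe5'

stored = bytes_to_text(sample)      # safe to hand to a string-only API
restored = text_to_bytes(stored)    # identical to the original bytes

print(restored == sample)  # True
```

One difference from codecs.encode(data, 'base64'): that codec inserts a newline after every 76 characters of output, while base64.b64encode returns a single unbroken line.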

1 Comment

I'm having the same issue, but even this solution didn't work. The output PDF file couldn't be opened, and when I open it in Notepad, it shows some text in some other language. Can you suggest some other workaround?