When I convert the hexadecimal representation of a PDF file back to binary, the resulting file is corrupted.
This is the beginning of the hex content of a simple PDF file I want to convert:

0x255044462D312E370D0A25B5B5B5B50D0A312030206F626A0D0A3C3C2F547970652F436174

Full string: jsfiddle, pastebin

This question is a continuation of this question, where I said that I have to do a data migration between two programs that handle files differently. The source program stores the files hex encoded in the database.

I could successfully extract and convert text files to binary files with the following code:

file_put_contents(
    'document.pdf',
    // Strip the "0x" prefix, then decode the hex digits into raw bytes
    hex2bin(str_replace('0x', '', $hexPdfString))
);

But when I run this code on a PDF or any other binary file, the output file is corrupted.
My question is pretty much the same as this one but discussion over there was unfortunately discontinued.

1 Answer

The result of hex decoding your string is corrupted because your string is incomplete: it contains only the first 65535 characters. After hex decoding, one can see that the PDF is cut off inside a metadata stream:

20 0 obj
<</Type/Metadata/Subtype/XML/Length 3064>>
stream
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?><x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="3.1-701">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""  xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
<pdf:Producer>Microsoft® Word 2019</pdf:Producer></rdf:Description>
<rdf:Description rdf:about=""  xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:creator><rdf:Seq><rdf:li>Samuel Gfeller</rdf:li></rdf:Seq></dc:creator></rdf:Description>
<rdf:Description rdf:about=""  xmlns:xmp="http://ns.adobe.com/xap/1.0/">
<xmp:CreatorTool>Microsoft® Word 2019</xmp:CreatorTool><xmp:CreateDate>2021-06-17T13:00:19+02:00</xmp:CreateDate><xmp:ModifyDate>2021-06-17T13:00:19+02:00</xmp:ModifyDate></rdf:Description>
<rdf:Description rdf:about=""  xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
<xmpMM:DocumentID>uuid:C29344F5-3E78-414A-B4E3-775A853B1A0C</xmpMM:DocumentID><xmpMM:InstanceID>uuid:C29344F5-3E78-414A-B4E3-775A853B1A0C</xmpMM:InstanceID></rdf:Description>
                                                                                                    
                                                                                                    
                                                                          

The length 65535 is of course special: it is 0xFFFF, the largest value an unsigned 16-bit integer can hold. Apparently some mechanism you used to retrieve that string could not handle strings longer than 65535 characters. Thus, you have to investigate the source of that string.
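To spot this kind of truncation before writing a broken file, the hex string can be checked first. The following is only a minimal sketch: `checkHexPdf()` is a hypothetical helper name, and the 1024-byte window used to look for the `%%EOF` marker is an arbitrary assumption, not something the PDF format prescribes.

```php
<?php
/**
 * Decode a hex string (optionally prefixed with "0x") and list the
 * problems found. checkHexPdf() is a hypothetical helper, not a PHP built-in.
 *
 * @return string[] Human-readable problems; empty if none were found.
 */
function checkHexPdf(string $hex): array
{
    $problems = [];

    // Strip only a leading "0x" prefix.
    if (strncasecmp($hex, '0x', 2) === 0) {
        $hex = substr($hex, 2);
    }

    if ($hex === '' || !ctype_xdigit($hex)) {
        return ['input is empty or contains non-hex characters'];
    }

    // Two hex digits encode one byte, so an odd count means a digit is missing.
    if (strlen($hex) % 2 !== 0) {
        $problems[] = 'odd number of hex digits';
        $hex = substr($hex, 0, -1); // drop the dangling digit so hex2bin() accepts the rest
    }

    $binary = hex2bin($hex);

    // A PDF starts with "%PDF-" ...
    if (strncmp($binary, '%PDF-', 5) !== 0) {
        $problems[] = 'no %PDF- header';
    }
    // ... and a complete one has an "%%EOF" marker near its end
    // (assumption: searching the last 1024 bytes is enough).
    if (strpos(substr($binary, -1024), '%%EOF') === false) {
        $problems[] = 'no %%EOF marker near the end';
    }

    return $problems;
}
```

Called on the excerpt from the question, `checkHexPdf($hexPdfString)` would report the missing `%%EOF` marker, which is what one would expect for a string that was cut off partway through the file.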

Given the earlier question that this one continues, I'd assume that either the field in the MS SQL database you retrieve the data from is limited to 65535 bytes, or your database retrieval code truncates the value.

In the former case there'd be nothing you can do, the database contents simply would be incomplete. In the latter case you'd simply have to enable your database access code to handle long strings.
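One way to tell the two cases apart is to compare the size the database itself reports with the length of the string your code receives. This is a sketch under loudly stated assumptions: the column is assumed to be a binary type (such as `varbinary` or `image`) whose size you obtain separately, for example with T-SQL's `DATALENGTH()`, and the retrieved value is assumed to be its hex rendering with two digits per byte. `isTruncated()` is a hypothetical helper name.

```php
<?php
/**
 * Compare the stored size with the retrieved hex string.
 * isTruncated() is a hypothetical helper, not a library function.
 *
 * @param int    $storedBytes  Byte count reported by the database, e.g. from
 *                             SELECT DATALENGTH(...) (assumption: binary column).
 * @param string $retrievedHex Hex string received by the client, with or
 *                             without a leading "0x".
 */
function isTruncated(int $storedBytes, string $retrievedHex): bool
{
    $digits = strlen($retrievedHex);

    // Do not count the "0x" prefix as payload.
    if (strncasecmp($retrievedHex, '0x', 2) === 0) {
        $digits -= 2;
    }

    // Each stored byte should arrive as two hex digits.
    return $digits < 2 * $storedBytes;
}
```

If this returns `true`, the database holds more data than arrived, so the retrieval code is doing the cutting. If it returns `false` and the file is still incomplete, the stored value itself is likely incomplete.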

6 Comments

Hey, I have the same issue again, but I can't share the binary code publicly. Would you mind giving me a way (discord, email or other) to send you the binary PDF content?
I sent an email with a few examples.
I cannot recognize any PDF structure in the example hex strings in your mail. Nor can I recognize the structure of any other common file formats (I don't have a deep knowledge in that regard, though). It might be garbage data, or it might be encrypted data. Either way, I cannot help. One observation though, file_40.pdf and file_45.pdf start with identical 50+ bytes, whatever that may mean...
@mkl - Did you solve the issue? I have a similar problem: two SQL IMAGE datatype columns with hex data. The first column has 65535 characters (similar to your case) and the second column is variable in length. How can I get the PDF out of this?
@mkl - I resolved my issue. I thought they were all PDF files, but it turns out the PDFs are converted into .RAR files and saved into the image column. So when I export, I use .RAR as the query-out file type, and the PDF files can then be extracted from the exported RAR archives using WinRAR.