Converting hex string to binary file makes it corrupt and unable to open

Question

When I convert a hex-encoded PDF file back to binary, the resulting file is corrupted.
This is part of the hex content of a simple PDF file I want to convert:

0x255044462D312E370D0A25B5B5B5B50D0A312030206F626A0D0A3C3C2F547970652F436174

Full string: jsfiddle, pastebin

This is a continuation of this question, where I explained that I have to migrate data between two programs that handle files differently. The source program stores the files hex-encoded in the database.

I could successfully extract and convert text files to binary files with the following code:

// Strip the "0x" prefix, decode the hex digits, and write the raw bytes to disk.
file_put_contents(
    'document.pdf',
    hex2bin(str_replace('0x', '', $hexPdfString))
);

But when I run this code on a PDF file or another binary file, the result is corrupted.
My question is pretty much the same as this one but discussion over there was unfortunately discontinued.
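
For reference, here is a more defensive variant of the conversion above. It is only a sketch: the function name `decodeHexDump` is made up for illustration, and it assumes the input is a single hex string with an optional `0x` prefix. Unlike the one-liner, it fails loudly on non-hex characters or an odd number of digits instead of writing a damaged file.

```php
<?php
declare(strict_types=1);

/**
 * Decode a hex string (optionally prefixed with "0x") into raw bytes.
 * Hypothetical helper: throws instead of silently producing a damaged file.
 */
function decodeHexDump(string $hex): string
{
    $hex = trim($hex);

    // Strip only a leading "0x"/"0X" prefix.
    if (strncasecmp($hex, '0x', 2) === 0) {
        $hex = substr($hex, 2);
    }

    if ($hex === '' || !ctype_xdigit($hex)) {
        throw new InvalidArgumentException('Input is empty or contains non-hex characters.');
    }

    // An odd number of digits means at least half a byte is missing.
    if (strlen($hex) % 2 !== 0) {
        throw new InvalidArgumentException('Odd number of hex digits: the input is probably truncated.');
    }

    $binary = hex2bin($hex);
    if ($binary === false) {
        throw new RuntimeException('hex2bin() failed.');
    }

    return $binary;
}
```

It would be used as `file_put_contents('document.pdf', decodeHexDump($hexPdfString));`.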

mkl · Accepted Answer · 2021-06-17 13:46:03Z

1

The result of hex decoding your string is corrupted because your string is incomplete: it only contains the first 65535 characters. After hex decoding, one can see that the PDF is cut off inside a metadata stream:

20 0 obj
<</Type/Metadata/Subtype/XML/Length 3064>>
stream
<?xpacket begin="ï»¿" id="W5M0MpCehiHzreSzNTczkc9d"?><x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="3.1-701">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""  xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
<pdf:Producer>Microsoft® Word 2019</pdf:Producer></rdf:Description>
<rdf:Description rdf:about=""  xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:creator><rdf:Seq><rdf:li>Samuel Gfeller</rdf:li></rdf:Seq></dc:creator></rdf:Description>
<rdf:Description rdf:about=""  xmlns:xmp="http://ns.adobe.com/xap/1.0/">
<xmp:CreatorTool>Microsoft® Word 2019</xmp:CreatorTool><xmp:CreateDate>2021-06-17T13:00:19+02:00</xmp:CreateDate><xmp:ModifyDate>2021-06-17T13:00:19+02:00</xmp:ModifyDate></rdf:Description>
<rdf:Description rdf:about=""  xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
<xmpMM:DocumentID>uuid:C29344F5-3E78-414A-B4E3-775A853B1A0C</xmpMM:DocumentID><xmpMM:InstanceID>uuid:C29344F5-3E78-414A-B4E3-775A853B1A0C</xmpMM:InstanceID></rdf:Description>

The length 65535 is of course special: it is 0xFFFF, the largest value an unsigned 16-bit integer can hold. Apparently some mechanism you used to retrieve that string could not handle strings longer than 65535 characters. Thus, you have to investigate the source of that string.
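
The checks described above can be automated for a whole batch of values. The following is a minimal sketch, not a full PDF validator: the function name `findTruncationHints` and the 1024-byte window for the `%%EOF` search are choices made here for illustration. It flags a length of exactly 0xFFFF, an odd number of hex digits, and a missing `%%EOF` marker near the end of the decoded data.

```php
<?php
declare(strict_types=1);

/**
 * Collect hints that a hex-encoded PDF was cut off before it reached PHP.
 * Hypothetical helper for illustration.
 *
 * @return string[] warnings; an empty array means nothing suspicious was found
 */
function findTruncationHints(string $hexString): array
{
    $hints = [];
    $length = strlen($hexString);

    // A length of exactly 0xFFFF suggests a 16-bit size limit was hit.
    if ($length === 0xFFFF) {
        $hints[] = "Length is exactly $length characters (0xFFFF).";
    }

    // Drop an optional leading "0x" prefix before counting digits.
    $digits = strncasecmp($hexString, '0x', 2) === 0 ? substr($hexString, 2) : $hexString;

    if (strlen($digits) % 2 !== 0) {
        $hints[] = 'Odd number of hex digits, so the last byte is incomplete.';
    }

    // Decode only whole bytes; "@" silences the warning for non-hex input.
    $wholeBytes = substr($digits, 0, strlen($digits) - strlen($digits) % 2);
    $binary = @hex2bin($wholeBytes);
    if ($binary === false) {
        $hints[] = 'Input contains non-hex characters.';
        return $hints;
    }

    // A complete PDF ends with an "%%EOF" marker; look for it near the end.
    $eofPos = strrpos($binary, '%%EOF');
    if ($eofPos === false || $eofPos < strlen($binary) - 1024) {
        $hints[] = 'No %%EOF marker in the last 1024 bytes.';
    }

    return $hints;
}
```

An empty result does not prove the file is intact; it only means none of these three symptoms showed up.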

Given the question that this one continues, I'd assume that either the field in the MS SQL database you retrieve the data from is limited to 65535 bytes, or your database retrieval code truncates the value.

In the former case there'd be nothing you can do: the database contents would simply be incomplete. In the latter case you'd have to enable your database access code to handle long strings.
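
One way to tell the two cases apart is to ask the database how many bytes it actually stores, for example with `SELECT DATALENGTH(file_column)` (the column name is hypothetical), and compare that with what arrives in PHP. The sketch below assumes the column is a binary type whose content is hex-encoded on retrieval, two characters per byte; the function name and return values are made up for illustration. If the retrieval side turns out to be at fault, the fix depends on the driver in use, which the question does not state; the T-SQL statement `SET TEXTSIZE` and the driver's text size settings are places to look.

```php
<?php
declare(strict_types=1);

/**
 * Compare the retrieved hex string with the byte count reported by the database.
 * Hypothetical helper for illustration.
 *
 * @param string $hexString        value as it arrived in PHP, with or without "0x"
 * @param int    $storedByteLength byte count reported by the database (assumed to come from DATALENGTH)
 * @return string 'retrieval' if PHP received fewer bytes than are stored, else 'storage-or-none'
 */
function locateTruncation(string $hexString, int $storedByteLength): string
{
    // Drop an optional leading "0x" prefix; two hex digits encode one byte.
    $digits = strncasecmp($hexString, '0x', 2) === 0 ? substr($hexString, 2) : $hexString;
    $retrievedBytes = intdiv(strlen($digits), 2);

    if ($retrievedBytes < $storedByteLength) {
        // The database holds more than PHP received: the access code truncates.
        return 'retrieval';
    }

    // Everything stored was received; if the file is still cut off,
    // the stored data itself is incomplete.
    return 'storage-or-none';
}
```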


6 Comments

Samuel Gfeller Over a year ago

Hey, I have the same issue again, but I can't share the binary code publicly. Would you mind giving me a way (discord, email or other) to send you the binary PDF content?

Samuel Gfeller Over a year ago

I sent an email with a few examples.

mkl Over a year ago

I cannot recognize any PDF structure in the example hex strings in your mail. Nor can I recognize the structure of any other common file formats (I don't have a deep knowledge in that regard, though). It might be garbage data, or it might be encrypted data. Either way, I cannot help. One observation though, file_40.pdf and file_45.pdf start with identical 50+ bytes, whatever that may mean...

Sai Abhiram Inapala Over a year ago

@mkl - Did you solve the issue? I have a similar problem: two SQL IMAGE datatype columns with hex data. The first column has 65535 characters (similar to your case) and the second column is variable in length. How do I get the PDF out of this?

Sai Abhiram Inapala Over a year ago

@mkl - I resolved my issue. I thought they were all PDF files, but it turns out the PDFs were converted into .RAR files and saved into the image column. So when I export, I use .RAR as the query out file type, and once the RAR files are written out, the PDF files can be extracted from them using WinRAR.
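
A mix-up like this can be caught early by looking at the first bytes of the decoded data before choosing a file extension. The following is a rough sketch: the function name `guessFileType` is hypothetical, and only a handful of well-known signatures are listed, so encrypted or unusual data will simply come back as unknown.

```php
<?php
declare(strict_types=1);

/**
 * Guess a file type from the leading bytes of decoded data.
 * Hypothetical helper; only a few well-known signatures are checked.
 */
function guessFileType(string $binary): string
{
    $signatures = [
        'pdf'  => "%PDF-",
        'rar'  => "Rar!\x1A\x07",
        'zip'  => "PK\x03\x04",
        'gzip' => "\x1F\x8B",
    ];

    foreach ($signatures as $type => $magic) {
        // Compare only as many bytes as the signature is long.
        if (strncmp($binary, $magic, strlen($magic)) === 0) {
            return $type;
        }
    }

    return 'unknown';
}
```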
