There are multiple threads about converting XML to JSON in PHP, and I already have the following code, which works pretty well:

function jsonPrepareXml(object $domNode): void
{
    foreach ($domNode->childNodes as $node) {
        if ($node->hasChildNodes()) {
            jsonPrepareXml($node);
        } else {
            if ($domNode->hasAttributes() && strlen($domNode->nodeValue) !== 0) {
                $domNode->setAttribute("nodeValue", $node->textContent);
                $node->nodeValue = "";
            }
        }
    }
}

$dom = new \DOMDocument();
$dom->loadXML(FileHelpers::fileGetContents($file), LIBXML_NOCDATA);
jsonPrepareXml($dom);
$xmlData = $dom->saveXML();

$sxml = \simplexml_load_string($xmlData);
$json = \json_decode(
    \json_encode($sxml, JSON_THROW_ON_ERROR),
    null,
    512,
    JSON_THROW_ON_ERROR
);

Now I have run into the problem that in some XML files, text inside CDATA sections gets truncated. I could not find out what those files have in common, and the truncation does not always happen after the same number of characters. When I copied only the CDATA section into an otherwise empty XML file for debugging, the whole text was read. So I tried removing the LIBXML_NOCDATA constant, because libxml reads the whole text when it parses it as CDATA. But then the conversion to JSON fails, because the CDATA nodes are not converted. My next idea was to convert CDATA nodes to normal text nodes in the jsonPrepareXml() function, like this:

elseif ($node instanceof \DOMCdataSection) {
    $node = new \DOMText((string) $node->nodeValue);
}

But this does not change anything.

Are there any ideas on how to fix this issue? Of course, it would be great if the original function worked, but I was not able to fix it, even with different PHP or libxml versions, so I gave up on that. Currently, I'm on PHP 8.0.11.

Update: Until now I could not publish an XML file that triggers the error, because the files contained a lot of personal data. But now I have one XML file that shows the error quite nicely: https://drive.google.com/file/d/10iyiH1O6oKG9Zbv91He1_KlCQlhdeZoO/view?usp=sharing If I load the file with the following code, the output ends with 'Majapahit Empire, the city' at day 4.

<?php declare(strict_types=1);

$dom = new \DOMDocument();
$dom->loadXML(FileHelpers::fileGetContents($file), LIBXML_NOCDATA);

header("Content-type: text/plain");
echo $dom->saveXML();

So this happens even without my function that prepares the attributes for the JSON conversion. As stated, I can remove LIBXML_NOCDATA, but then I get empty nodes when converting to JSON.

So I am looking for a fix, or at least a workaround that converts all the CDATA nodes into normal text nodes.

The main issue really is the CDATA nodes and not the jsonPrepareXml function. I just wanted to use that function for the workaround.

  • To reproduce your problem, I would need to get my hands on an XML file, which you describe as: "... in some XML-Files, text that is in CData sections, is truncated in some cases.". How would I go about obtaining such a rare thing? Commented Jun 6, 2022 at 7:22
  • If what you tell us is correct, it looks like a bug in the XML-to-JSON converter that you are using, and it should be possible to work around it by first converting the XML to get rid of the CDATA sections. Commented Jun 6, 2022 at 8:01
  • The if/else clause in the function looks fishy: if you process each child anyway, why do you process the parent again and again whenever a child does not have children? Move the else block before processing the children, then foreach over the child nodes, drop the if entirely, and always process each child node even if it has no children at all (it won't hurt). Commented Jun 6, 2022 at 9:29
  • Thank you for your replies. I just added an XML file to reproduce the issue. The jsonPrepareXml function has worked so far and is not causing the issue. I just wanted to use it for the workaround to convert the CDATA nodes. Commented Jun 7, 2022 at 13:21

1 Answer

I have no idea whether this solves your CDATA/XML issue, but as I commented, the function looked fishy to me. Here is my version of the algorithm:

function jsonPrepareNode(DOMNode $node): void
{
    if ($node->hasAttributes() && strlen($node->nodeValue) !== 0) {
        $node->setAttribute("nodeValue", $node->textContent);
        $node->nodeValue = "";
    }

    foreach ($node->childNodes as $child) {
        jsonPrepareNode($child);
    }
}

If it does not yet fully solve your issue, read on for more options:


For more controlled JSON encoding of XML, including with SimpleXML, I've written a blog-post series that deals with common problem cases and shows how you can implement your own XML-to-JSON style in PHP.

As you use both DOMDocument and SimpleXML, using only SimpleXML might match your needs, too.
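To illustrate the SimpleXML-only route, here is a minimal sketch, not a definitive implementation. It relies on the fact that casting a SimpleXMLElement to string returns its direct text content including CDATA sections, so LIBXML_NOCDATA is not needed. The function name `elementToArray` is hypothetical, and the output shape (an `@attributes` key plus a `nodeValue` key, with children always collected in lists) is an assumption modelled on the question's code, not something json_encode produces by itself.

```php
<?php
// Hypothetical helper: turn a SimpleXMLElement into a plain array.
function elementToArray(SimpleXMLElement $element): array
{
    $result = [];

    // Copy attributes as plain strings.
    foreach ($element->attributes() as $name => $value) {
        $result['@attributes'][$name] = (string) $value;
    }

    // The string cast covers normal text and CDATA sections alike.
    $text = trim((string) $element);
    if ($text !== '') {
        $result['nodeValue'] = $text;
    }

    // Child elements are always collected in a list, keyed by tag name.
    foreach ($element->children() as $name => $child) {
        $result[$name][] = elementToArray($child);
    }

    return $result;
}

// Small inline document instead of the real file (assumption).
$xml  = '<root><day n="4"><![CDATA[a <b> c]]></day><day n="5">plain</day></root>';
$data = elementToArray(simplexml_load_string($xml));
$json = json_encode($data, JSON_THROW_ON_ERROR);
```

Always wrapping children in a list avoids the usual surprise where a single child becomes an object and repeated children become an array.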

As the later encoding examples in particular show how to integrate with the JsonSerializable interface, it would alternatively be possible to do this with DOMDocument and your own node class(es); compare DOMBlaze, see ref.
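As for the workaround the question asks about, converting CDATA nodes into normal text nodes: assigning a new DOMText to the loop variable only rebinds that variable and leaves the tree untouched, so the node has to be swapped in the document with replaceChild(). Below is a minimal sketch, assuming the document is loaded without LIBXML_NOCDATA. The function name `cdataToText` is hypothetical, and I have not verified it against the linked file.

```php
<?php
// Hypothetical helper: replace every CDATA section below $node with a text node.
function cdataToText(DOMNode $node): void
{
    $doc = $node instanceof DOMDocument ? $node : $node->ownerDocument;

    // Copy the live child list first, because replaceChild() mutates it.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($child instanceof DOMCdataSection) {
            // Swap the node inside the tree; escaping happens on serialization.
            $node->replaceChild($doc->createTextNode($child->data), $child);
        } elseif ($child->hasChildNodes()) {
            cdataToText($child);
        }
    }
}

// Small inline document instead of the real file (assumption).
$dom = new DOMDocument();
$dom->loadXML('<root><day n="4"><![CDATA[Majapahit <Empire> & city]]></day></root>');
cdataToText($dom);
$xmlData = $dom->saveXML();
```

After this step the document contains only ordinary text nodes, so it could be passed on to the attribute preparation and the SimpleXML/JSON conversion as before.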
