I'm trying to read XML which has HTML inside an element. It is NOT enclosed in CDATA tags, which is the problem because any XML parser I use tries to parse it as XML.

The point in the XML where it dies:

<item>
  <title>Title text <img src="https://abs.twimg.com/emoji/v1/72x72/1f525.png" draggable="false" alt="🔥" aria-label="Emoji: Fire"></title>
</item>

Error message:

Warning: XMLReader::readOuterXml(): (xml file here) parser error : Opening and ending tag mismatch: img line 1 and title in (php file here)

I know how to get HTML out of an XML element, but the parser chokes on the open <img> tag: it can't find a matching closing tag, so it dies and I can't get any further.

I don't actually need the <title> element, so a way to ignore it would work. The information I need is in only two other child nodes of the <item> parent.

If anyone can see a workaround to this, that would be great.

Update

Using Christian Gollhardt's suggestions, I've managed to load the XML into an object but I get the same problem I did before where I have issues getting the CDATA from the <description> element.

This is the CDATA I should get:

<description>
 <![CDATA[<a href="https://twitter.com/menomatters" >@menomatters</a> <a href="https://twitter.com/physicool1" >@physicool1</a> will chill my own &quot;personal summer&quot;. <img src="https://abs.twimg.com/emoji/v1/72x72/1f525.png" draggable="false" alt="🔥" aria-label="Emoji: Fire"><img src="https://abs.twimg.com/emoji/v1/72x72/2600.png" draggable="false" alt="☀️" aria-label="Emoji: Black sun with rays">]]>
</description>

This is what I end up with:

["description"]=> string(54) "@menomatters will chill my own "personal summer". ]]>"

Looks like an issue with closing tags again?

  • If you don't need the title element, can you preprocess the xml file to remove it, then run it through your parser? Commented Aug 14, 2014 at 15:38
  • That simply isn’t well-formed XML. Commented Aug 14, 2014 at 15:42
  • @CBroe Yes, it comes from an external source so is out of my control. Commented Aug 14, 2014 at 15:44
  • @vch How would I do this? Would I have to load the xml into a string then str_replace? Or another way? Commented Aug 14, 2014 at 15:51
  • @AshThornton Don't do that. Any solution using string replacement (i.e. regex) for XML is doomed to fail, as regular expressions are not capable of handling XML. The answer that was just deleted, pointing to php.net/manual/de/domdocument.loadhtml.php, was actually pretty good. Commented Aug 14, 2014 at 15:52

1 Answer

Take a look at DOMDocument. Its loadHTML() method uses a lenient HTML parser that tolerates unclosed tags such as <img>. You can either work with it directly, or write a function that gives you a cleaned document.


Clean Methods:

function tidyXml($xml) {
    $doc = new DOMDocument();
    // loadHTML() uses the lenient HTML parser; @ suppresses its warnings about unknown tags.
    if (@$doc->loadHTML($xml)) {
        $output = '';
        // DOMDocument wraps the input in <html><body>...</body></html>,
        // so serialize only the children of <body>.
        foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
            $output .= $doc->saveXML($child);
        }
        return $output;
    } else {
        throw new Exception('Document cannot be cleaned');
    }
}

function getSimpleXml($xml) {
    return new SimpleXMLElement(tidyXml($xml));
}
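
A side note on the @ in tidyXml: it silences every libxml warning, so you never see what the parser objected to. As a minimal sketch, assuming you would rather collect those messages than hide them, libxml_use_internal_errors() can buffer them instead. The function name collectXmlErrors is hypothetical and not part of the answer's code. It parses with the strict XML parser, so it only reports problems and does not repair anything.

```php
<?php
// Hypothetical helper: returns the libxml error messages produced while
// parsing $xml with the strict XML parser. An empty array means no errors.
function collectXmlErrors(string $xml): array
{
    // Buffer libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    $doc->loadXML($xml); // returns false on malformed input; the errors are buffered

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Restore the previous error-handling mode for the rest of the script.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

// Usage: list what the parser dislikes about the snippet from the question.
foreach (collectXmlErrors('<item><title>Title text <img src="x.png"></title></item>') as $message) {
    echo $message, PHP_EOL;
}
```

The same pattern works around loadHTML(): turn internal errors on, call loadHTML() without the @, then read libxml_get_errors().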

Implementation

$xml = '<item><title>Title text <img src="https://abs.twimg.com/emoji/v1/72x72/1f525.png" draggable="false" alt="🔥" aria-label="Emoji: Fire"></title></item>';
$myxml = getSimpleXml($xml);

$titleNodeCollection = $myxml->xpath('/item/title');

foreach ($titleNodeCollection as $titleNode) {
    $titleText    = (string)$titleNode;
    $imageUrl     = (string)$titleNode->img['src'];
    $innerContent = str_replace(['<title>', '</title>'], '', $titleNode->asXML());

    var_dump($titleText, $imageUrl, $innerContent);
}

Enjoy!
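
For the CDATA problem described in the question's update, here is a different technique from the one above, offered only as a hedged sketch: DOMDocument's recover property asks libxml's XML parser to keep building a tree after a well-formedness error, and an XML parser (unlike the HTML one) understands CDATA sections. The function name extractDescriptions is hypothetical, and the result has not been checked against the real feed. The recovered tree may nest elements differently than the source suggests (for example, <description> may end up inside <title>), which is why the sketch searches by tag name rather than by a fixed path.

```php
<?php
// Hypothetical helper: parse possibly malformed XML in libxml recovery mode
// and return the text content of every <description> element found.
function extractDescriptions(string $xml): array
{
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // Recovery mode: try to keep parsing documents that are not well-formed.
    $doc->recover = true;
    $loaded = $doc->loadXML($xml);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $descriptions = [];
    if ($loaded) {
        // Search by tag name because recovery may change where elements nest.
        foreach ($doc->getElementsByTagName('description') as $node) {
            // textContent includes the character data of CDATA sections.
            $descriptions[] = trim($node->textContent);
        }
    }

    return $descriptions;
}

// Usage with a shortened, made-up item shaped like the one in the question.
$feed = '<item><title>Title text <img src="x.png"></title>'
      . '<description><![CDATA[<a href="https://twitter.com/menomatters">@menomatters</a>]]></description></item>';
var_dump(extractDescriptions($feed));
```

If recovery mangles your real data, the tidyXml route above remains the fallback. Treat this as something to try, not as a guaranteed fix.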


7 Comments

This would work beautifully, except that is just the point in the XML where it fails. The <item> element has a few more children but it can't read the <title> child because of the un-closed <img> tag. I've loaded it into the DOM as HTML but I get the error I mentioned in my comment under my question here
I can't try this yet as I am not at my machine, but it looks as if that will pull the img tag out. I don't need that. I actually need to access the <description> element, which is another child of <item>, and skip the <title> element altogether, but my parser breaks at the img tag, which is my original problem. I think you're going in the right direction though, so I will have a play with this when I can.
@AshThornton Updated, take a look at $innerContent var.
I need to dynamically populate the $xml from the XML though. Will this work if I load the XML document as a string into it?
@AshThornton Sure, simply append $titleNode to it if the other $xml is using SimpleXML. If not, use $titleNode->asXml() and import the result into your document in your preferred way. Another option, if you're not using SimpleXML, is to implement this directly with another reader. With tidyXml it should work for every reader.