PHP and DOM - error parsing XML that contains unescaped special characters

Question

I have this XML:

<title>My title</title>
<text>This is a text and I love it <3 </text>

When I try to parse it with DOM, I get an error because of the "<3": Warning: DOMDocument::loadXML(): StartTag: invalid element name in Entity...

Do you know how I can escape all the special characters inside the text while keeping my XML tree? The goal is to use this method: $document->loadXML($xmlContent);

Thanks a lot for your answers.

EDIT: I forgot to say that I cannot modify the XML. I receive it like that and I have to deal with it...

That is not valid XML, so you're not going to be able to parse it using loadXML. Have you tried loadHTML?
– Linda Lawton - DaImTo, Commented Feb 11, 2014 at 13:31

luis_pmb · Accepted Answer · 2014-02-11 14:33:49Z

The character "<" is reserved for markup in XML and thus cannot appear literally in text content. It has to be replaced with its predefined entity:

&lt;

http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

So the input text should be:

<title>My title</title>
<text>This is a text and I love it &lt;3 </text>
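
A minimal sketch to check this, assuming the two elements are wrapped in a hypothetical <doc> root element (the snippet in the question has no single root, which loadXML() also requires):

```php
<?php
// Assumption: <doc> is a hypothetical root element; loadXML() needs exactly one root.
$xmlContent = '<doc><title>My title</title>'
    . '<text>This is a text and I love it &lt;3 </text></doc>';

$document = new DOMDocument();
$loaded = $document->loadXML($xmlContent);

// The parser decodes &lt; back to "<" in the node's text content.
$text = $document->getElementsByTagName('text')->item(0)->textContent;
echo $text, PHP_EOL;
```

The escaped form only exists in the serialized document; code that reads the node gets the original "<3" back.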

XML built like that should be rejected, and whoever sends it should replace the reserved characters with the predefined entities. Doing that with tools like htmlentities() and htmlspecialchars(), as Y U NO WORK suggests, is easy and straightforward when it is applied to each text value before the value is placed in the document.
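
As a rough sketch of what the sending side could do, assuming it builds the document by string concatenation: xmlText() is a hypothetical helper name, and <doc> is a hypothetical root element, neither comes from the question.

```php
<?php
// Hypothetical helper for the sending side: escape a value before placing it in XML text.
function xmlText(string $value): string
{
    // ENT_XML1 keeps the output to entities that XML predefines;
    // ENT_QUOTES also escapes both quote characters.
    return htmlspecialchars($value, ENT_XML1 | ENT_QUOTES, 'UTF-8');
}

// Only the values are escaped, never the surrounding tags.
$xmlContent = '<doc><title>' . xmlText('My title') . '</title>'
    . '<text>' . xmlText('This is a text and I love it <3 ') . '</text></doc>';

echo $xmlContent, PHP_EOL;
```

htmlspecialchars() is probably the safer of the two functions here: htmlentities() with its default HTML flags can emit named entities such as &eacute;, which XML does not predefine.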

Now, if you really need to parse said data, you need to sanitize it prior to parsing. This is not a recommended behaviour, particularly if you are receiving arbitrary text, but if it is a set of known or predictable characters, regular expressions can do the job.

This one, in particular, will escape a single "<" contained in a "text" element made up of letters, digits or spaces:

$xmlContent = preg_replace('/(<text>[a-zA-Z 0-9]*)<([a-zA-Z 0-9]*<\/text>)/', '$1&lt;$2', $xmlContent);

It is deliberately very specific: regular expressions are really bad at matching nested structures, such as HTML or XML. Applying more general regular expressions to HTML or XML can have wildly unexpected results.
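
Put together, the sanitize-then-parse approach could look like the sketch below. It uses a variant of the pattern in which the "<" is mandatory, so "text" elements without a stray "<" are left alone. The <doc> root element is a hypothetical addition for the example, and anything the pattern does not anticipate is still rejected by the parser.

```php
<?php
// Received as-is; <doc> is a hypothetical root element added for the example.
$xmlContent = '<doc><title>My title</title>'
    . '<text>This is a text and I love it <3 </text></doc>';

// Escape one stray "<" inside <text>...</text>; the "<" is mandatory here,
// so elements without a stray "<" are not modified.
$pattern = '/(<text>[a-zA-Z 0-9]*)<([a-zA-Z 0-9]*<\/text>)/';
$sanitized = preg_replace($pattern, '$1&lt;$2', $xmlContent);

libxml_use_internal_errors(true); // collect parse errors instead of emitting warnings
$document = new DOMDocument();
if (!$document->loadXML($sanitized)) {
    // Input the pattern did not fix still fails here and can be reported or rejected.
    foreach (libxml_get_errors() as $error) {
        echo trim($error->message), PHP_EOL;
    }
    libxml_clear_errors();
}
```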

Realitätsverlust · Accepted Answer · 2014-02-11 13:15:14Z

XML requires every element name to start with a letter (or an underscore), so the parser reads "<3" as the start of a tag named "3", which is not allowed.

A workaround for this could be htmlentities() or htmlspecialchars(). But even that won't add a valid character at the beginning of the name, so you should think about either:

Manually adding a letter in front of the tag name with an if check
Reworking your XML so nothing like that can ever happen

Juan de Parras · Accepted Answer · 2014-02-11 13:20:06Z

You need to put the content with special characters inside a CDATA section:

<text><![CDATA[This is a text and I love it <3 ]]></text>
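
On the producing side, a CDATA section like this can be generated with DOM rather than written by hand. The following is only a sketch, and the <doc> root element is a hypothetical addition:

```php
<?php
$document = new DOMDocument('1.0', 'UTF-8');
$root = $document->appendChild($document->createElement('doc')); // hypothetical root
$root->appendChild($document->createElement('title', 'My title'));

$text = $root->appendChild($document->createElement('text'));
// createCDATASection() wraps the raw string in <![CDATA[ ... ]]> on output.
$text->appendChild($document->createCDATASection('This is a text and I love it <3 '));

$xmlContent = $document->saveXML();

// Round trip: the consumer reads the original text back.
$parsed = new DOMDocument();
$parsed->loadXML($xmlContent);
$readBack = $parsed->getElementsByTagName('text')->item(0)->textContent;
echo $readBack, PHP_EOL;
```

Note that a CDATA section cannot itself contain the sequence "]]>", so text containing it would still need special handling.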

If you can't modify the XML, you can't parse it, because it is invalid XML.
– Juan de Parras
