1

I have an XML document:

<title>My title</title>
<text>This is a text and I love it <3 </text>

When I try to parse it with DOM, I get an error because of the "<3": Warning: DOMDocument::loadXML(): StartTag: invalid element name in Entity...

How can I escape the special characters inside the elements while keeping my XML tree? The goal is to be able to call: $document->loadXML($xmlContent);

Thanks a lot for your answers.

EDIT: I forgot to say that I cannot modify the XML. I receive it like that and I have to deal with it.

1
  • That is not valid XML, so you're not going to be able to parse it using loadXML. Have you tried loadHTML? Commented Feb 11, 2014 at 13:31

3 Answers

2

The character "<" is reserved in XML because it marks the start of a tag, so it cannot appear literally in text content. It should be replaced with the predefined entity:

&lt;

http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

So the input text should be:

<title>My title</title>
<text>This is a text and I love it &lt;3 </text>

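An XML document built like that should be rejected, and whoever sends it should replace the reserved characters with the predefined entities. Doing so with tools like htmlentities() and htmlspecialchars(), as Y U NO WORK suggests, is easy and straightforward (for XML, htmlspecialchars() is the safer of the two, since htmlentities() can emit HTML-only named entities such as &eacute; that XML does not define).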
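As a minimal sketch of what the sender could do, the text can be escaped before it is placed into the XML. The helper name xmlText() and the <root> wrapper element are assumptions made for illustration (the snippet in the question has no single root element, which loadXML() requires).

```php
<?php
// Hypothetical producer-side helper: escape text before embedding it in XML.
function xmlText(string $text): string
{
    // ENT_XML1 keeps the output to the entities XML itself predefines.
    return htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8');
}

// <root> is an assumed wrapper so the document has a single root element.
$xmlContent = '<root><title>My title</title><text>'
    . xmlText('This is a text and I love it <3 ')
    . '</text></root>';

$document = new DOMDocument();
$document->loadXML($xmlContent);

// The parser turns &lt; back into "<" when the text is read.
echo $document->getElementsByTagName('text')->item(0)->textContent;
```

The escaping only has to happen once, at the point where raw text becomes XML; the consumer then needs no special handling.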

Now, if you really need to parse said data, you need to sanitize it prior to parsing. This is not a recommended behaviour, particularly if you are receiving arbitrary text, but if it is a set of known or predictable characters, regular expressions can do the job.

This one, in particular, will escape a single "<" contained in a "text" element whose content is otherwise made up of letters, digits or spaces:

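// The "<" is required here; with an optional "[<]?" the pattern would also insert "&lt;" into elements that contain no "<" at all.
$xmlContent = preg_replace('/(<text>[a-zA-Z 0-9]*)<([a-zA-Z 0-9]*<\/text>)/', '$1&lt;$2', $xmlContent);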

It is very specific, but it is done on purpose: regular expressions are really bad at matching nested structures, such as HTML or XML. Applying more arbitrary regular expressions to HTML or XML can have wildly unexpected behaviours.
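If the only thing you can change is your own code, the sanitize-before-parsing idea can be sketched a little more generally. This is not the pattern above but a different technique: preg_replace_callback() combined with htmlspecialchars(). It rests on loud assumptions: the "text" elements never contain child elements or a literal "</text>" in their content, and the fragment is wrapped in a single root element (here called <root>). The function name escapeTextElements() is hypothetical.

```php
<?php
// Hypothetical helper: escape the body of every <text> element.
// Assumes <text> holds plain text only (no nested tags, no literal "</text>").
function escapeTextElements(string $xml): string
{
    $result = preg_replace_callback(
        '/<text>(.*?)<\/text>/s',
        function (array $m): string {
            // double_encode = false leaves already-escaped entities such as &lt; untouched.
            $body = htmlspecialchars($m[1], ENT_XML1 | ENT_NOQUOTES, 'UTF-8', false);
            return '<text>' . $body . '</text>';
        },
        $xml
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $xml;
}

// <root> is an assumed wrapper; loadXML() needs a single root element.
$xmlContent = '<root><title>My title</title><text>This is a text and I love it <3 </text></root>';

$document = new DOMDocument();
$loaded = $document->loadXML(escapeTextElements($xmlContent));
```

Any element that legitimately contains markup would be flattened into text by this approach, which is why it only makes sense for elements you know to be plain text.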

0

XML requires every element name to start with a letter (or an underscore or colon), nothing else is allowed, so the parser reads "<3" as the start of a tag with an invalid name.

A workaround for this could be htmlentities() or htmlspecialchars(). But even that won't add a valid character to the beginning, so you should think about either:

  1. Manually adding a letter in front of the name with an if check
  2. Reworking your XML so nothing like that can ever happen.

0

You need to put the content containing special characters inside a CDATA section:

<text><![CDATA[This is a text and I love it <3 ]]></text>
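For whoever generates the XML, a minimal sketch of producing such a CDATA section with DOM and reading it back could look like this. The <root> wrapper element and the variable names are assumptions for illustration.

```php
<?php
// Build the document with DOM so the CDATA section is written correctly.
$document = new DOMDocument('1.0', 'UTF-8');
$root = $document->appendChild($document->createElement('root')); // assumed wrapper
$text = $root->appendChild($document->createElement('text'));
$text->appendChild($document->createCDATASection('This is a text and I love it <3 '));

// Serialize only the root element (no XML declaration).
$xmlContent = $document->saveXML($root);

// Parsing it again works, and the CDATA content comes back as plain text.
$parsed = new DOMDocument();
$parsed->loadXML($xmlContent);
echo $parsed->getElementsByTagName('text')->item(0)->textContent;
```

Note that a CDATA section cannot itself contain the sequence "]]>", so arbitrary text still needs a check for that one case.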

1 Comment

If you can't modify the XML, you can't parse invalid XML.
