42

I'm building an XML file from scratch and need to know if htmlentities() converts every character that could potentially break an XML file (and possibly UTF-8 data)?

The values will come from a Twitter/Flickr feed, so I need to be sure.

1

5 Answers

60

htmlentities() is not a guaranteed way to build legal XML.

Use htmlspecialchars() instead of htmlentities() if this is all you are worried about. If you have encoding mismatches between the representation of your data and the encoding of your XML document, htmlentities() may serve to work around/cover them up (it will bloat your XML size in doing so). I believe it's better to get your encodings consistent and just use htmlspecialchars().

Also, be aware that if you pump the return value of htmlspecialchars() inside XML attributes delimited with single quotes, you will need to pass the ENT_QUOTES flag as well so that any single quotes in your source string are properly encoded as well. I suggest doing this anyway, as it makes your code immune to bugs resulting from someone using single quotes for XML attributes in the future.
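As a minimal sketch of the advice above, the helper below wraps htmlspecialchars() with ENT_QUOTES. The name xml_escape() is made up for this illustration, and adding ENT_XML1 and an explicit 'UTF-8' charset are assumptions on my part rather than something the answer requires. With ENT_XML1, a single quote comes out as &apos;, which is one of XML's five predefined entities.

```php
<?php
// Hypothetical helper: escape a UTF-8 string for XML text or attribute values.
// ENT_QUOTES escapes both " and ', so either attribute delimiter is safe.
function xml_escape(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

$title = "Tom & Jerry's <\"show\">";
echo "<item title='" . xml_escape($title) . "'/>", "\n";
// prints: <item title='Tom &amp; Jerry&apos;s &lt;&quot;show&quot;&gt;'/>
```

Non-ASCII characters such as Ü pass through unchanged, so the document itself has to be declared and saved as UTF-8.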

Edit: To clarify:

htmlentities() will convert a number of non-ANSI characters (I assume this is what you mean by UTF-8 data) to entities (which are represented with just ANSI characters). However, it cannot do so for any characters which do not have a corresponding entity, and so cannot guarantee that its return value consists only of ANSI characters. That's why I'm suggesting not to use it.

If encoding is a possible issue, handle it explicitly (e.g. with iconv()).
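A rough sketch of handling the encoding explicitly, assuming the incoming bytes are ISO-8859-1 and the XML document is UTF-8. Both encodings and the helper name to_utf8() are assumptions for illustration; substitute whatever your feed actually delivers.

```php
<?php
// Hypothetical helper: convert a string to UTF-8 before it is escaped.
// Assumption: the source encoding is known (ISO-8859-1 by default here).
function to_utf8(string $bytes, string $sourceEncoding = 'ISO-8859-1'): string
{
    $converted = iconv($sourceEncoding, 'UTF-8', $bytes);
    if ($converted === false) {
        throw new InvalidArgumentException("Input is not valid $sourceEncoding");
    }
    return $converted;
}

$latin1 = "caf\xE9";          // "café" encoded as ISO-8859-1 (one byte for é)
$utf8   = to_utf8($latin1);   // "café" encoded as UTF-8 (two bytes for é)
echo htmlspecialchars($utf8, ENT_QUOTES, 'UTF-8'), "\n";
```

Converting first and escaping second keeps the two concerns separate: iconv() deals with bytes, htmlspecialchars() deals with markup characters.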

Edit 2: Improved answer taking into account Josh Davis's comment below.


6 Comments

Do not use htmlentities for XML; it's intended for HTML, not XML. XML only knows the five entities amp, lt, gt, apos and quot, but htmlentities will use a lot more (those that are registered for HTML).
Thanks for the thorough explanation and the note on using ENT_QUOTES!
The statement "it will make your XML guaranteed legal" couldn't be more wrong though. As mentioned above, htmlentities() uses entities that are not defined in XML. In addition, it does not sanitize bytes that are not supposed to appear in an XML document, such as the NUL byte. It doesn't sanitize invalid UTF-8 either, so in some cases it might become impossible for XML parsers to parse the resulting document.
What about htmlspecialchars($string, ENT_XML1)?
@Meglio as of PHP 7.3.5, using ENT_QUOTES | ENT_XML1 works the same as using only ENT_QUOTES, and using only ENT_NOQUOTES works the same as using only ENT_XML1.
21

DOMDocument::createTextNode() will automatically escape your content.

Example:

$dom = new DOMDocument;
$element = $dom->createElement('Element');
$element->appendChild(
    $dom->createTextNode('I am text with Ünicödé & HTML €ntities ©'));

$dom->appendChild($element);
echo $dom->saveXml();

Output:

<?xml version="1.0"?>
<Element>I am text with &#xDC;nic&#xF6;d&#xE9; &amp; HTML &#x20AC;ntities &#xA9;</Element>

When you set the internal encoding to utf-8, e.g.

$dom->encoding = 'utf-8';

you'll still get

<?xml version="1.0" encoding="utf-8"?>
<Element>I am text with Ünicödé &amp; HTML €ntities ©</Element>

Note that the above is not the same as passing the second argument, $value, to DOMDocument::createElement(). That method only makes sure your element names are valid; it does not escape the value (see the Notes on the manual page). For example,

$dom = new DOMDocument;
$element = $dom->createElement('Element', 'I am text with Ünicödé & HTML €ntities ©');
$dom->appendChild($element);
$dom->encoding = 'utf-8';
echo $dom->saveXml();

will result in a Warning

Warning: DOMDocument::createElement(): unterminated entity reference  HTML €ntities ©

and the following output:

<?xml version="1.0" encoding="utf-8"?>
<Element>I am text with Ünicödé </Element>
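To avoid that pitfall, one option is to never pass $value to createElement() and to let the DOM escape both text and attribute values. The sketch below shows this with a made-up helper, make_element(); it assumes the source file is saved as UTF-8 and uses a round trip through loadXML() to show that the original strings come back unchanged.

```php
<?php
// Hypothetical helper: build an element whose text and attributes are
// escaped by the DOM extension rather than by hand.
function make_element(DOMDocument $dom, string $name, string $text, array $attributes = []): DOMElement
{
    $element = $dom->createElement($name);           // name only, no $value
    foreach ($attributes as $attrName => $attrValue) {
        $element->setAttribute($attrName, $attrValue);
    }
    $element->appendChild($dom->createTextNode($text));
    return $element;
}

$dom = new DOMDocument('1.0', 'UTF-8');
$dom->appendChild(
    make_element($dom, 'Element', 'Ünicödé & HTML €ntities ©', ['title' => 'Tom & "Jerry"'])
);
$xml = $dom->saveXML();
echo $xml;

// Round trip: parse the generated XML again and read the values back.
$check = new DOMDocument;
$check->loadXML($xml);
```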


16

Gordon's answer is good and explains the XML encoding problems, but it does not show a simple function (or what the black box does). Jon's answer starts well by recommending htmlspecialchars(), but he and others make some mistakes, so I will be emphatic.

A good programmer MUST be in control of whether UTF-8 is used in their strings and XML data: UTF-8 (or another non-ASCII encoding) IS SAFE in a consistent algorithm.

SAFE UTF-8 XML DOES NOT NEED FULL ENTITY ENCODING. Indiscriminate encoding produces "second-class", non-human-readable XML that demands encoding and decoding. Safe ASCII XML does not need entity encoding either, when all your content is ASCII.

Only 3 or 4 characters need to be escaped in a string of XML content: >, <, &, and optionally ". Please read http://www.w3.org/TR/REC-xml/ "2.4 Character Data and Markup" and "4.6 Predefined Entities". THEN YOU can use htmlspecialchars().

For illustration, the following PHP function makes a string safe to embed in XML:

// A didactic illustration; in real code USE htmlspecialchars($s, $flags)
function xmlsafe($s, $intoQuotes = 0) {
    if ($intoQuotes) {
        // for attribute values in double quotes; same as htmlspecialchars($s, ENT_COMPAT)
        return str_replace(array('&', '>', '<', '"'), array('&amp;', '&gt;', '&lt;', '&quot;'), $s);
    }
    // for element content; same as htmlspecialchars($s, ENT_NOQUOTES)
    return str_replace(array('&', '>', '<'), array('&amp;', '&gt;', '&lt;'), $s);
}

// example of SAFE XML CONSTRUCTION
function xmlTag($element, $attribs, $contents = NULL) {
    $out = '<' . $element;
    foreach ($attribs as $name => $val) {
        // attribute values are always written in double quotes
        $out .= ' ' . $name . '="' . xmlsafe($val, 1) . '"';
    }
    if ($contents === '' || is_null($contents)) {
        $out .= '/>';
    } else {
        $out .= '>' . xmlsafe($contents) . "</$element>";
    }
    return $out;
}

In a CDATA block you do not need this function. But please avoid the indiscriminate use of CDATA.
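One caveat if you do use CDATA: the section ends at the first "]]>", so that sequence has to be split across two sections. A minimal sketch, with a made-up helper name xml_cdata(); it does not remove characters that are illegal in XML altogether (such as NUL), which are not allowed inside CDATA either.

```php
<?php
// Hypothetical helper: wrap text in a CDATA section, splitting any "]]>"
// so that the section cannot be closed early by the content itself.
function xml_cdata(string $text): string
{
    return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $text) . ']]>';
}

$raw = 'if (a[b[0]]>1 && x < y) { ... }';   // contains "]]>", "&" and "<"
$xml = '<code>' . xml_cdata($raw) . '</code>';
echo $xml, "\n";

// Round trip through a parser to confirm the text survives unchanged.
$doc = new DOMDocument;
$doc->loadXML($xml);
```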

5 Comments

Thank you!! I tried a lot of combinations of tidy, htmlentities and htmlspecialchars, but your xmlsafe is the best (though I recommend calling html_entity_decode() first).
About my xmlsafe(): as I said, it is "for illustration", but thanks! :-) About using html_entity_decode() with XML, see some more problems and solutions at stackoverflow.com/q/18039765/287948
Really appreciated this answer simply for the example function. I think Jon's answer was best, but this one helped me more in my particular situation, so I wanted to upvote it. Thanks. (Can I upvote two answers?)
You can do better: correct my code or my English, since the text is now open as a wiki.
Please help to maintain the CDATA Criticism section on Wikipedia.
5

So your question is "is htmlentities()'s result guaranteed to be XML-compliant and UTF-8-compliant?" The answer is no, it's not.

htmlspecialchars() should be enough to escape XML's special characters, but you'll have to sanitize your UTF-8 strings either way. Even if you build your XML with, say, SimpleXML, you'll have to sanitize the strings. I don't know about other libraries such as XMLWriter or DOM; I think it's the same.
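What "sanitize" could look like in practice is sketched below: repair malformed UTF-8, then drop every code point outside the XML 1.0 "Char" production (tab, LF, CR, U+0020 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF). The helper name xml_clean() is made up, and the sketch assumes the mbstring extension is loaded; treat it as one possible approach, not the only one.

```php
<?php
// Hypothetical helper; assumes the mbstring extension is available.
function xml_clean(string $s): string
{
    // preg_match() with the /u modifier returns 1 only for valid UTF-8.
    if (preg_match('//u', $s) !== 1) {
        // Re-encoding UTF-8 to UTF-8 replaces malformed byte sequences
        // with mbstring's substitute character.
        $s = mb_convert_encoding($s, 'UTF-8', 'UTF-8');
    }
    // Remove everything that XML 1.0 does not allow in a document.
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $s
    );
    return $clean ?? '';
}

$dirty = "tweet\x00 with \x0B control chars & an ampersand";
echo htmlspecialchars(xml_clean($dirty), ENT_QUOTES, 'UTF-8'), "\n";
```

Cleaning and escaping stay separate steps here: xml_clean() removes what can never appear in XML, and htmlspecialchars() escapes what can appear only in escaped form.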


0

Thought I'd add this for those who need to sanitize & not lose the XML attributes.

// Returns SimpleXML Safe XML keeping the elements attributes as well
function sanitizeXML($xml_content, $xml_followdepth=true){

    if (preg_match_all('%<((\w+)\s?.*?)>(.+?)</\2>%si', $xml_content, $xmlElements, PREG_SET_ORDER)) {

        $xmlSafeContent = '';

        foreach($xmlElements as $xmlElem){
            $xmlSafeContent .= '<'.$xmlElem['1'].'>';
            if (preg_match('%<((\w+)\s?.*?)>(.+?)</\2>%si', $xmlElem['3'])) {
                $xmlSafeContent .= sanitizeXML($xmlElem['3'], false);
            }else{
                $xmlSafeContent .= htmlspecialchars($xmlElem['3'],ENT_NOQUOTES);
            }
            $xmlSafeContent .= '</'.$xmlElem['2'].'>';
        }

        if(!$xml_followdepth)
            return $xmlSafeContent;
        else
            return "<?xml version='1.0' encoding='UTF-8'?>".$xmlSafeContent;

    } else {
        return htmlspecialchars($xml_content,ENT_NOQUOTES);
    }

}

Usage:

$body = <<<EG
<?xml version='1.0' encoding='UTF-8'?>
<searchResult count="1">
   <item>
      <title>2016 & Au Rendez-Vous Des Enfoir&</title>
   </item>
</searchResult>
EG;
$newXml = sanitizeXML($body);
var_dump($newXml);

Returns (re-indented here for readability; the function itself drops the whitespace between elements):

<?xml version='1.0' encoding='UTF-8'?>
<searchResult count="1">
    <item>
        <title>2016 &amp; Au Rendez-Vous Des Enfoir&amp;</title>
    </item>
</searchResult>
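Whichever escaping approach you choose, it is cheap to confirm that the result actually parses. The sketch below uses libxml through DOMDocument::loadXML(); the helper name is_well_formed_xml() is made up for this illustration.

```php
<?php
// Hypothetical helper: true if libxml can parse the string as XML.
function is_well_formed_xml(string $xml): bool
{
    // Collect parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument;
    $ok = $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok;
}

var_dump(is_well_formed_xml('<title>2016 &amp; more</title>')); // bool(true)
var_dump(is_well_formed_xml('<title>2016 & more</title>'));     // bool(false)
```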
