While trying to run a string through PHP's htmlentities function, I have some cases where I get an 'Invalid Multibyte Sequence' error. Is there a way to clean the string prior to calling the function to prevent this error from occurring?
7 Answers
As of PHP 5.4, you should use something along the lines of the following to properly escape output:
$escapedString = htmlspecialchars($string, ENT_QUOTES | ENT_SUBSTITUTE | ENT_DISALLOWED | ENT_HTML5, $stringEncoding);
ENT_SUBSTITUTE replaces invalid code unit sequences with � (instead of returning an empty string).
ENT_DISALLOWED replaces code points that are invalid in the specified doctype with �.
ENT_HTML5 specifies the used doctype. Depending on what you are using you may choose ENT_HTML401, ENT_XHTML or ENT_XML1.
Using those options, you make sure that the result is always valid in the given doctype, regardless of how malformed the input is.
Also, don't forget to specify the $stringEncoding. Relying on the default is a bad idea as it depends on ini settings and may (and did) change between versions.
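As a rough illustration of the flags above, the sketch below passes a deliberately malformed UTF-8 string through htmlspecialchars with and without ENT_SUBSTITUTE. The helper name `escapeHtml` and the sample input are made up for this example, and the type declarations assume PHP 7.0 or later.

```php
<?php
// Hypothetical helper (not part of PHP): wraps htmlspecialchars
// with the flags discussed in this answer.
function escapeHtml(string $string, string $encoding = 'UTF-8'): string
{
    return htmlspecialchars(
        $string,
        ENT_QUOTES | ENT_SUBSTITUTE | ENT_DISALLOWED | ENT_HTML5,
        $encoding
    );
}

// "\xE9" is "é" in ISO-8859-1, but on its own it is not valid UTF-8.
$broken = "caf\xE9 <b>";

// Without ENT_SUBSTITUTE (or ENT_IGNORE) the whole result is an empty string.
$plain = htmlspecialchars($broken, ENT_QUOTES, 'UTF-8');

// With ENT_SUBSTITUTE the invalid byte becomes U+FFFD and the rest is escaped as usual.
$safe = escapeHtml($broken);

echo $safe, "\n";
```

The explicit `$encoding` parameter keeps the helper independent of ini settings, which is the point made above about not relying on the default.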
3 Comments
ENT_HTML5 is redundant for htmlspecialchars. See stackoverflow.com/a/14532168/427545
ENT_HTML5 is not redundant, especially when ENT_DISALLOWED is used. It will replace code points that are invalid in the HTML5 doctype with the Unicode Replacement Character. E.g. see this example: codepad.viper-7.com/q5bPMQ. The ENT_HTML5 | ENT_DISALLOWED makes sure that the output does not contain any invalid code points.
I've encountered scenarios where it's not enough to specify UTF-8 and found the ENT_IGNORE option useful. I don't think it's documented for htmlentities, only for htmlspecialchars, but it does work in stifling the error.
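To make the ENT_IGNORE behaviour concrete, here is a small sketch comparing it with ENT_SUBSTITUTE on the same malformed input. The sample string is invented for illustration. Keep in mind that the PHP manual discourages ENT_IGNORE because silently dropping bytes may have security implications.

```php
<?php
// "\xE9" is not valid UTF-8 on its own (it is "é" in ISO-8859-1).
$input = "caf\xE9 au lait";

// ENT_IGNORE silently discards the invalid byte.
$ignored = htmlentities($input, ENT_QUOTES | ENT_IGNORE, 'UTF-8');

// ENT_SUBSTITUTE leaves a visible marker, U+FFFD (bytes EF BF BD in UTF-8),
// at the position of the invalid byte.
$substituted = htmlentities($input, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

echo $ignored, "\n";
echo $substituted, "\n";
```

Both calls avoid the empty result, but only the second one leaves a trace that the input was damaged.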
3 Comments
substr() on the string, which produced invalid UTF-8. Using mb_substr() instead fixed my issue. ENT_IGNORE would have worked as well, but it is not a clean solution.
Before PHP 5.4.0, the default charset for htmlentities() was ISO-8859-1. (Manual)
You are probably applying it to a UTF-8 string. Specify the character set using
htmlentities($string, ENT_QUOTES, "UTF-8"); // second argument: whichever flags you need
Since PHP 5.4.0, the default charset is UTF-8.
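A minimal sketch of what a charset mismatch does in each direction, using made-up sample strings. Bytes are written as hex escapes so the example does not depend on how the source file itself is encoded.

```php
<?php
// "café" encoded as UTF-8: "é" is the two bytes 0xC3 0xA9.
$utf8 = "caf\xC3\xA9";

// Read as ISO-8859-1, each byte counts as its own character ("Ã" and "©").
$wrong = htmlentities($utf8, ENT_QUOTES, 'ISO-8859-1');

// Read as UTF-8, the two bytes form a single "é".
$right = htmlentities($utf8, ENT_QUOTES, 'UTF-8');

// The opposite mismatch: "café" encoded as ISO-8859-1, read as UTF-8.
// The lone 0xE9 byte is an invalid multibyte sequence, so the result is empty.
$latin1 = "caf\xE9";
$empty = htmlentities($latin1, ENT_QUOTES, 'UTF-8');

echo $wrong, "\n"; // caf&Atilde;&copy;
echo $right, "\n"; // caf&eacute;
var_dump($empty === '');
```

In other words, the charset argument has to describe the encoding the string is actually in, not the encoding you would like it to be in.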
In general, the PHP ini setting display_errors controls whether errors are output to the browser, and the ini setting log_errors independently controls whether errors are written to the log file. If a custom error handler has been set with set_error_handler(), it is always called for all errors and can then read the values of display_errors and log_errors, along with the value of error_reporting(), and take the appropriate course of action, right?
Wrong! In this case, htmlspecialchars() and htmlentities() only trigger the error if the value of display_errors is false. If the value of display_errors is true then no error is triggered at all! This seemingly nonsensical behaviour makes it impossible to detect these errors during debugging with display_errors on.
1 Comment
htmlentities($variable, ENT_QUOTES); always works just fine for me.
1 Comment
Note that working with UTF-8 may require the multibyte string (mbstring) functions. This could mean replacing functions like substr with mb_substr; alternatively, PHP provides an ini setting (mbstring.func_overload) that overloads those functions with their mb equivalents. That setting was deprecated in PHP 7.2 and removed in PHP 8.0.
See here for more detail: http://www.php.net/manual/en/mbstring.overload.php
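For illustration, a short sketch of how substr can produce the invalid sequence in the first place, and how mb_substr avoids it. It assumes the mbstring extension is loaded; the sample string is made up.

```php
<?php
// "naïve" in UTF-8: "ï" is the two bytes 0xC3 0xAF, so 6 bytes but 5 characters.
$text = "na\xC3\xAFve";

// substr counts bytes: taking 3 bytes cuts "ï" in half.
$byteCut = substr($text, 0, 3);

// mb_substr counts characters: taking 3 characters keeps "ï" whole.
$charCut = mb_substr($text, 0, 3, 'UTF-8');

var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($charCut, 'UTF-8')); // bool(true)

// The truncated string is what makes htmlspecialchars/htmlentities fail.
var_dump(htmlspecialchars($byteCut, ENT_QUOTES, 'UTF-8') === ''); // bool(true)
```

Calling mb_check_encoding before escaping is one way to detect such strings early.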