I'm trying to parse an HTML page, but the encoding is messing up my results. After some research I found a very popular solution using utf8_encode() and utf8_decode(), but it doesn't change anything. My code and its output are below.
Code
$str_html = $this->curlHelper->file_get_contents_curl($page);
$str_html = utf8_encode($str_html);
$dom = new DOMDocument();
$dom->resolveExternals = true;
$dom->substituteEntities = false;
@$dom->loadHTML($str_html);
$xpath = new DomXpath($dom);
(...)
$profile = array();
for ($index = 0; $index < $table_lines->length; $index++) {
    $desc = utf8_decode($table_lines->item($index)->firstChild->nodeValue);
}
Output
Testar é bom
Should be
Testar é bom
What I've tried
- htmlentities():
  htmlentities($table_lines->item($index)->lastChild->nodeValue, ENT_NOQUOTES, ini_get('ISO-8859-1'), false);
- htmlspecialchars():
  htmlspecialchars($table_lines->item($index)->lastChild->nodeValue, ENT_NOQUOTES, 'ISO-8859-1', false);
- Change my file's charset as described here.
Some more information
- Website encoding:
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1" />
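To isolate the problem, here is a minimal self-contained sketch of the same setup without utf8_encode() or utf8_decode(). The inline $html string and the //td query are assumptions standing in for the real page and for the XPath I left out above. As far as I understand, DOMDocument always returns nodeValue as UTF-8, whatever charset the page declares, so the hex dump should show how the é is actually encoded.

```php
<?php
// Hypothetical stand-in for the fetched page: ISO-8859-1 bytes (\xE9 is "é")
// with the same meta tag the real site declares.
$html = '<html><head>'
      . '<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1" />'
      . '</head><body><table><tr><td>'
      . "Testar \xE9 bom"
      . '</td></tr></table></body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);          // no utf8_encode() before parsing

$xpath = new DOMXPath($dom);
$cell  = $xpath->query('//td')->item(0);   // assumed query, not my real one

$desc = $cell->nodeValue;        // no utf8_decode() afterwards

// Dump the raw bytes so the terminal or browser charset cannot hide the result.
// If the value is UTF-8, the "é" should appear as the two bytes c3 a9.
echo bin2hex($desc), PHP_EOL;
```

If the bytes here are already UTF-8, the garbling would seem to come from the extra conversions or from the charset the result is displayed in, not from the parser.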
Thanks in advance!