PHP - html_entity_decode not decoding everything

Question

I am parsing an HTML page. At some point I get the text inside a div and use html_entity_decode to print that text.

The problem is that the page contains characters like this star ★ and other shapes such as ⬛︎, ◄, ◉, etc. I have checked, and these characters are not encoded on the source page; they appear there exactly as you see them here.

The page is using charset="UTF-8"

So, when I use

html_entity_decode($string, ENT_QUOTES, 'UTF-8');

The star, for example, is "decoded" to â˜

$string is being obtained by using

document.getElementById("id-of-div").innerText

I would like to decode them correctly. How do I do that in PHP?

NOTE: I have tried htmlspecialchars_decode($string, ENT_QUOTES); and it produces the same result.

1. Does the star have an equivalent HTML entity? 2. So, what does $string contain? 3. It seems like a character code issue to me.
– Marcel Korpel, Commented Jan 5, 2014 at 21:32
1. I don't have a clue. 2. in theory all the string contained in a specific div. 3. I am not sure.
– Duck, Commented Jan 5, 2014 at 21:34
"I have checked and these characters are not encoded on the source page ... I would like to decode them correctly." If they're not encoded, how exactly do you expect to decode them? html_entity_decode is purely about converting entities of the form &something; (including numeric values of something) to "real" characters. What you have here looks like a UTF-8 string which you're then echoing in a non-UTF-8 context.
– IMSoP, Commented Jan 5, 2014 at 21:36
Indeed. Part of the question is really, why are you trying to do this? If you've got some UTF-8 characters you want to print out, why are you doing html_entity_decode at all? Why not just, er, print them out? And can we see an example of the source document and your actual code?
– Matt Gibson, Commented Jan 5, 2014 at 21:39
I've just tested html_entity_decode on the characters in your question, and, as expected, it leaves them untouched. How are you creating your output, and how are you looking at it? My guess: html_entity_decode is a red herring, and you're actually outputting untouched UTF-8 characters, but your character encoding is wrong, so they get mangled on display.
– Matt Gibson, Commented Jan 5, 2014 at 21:48

Matt Gibson · Accepted Answer · 2014-01-05 21:55:41Z


I've tried to reproduce your issue with this simple bit of PHP:

<?php
  // Make sure our client knows we're sending UTF-8
  header('Content-Type: text/plain; charset=utf-8');
  $string = "The page contains characters like this star ★ or others like shapes like ⬛︎, ◄, ◉, etc. Here are some entities: This is a &quot;test&quot;.";
  echo 'String: ' . $string . "\n";
  echo 'Decoded: ' . html_entity_decode($string, ENT_QUOTES, 'UTF-8');

As expected, the output is:

String: The page contains characters like this star ★ or others like shapes like ⬛︎, ◄, ◉, etc. Here are some entities: This is a &quot;test&quot;.
Decoded: The page contains characters like this star ★ or others like shapes like ⬛︎, ◄, ◉, etc. Here are some entities: This is a "test".

If I change the charset in the header to iso-8859-1, I see this:

String: The page contains characters like this star â˜… or others like shapes like â¬›ï¸Ž, â—„, â—‰, etc. Here are some entities: This is a &quot;test&quot;.
Decoded: The page contains characters like this star â˜… or others like shapes like â¬›ï¸Ž, â—„, â—‰, etc. Here are some entities: This is a "test".

So, I'd say that your issue is a display issue. The "interesting" characters are being left completely untouched by html_entity_decode, as you'd expect. It's just that whatever code you've got, or whatever you're using to look at your output, is incorrectly using iso-8859-1 to display them.
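
To see where the garbled sequence comes from, here is a minimal sketch (not part of the original test) that takes the star's UTF-8 bytes and reinterprets them as Windows-1252, which is what browsers use in practice when a page is labelled iso-8859-1. It assumes PHP 7 or later for the "\u{...}" escapes and that the mbstring extension is loaded; the variable names are only illustrative.

```php
<?php
// U+2605 BLACK STAR, written as an escape so the result does not
// depend on the encoding of this source file (PHP 7+ syntax).
$star = "\u{2605}";

// In UTF-8 the star is the three bytes E2 98 85.
echo bin2hex($star), "\n"; // prints "e29885"

// html_entity_decode only touches entities such as &quot;,
// so the star passes through byte for byte.
$decoded = html_entity_decode($star . ' &quot;test&quot;', ENT_QUOTES, 'UTF-8');
echo $decoded, "\n";

// Simulate a viewer that wrongly treats those bytes as Windows-1252:
// each byte becomes its own character (E2, 98, 85), re-encoded as UTF-8.
// Assumption: mbstring is available.
$mojibake = mb_convert_encoding($star, 'UTF-8', 'Windows-1252');
echo $mojibake, "\n";
```

The last line should print the same â˜… that appears in the iso-8859-1 output above: the bytes are never altered, they are only read with the wrong character set.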



1 Comment

Duck Over a year ago

you are right! T H A N K S! I forgot to add header('Content-Type: text/html; charset=utf-8'); to the beginning of the code, so it would force UTF-8 to the output. Thanks!
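
The fix described in this comment can be sketched as follows, assuming the script outputs an HTML page. The helper name render_page and the sample text are hypothetical; the only part taken from the thread is the Content-Type header. The meta charset tag is an extra safeguard for when the page is saved or served without the header.

```php
<?php
// Hypothetical helper: wraps text in a minimal HTML page that declares UTF-8.
function render_page(string $text): string
{
    // htmlspecialchars escapes <, >, &, and quotes; with 'UTF-8' it leaves
    // characters such as the star untouched.
    $safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    return '<!DOCTYPE html><html><head><meta charset="utf-8"></head>'
         . '<body><p>' . $safe . '</p></body></html>';
}

// Declare the encoding before any output is sent.
header('Content-Type: text/html; charset=utf-8');

// Sample text: star, left-pointing triangle, fisheye, plus a character
// that does need escaping.
echo render_page("\u{2605} \u{25C4} \u{25C9} <b>");
```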
