3

I have some HTML data (which I can only read, not control) that contains a lot of Scandinavian characters (å, ä, ö, æ, ø, etc.). These "special" characters are stored as numeric character references (æ = &#230;). I need to convert them to the corresponding actual characters in PHP (or JavaScript, but I guess PHP is better here...). It seems html_entity_decode() only handles the "other" kind of entities, the named ones, where æ = &aelig;. The only solution I've come up with so far is to build a conversion table that maps each character number to a real character, but that's not really smart... So, any ideas? ;)

Cheers, Christofer

2
  • In what way isn't html_entity_decode() working for you? What are you passing as the charset parameter? Seems to work for me... Commented Sep 8, 2010 at 15:02
  • Yeah turns out it works perfectly fine... if you read the manual properly ;) Thanks! Commented Sep 9, 2010 at 10:37

4 Answers

5
&#NUMBER;

refers to the Unicode code point of that character.

So you could use a regex like:

/&#(\d+);/g

to grab the numbers. I don't know PHP, but I'm sure you can google how to turn a number into its Unicode character.

Then simply replace your regex match with the char.
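The steps above could be sketched in PHP roughly as follows. This is only a minimal sketch: the helper name decodeDecimalEntities() is made up for illustration, and it assumes PHP 7.2+ with the mbstring extension so that mb_chr() is available. It handles decimal references only, not hexadecimal or named ones.

```php
<?php
// Hypothetical helper (not from the original answer).
// Assumes PHP 7.2+ with mbstring, which provides mb_chr().
function decodeDecimalEntities(string $html): string
{
    return preg_replace_callback(
        '/&#(\d+);/',
        function (array $m): string {
            // Turn the captured number into the UTF-8 character for that code point.
            $char = mb_chr((int) $m[1], 'UTF-8');
            // mb_chr() returns false for invalid code points; keep the entity as-is then.
            return $char === false ? $m[0] : $char;
        },
        $html
    );
}

echo decodeDecimalEntities('bl&#229;b&#230;r'); // prints "blåbær"
```

Note that PHP's preg functions replace every match by default, so the JavaScript-style g flag is not needed there.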

Edit: Actually it looks like you can use this:

mb_convert_encoding('&#230;', 'UTF-8', 'HTML-ENTITIES');

2

I think html_entity_decode() should work just fine. What happens when you try:

echo html_entity_decode('&#230;', ENT_COMPAT, 'UTF-8');


0

On the PHP manual page on html_entity_decode(), it gives the following code for decoding numeric entities in versions of PHP prior to 4.3.0:

  $string = preg_replace('~&#x([0-9a-f]+);~ei', 'chr(hexdec("\\1"))', $string);
  $string = preg_replace('~&#([0-9]+);~e', 'chr("\\1")', $string);

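As someone noted in the comments there, you should probably replace chr() with a Unicode-aware equivalent to deal with non-ASCII characters. The comments call it unichr(), but that is not a built-in PHP function, so you have to define it yourself. Also note that the /e modifier used above was removed in PHP 7.0, so on current versions this would need preg_replace_callback() instead.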

However, it looks like html_entity_decode() really should deal with numeric as well as named entities. Are you specifying an appropriate charset (e.g., UTF-8)?
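As a quick check of that claim, something like the following should show decimal, hexadecimal, and named references all decoding to the same character. It is only a sketch; the sample string is made up for illustration.

```php
<?php
// Decimal, hexadecimal, and named forms of the same character.
$input = '&#230; &#xE6; &aelig;';

// Pass the charset explicitly: the default differs between PHP versions
// (ISO-8859-1 before PHP 5.4, UTF-8 from 5.4 on).
echo html_entity_decode($input, ENT_QUOTES, 'UTF-8'); // prints "æ æ æ"
```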


0

If you haven't got the luxury of having multibyte string functions installed, you can use something like this:

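<?php

    $string = 'Here is a special char &#230;';

    // Replace each decimal numeric entity with its UTF-8 character.
    // A closure is used because create_function() was removed in PHP 8.0.
    $list = preg_replace_callback('/&#([0-9]+);/', function ($matches) {
        return decode((int) $matches[1]);
    }, $string);

    echo '<p>', $string, '</p>';
    echo '<p>', $list, '</p>';

    // chr() yields a single ISO-8859-1 byte, so this only covers code points 0-255.
    // utf8_encode() is deprecated as of PHP 8.2 but needs no mbstring.
    function decode($codePoint)
    {
        return utf8_encode(chr($codePoint));
    }

?>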
