HTML entity decoding in PHP

Question

I seem to be completely unable to get around utf-8 character encoding.

So I'm exporting content from a database as a utf-8 xml file. The software I am importing into is quite strict about character encoding, so I can't just put everything in CDATA tags.

There's a whole bunch of weird characters already in the data as HTML entities, e.g. &rsquo;, &mdash;, &hellip; (that is, ’, —, …).

These don't work in the XML and need to be replaced (normally with just a ' quote).

Ideally, I'd like to decode all the characters, and then use htmlspecialchars($text, ENT_COMPAT, 'UTF-8', FALSE) to encode them back again. But I can't seem to find a function that will decode them. Is there one? I've started to manually go through each entity with a str_replace() but it's turning into a much bigger job than I anticipated.

Any help would be a lifesaver. Thanks

mvds · Accepted Answer · 2010-07-15 18:27:47Z

2

html_entity_decode() perhaps?

In some cases, when dealing with character conversion issues in PHP, it is important to have a locale set. It doesn't matter which one, e.g.

setlocale(LC_CTYPE,'en_US.utf8');

But I would advise that any time invested in getting the encoding right from the beginning, without resorting to entities, is worth it if at all possible.
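As a minimal sketch of the decode-then-re-encode round trip described in the question: the helper name normalize_for_xml() and the sample string are assumptions made for illustration, and the htmlspecialchars() call is copied from the question rather than being a recommendation.

```php
<?php
// Hypothetical helper: the name is illustrative and does not come from the original post.
function normalize_for_xml(string $text): string
{
    // Decode named and numeric entities (&rsquo;, &ndash;, &#8212;, ...) into real UTF-8 characters.
    $decoded = html_entity_decode($text, ENT_QUOTES | ENT_HTML401, 'UTF-8');

    // Re-escape only the markup-special characters, using the call from the question.
    return htmlspecialchars($decoded, ENT_COMPAT, 'UTF-8', false);
}

echo normalize_for_xml('Tom &amp; Jerry&rsquo;s 1990&ndash;2000 &lt;tour&gt;'), "\n";
// prints: Tom &amp; Jerry’s 1990–2000 &lt;tour&gt;
```

Once everything has been decoded, leaving the last argument at its default of true may be the safer choice, because any text that still looks like an entity at that point was meant literally.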


3 Comments

Leon Over a year ago

Thanks, I've been trying html_entity_decode(). But even with the locale set, it still does not seem to convert entities like &ndash;. And yes, my aim now is to remove all these silly characters so that no entities are needed at all. Unfortunately, I have to work with the data I'm given, and I seem to have hit a brick wall as to how I can correct the encoding. The only solution I can see at the moment is a find and replace.

mvds Over a year ago

Maybe you have to install a locale or something, because on my Mac (!) it simply works from the command line:

mac:~$ php
<?php print html_entity_decode("&ndash;",ENT_COMPAT,"UTF-8"); ?>
–

Debian stock lenny: same.

mvds Over a year ago

If you want to get rid of them altogether, use iconv and convert from UTF-8 to ASCII//TRANSLIT or ASCII//IGNORE or something like that.
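A rough sketch of the iconv suggestion, under stated assumptions: the helper name to_ascii() and the byte-stripping fallback are illustrative additions, not part of the comment. What //TRANSLIT substitutes for each character depends on the iconv implementation and possibly on the locale, so no output is shown.

```php
<?php
// The answer suggests setting a locale; "en_US.utf8" may not be installed on every system.
setlocale(LC_CTYPE, 'en_US.utf8');

// Hypothetical helper: the name and the fallback are assumptions made for this sketch.
function to_ascii(string $text): string
{
    // //TRANSLIT asks iconv to approximate characters that ASCII cannot represent.
    $ascii = function_exists('iconv')
        ? @iconv('UTF-8', 'ASCII//TRANSLIT', $text)
        : false;

    if ($ascii === false) {
        // Fallback when iconv is missing or rejects the input: drop every non-ASCII byte.
        $ascii = (string) preg_replace('/[^\x00-\x7F]/', '', $text);
    }

    return $ascii;
}

// Right single quote, en dash and ellipsis, written as escapes to keep the source ASCII-only.
echo to_ascii("It\u{2019}s 1990\u{2013}2000\u{2026}"), "\n";
```

This route throws information away, so it fits the "replace with a plain quote" goal from the question better than it fits preserving the original text.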
