Converting a JavaScript function to a PHP function (used to convert a string to HTML-encoded text)

Question

The page at http://www.unicodetools.com/unicode/convert-to-html.php uses a JavaScript function to convert a string to HTML-encoded text.

The JavaScript is:

function a(b) {
    var c= '';

    for(i=0; i<b.length; i++) {
        if(b.charCodeAt(i)>127) {
            c += '&#' + b.charCodeAt(i) + ';'; 
        } else { 
            c += b.charAt(i); 
        }
  }

  document.forms.conversionForm.outputText.value = c;
}

My attempt at a PHP version is:

function str_to_html_entity($str) {
    $output = NULL;

    for($i = 0; $i < strlen($str); $i++) {
        if(ord($str) > 127) {
            $output .= '&#' + ord($str) + ';'; 
        } else { 
            $output .= substr($str, $i); 
        }
  }

  return $output;
}

echo str_to_html_entity("Thére Àre sôme spëcial charâcters ïn thìs têxt");

My PHP function runs without errors, but the result is not what I expected:

my result:

Thére Àre sôme spëcial charâcters ïn thìs têxthére Àre sôme spëcial charâcters ïn thìs têxtére Àre sôme spëcial charâcters ïn thìs têxt�re Àre sôme spëcial charâcters ïn thìs têxtre Àre sôme spëcial charâcters ïn thìs têxte Àre sôme spëcial charâcters ïn thìs têxt Àre sôme spëcial charâcters ïn thìs têxtÀre sôme spëcial charâcters ïn thìs têxt�re sôme spëcial charâcters ïn thìs têxtre sôme spëcial charâcters ïn thìs têxte sôme spëcial charâcters ïn thìs têxt sôme spëcial charâcters ïn thìs têxtsôme spëcial charâcters ïn thìs têxtôme spëcial charâcters ïn thìs têxt�me spëcial charâcters ïn thìs têxtme spëcial charâcters ïn thìs têxte spëcial charâcters ïn thìs têxt spëcial charâcters ïn thìs têxtspëcial charâcters ïn thìs têxtpëcial charâcters ïn thìs têxtëcial charâcters ïn thìs têxt�cial charâcters ïn thìs têxtcial charâcters ïn thìs têxtial charâcters ïn thìs têxtal charâcters ïn thìs têxtl charâcters ïn thìs têxt charâcters ïn thìs têxtcharâcters ïn thìs têxtharâcters ïn thìs têxtarâcters ïn thìs têxtrâcters ïn thìs têxtâcters ïn thìs têxt�cters ïn thìs têxtcters ïn thìs têxtters ïn thìs têxters ïn thìs têxtrs ïn thìs têxts ïn thìs têxt ïn thìs têxtïn thìs têxt�n thìs têxtn thìs têxt thìs têxtthìs têxthìs têxtìs têxt�s têxts têxt têxttêxtêxt�xtxtt

expected result:

Th&#233;re &#192;re s&#244;me sp&#235;cial char&#226;cters &#239;n th&#236;s t&#234;xt

Could someone please advise what is wrong with my PHP function?

Thanks

UPDATE

function str_to_html_entity($str) {
    $result = null;
    for ($i = 0, $length = mb_strlen($str, 'UTF-8'); $i < $length; $i++) {
        $character = mb_substr($str, $i, 1, 'UTF-8');
        if (strlen($character) > 1) {  // the character consists of more than 1 byte
            $character = htmlentities($character, ENT_COMPAT, 'UTF-8');
        }
        $result .= $character;
    }

  return $result;
}

echo str_to_html_entity("Thére Àre"); // Th&eacute;re &Agrave;re
echo str_to_html_entity("中"); // 中

deceze · Accepted Answer · 2011-11-17 07:47:15Z

2

Generally:

JavaScript strings are sequences of UTF-16 code units, so for the characters in your example str[0] returns one whole character and charCodeAt returns its Unicode code point. (Only characters outside the Basic Multilingual Plane, such as most emoji, take up two code units.)
PHP strings are plain byte arrays, in which a character may take up more than one byte. $str[0] and ord only work on single bytes and will therefore mangle any multi-byte characters. See What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text for an in-depth explanation of this.
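
To make the byte-versus-character distinction concrete, here is a minimal sketch that inspects the character 中 in PHP. It assumes the mbstring extension is loaded and PHP 7.0 or later (for the "\u{...}" escape); the variable names are only illustrative.

```php
<?php
// "\u{4E2D}" is the character 中, written as an escape so the example
// does not depend on how this file is saved (needs PHP 7.0+).
$char = "\u{4E2D}";

$byteCount = strlen($char);              // counts bytes
$charCount = mb_strlen($char, 'UTF-8');  // counts characters

// ord() only ever looks at the first byte of its argument.
$firstByte = ord($char);

// unpack('C*') returns every byte as an unsigned integer.
$bytes = array_values(unpack('C*', $char));

printf("bytes: %d, characters: %d\n", $byteCount, $charCount);
printf("first byte: %d, all bytes: %s\n", $firstByte, implode(' ', $bytes));
// bytes: 3, characters: 1
// first byte: 228, all bytes: 228 184 173
```

A loop that calls ord on each byte therefore sees 228, 184 and 173 rather than the single code point 20013.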

Because of this, you can't replicate the exact same algorithm in PHP. Also, in your loop, you're using the whole $str instead of a string offset, which is your other primary problem. To make it Unicode aware, this is probably the nicest way:

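$result = '';
// '//u' matches the empty string between UTF-8 characters, so the split
// yields one array element per character; PREG_SPLIT_NO_EMPTY drops the
// empty pieces at the start and end.
foreach (preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY) as $character) {
    if (strlen($character) > 1) {  // the character consists of more than 1 byte
        $character = mb_convert_encoding($character, 'HTML-ENTITIES', 'UTF-8');
    }
    $result .= $character;
}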

This expects the string to be UTF-8 encoded. As you can see though, there's a nice function called mb_convert_encoding, which can escape a whole block of text in one go, which you're essentially reinventing. Use it instead.

Alternative version for Unicode-impaired PCREs:

$result = null;
for ($i = 0, $length = mb_strlen($str, 'UTF-8'); $i < $length; $i++) {
    $character = mb_substr($str, $i, 1, 'UTF-8');
    if (strlen($character) > 1) {  // the character consists of more than 1 byte
        $character = mb_convert_encoding($character, 'HTML-ENTITIES', 'UTF-8');
    }
    $result .= $character;
}

But seriously, just use $str = mb_convert_encoding($str, 'HTML-ENTITIES', 'UTF-8') and be done with it. No looping required.
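
If you specifically want decimal entities such as &#233; (the output of the JavaScript version) rather than named ones, mbstring's mb_encode_numericentity may be a closer fit. It is also worth knowing that the 'HTML-ENTITIES' pseudo-encoding has been deprecated since PHP 8.2. The following is a sketch under those assumptions, not a drop-in replacement; the helper name to_numeric_entities is made up for illustration.

```php
<?php
// Hypothetical helper; requires the mbstring extension and PHP 7.0+.
function to_numeric_entities(string $str): string
{
    // Conversion map: [first code point, last code point, offset, mask].
    // Every code point from U+0080 to U+10FFFF becomes a decimal entity;
    // ASCII is left untouched.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($str, $map, 'UTF-8');
}

echo to_numeric_entities("Th\u{E9}re \u{C0}re"), "\n"; // Th&#233;re &#192;re
echo to_numeric_entities("\u{4E2D}"), "\n";            // &#20013;
```

Like the one-liner above, this expects UTF-8 input and needs no explicit loop.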

edited Nov 17, 2011 at 7:47

answered Nov 17, 2011 at 4:41

deceze♦


5 Comments

Charles Yeung Over a year ago

Thanks for the information, but I get a blank page when running your code. Please check the UPDATE in my question.

deceze Over a year ago

Probably a problem with your PCRE extension. See: codepad.org/Ueu55vu4 As alternative, instead of splitting the string, go through the offsets one by one in a multi-byte aware fashion using mb_substr.

Charles Yeung Over a year ago

Sorry, the code produces an error:

Warning: preg_split(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 on line 7

Charles Yeung Over a year ago

Tested; please see the UPDATE in my question. The result is still not what I expected, sorry.

deceze Over a year ago

My bad, htmlentities doesn't escape strings that don't need escaping. See update for mb_convert_encoding.

Gabriel Sosa · Accepted Answer · 2011-11-17 04:37:00Z

1

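Your function has several errors: ord($str) always reads the first byte of the whole string instead of the byte at offset $i, + is arithmetic addition in PHP (string concatenation is .), and substr($str, $i) without a length returns the rest of the string rather than one character. Here is a version with those fixed:

function str_to_html_entity($str) {
    $output = '';

    // Note: strlen, ord and $str[$i] all work on bytes, not characters.
    $length = strlen($str);
    for ($i = 0; $i < $length; $i++) {
        if (ord($str[$i]) > 127) {
            $output .= '&#' . ord($str[$i]) . ';';
        } else {
            $output .= $str[$i];
        }
    }

    return $output;
}

EDIT 1

Also store the length once, before the loop:

    $length = strlen($str);

so that strlen is not called on every iteration.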
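
For comparison, a code-point-aware version of the same loop might look like the sketch below. It assumes UTF-8 input, the mbstring extension, and PHP 7.4 or later (mb_str_split arrived in 7.4, mb_ord in 7.2); the function name str_to_html_entity_cp is only illustrative.

```php
<?php
// Hypothetical variant of the loop that walks characters instead of bytes.
function str_to_html_entity_cp(string $str): string
{
    $output = '';

    // mb_str_split returns one array element per UTF-8 character.
    foreach (mb_str_split($str, 1, 'UTF-8') as $character) {
        // mb_ord returns the Unicode code point of the whole character.
        $codePoint = mb_ord($character, 'UTF-8');

        $output .= $codePoint > 127
            ? '&#' . $codePoint . ';'
            : $character;
    }

    return $output;
}

echo str_to_html_entity_cp("Th\u{E9}re"), "\n";  // Th&#233;re
echo str_to_html_entity_cp("\u{4E2D}"), "\n";    // &#20013;
```

One difference from the JavaScript original: for characters outside the Basic Multilingual Plane this emits a single entity for the full code point, whereas charCodeAt would produce one entity per surrogate half.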

answered Nov 17, 2011 at 4:37

Gabriel Sosa


3 Comments

Charles Yeung Over a year ago

+1, thanks for the fix. However, when I pass in the Chinese character 中, I get 中 from http://www.unicodetools.com/unicode/convert-to-html.php while I get ä¸ from my PHP function. Do you know what is wrong? Thanks.

deceze Over a year ago

@Charles As I described, PHP strings and ord work on one byte at a time. The three-byte character "中" therefore ends up occupying three single-byte entities. PHP doesn't have a built-in multibyte ord.

Charles Yeung Over a year ago

Is there any other way to fix the issue?
