The page at http://www.unicodetools.com/unicode/convert-to-html.php uses a JavaScript function to convert a string to HTML-encoded text.

The JavaScript is:

function a(b) {
    var c = '';

    for (var i = 0; i < b.length; i++) {
        if (b.charCodeAt(i) > 127) {
            c += '&#' + b.charCodeAt(i) + ';';
        } else {
            c += b.charAt(i);
        }
    }

    document.forms.conversionForm.outputText.value = c;
}

My attempt at a PHP version is:

function str_to_html_entity($str) {
    $output = NULL;

    for($i = 0; $i < strlen($str); $i++) {
        if(ord($str) > 127) {
            $output .= '&#' + ord($str) + ';'; 
        } else { 
            $output .= substr($str, $i); 
        }
  }

  return $output;
}

echo str_to_html_entity("Thére Àre sôme spëcial charâcters ïn thìs têxt");

My PHP function runs without errors, but the result is not what I expected.

My result:

Thére Àre sôme spëcial charâcters ïn thìs têxthére Àre sôme spëcial charâcters ïn thìs têxtére Àre sôme spëcial charâcters ïn thìs têxt�re Àre sôme spëcial charâcters ïn thìs têxtre Àre sôme spëcial charâcters ïn thìs têxte Àre sôme spëcial charâcters ïn thìs têxt Àre sôme spëcial charâcters ïn thìs têxtÀre sôme spëcial charâcters ïn thìs têxt�re sôme spëcial charâcters ïn thìs têxtre sôme spëcial charâcters ïn thìs têxte sôme spëcial charâcters ïn thìs têxt sôme spëcial charâcters ïn thìs têxtsôme spëcial charâcters ïn thìs têxtôme spëcial charâcters ïn thìs têxt�me spëcial charâcters ïn thìs têxtme spëcial charâcters ïn thìs têxte spëcial charâcters ïn thìs têxt spëcial charâcters ïn thìs têxtspëcial charâcters ïn thìs têxtpëcial charâcters ïn thìs têxtëcial charâcters ïn thìs têxt�cial charâcters ïn thìs têxtcial charâcters ïn thìs têxtial charâcters ïn thìs têxtal charâcters ïn thìs têxtl charâcters ïn thìs têxt charâcters ïn thìs têxtcharâcters ïn thìs têxtharâcters ïn thìs têxtarâcters ïn thìs têxtrâcters ïn thìs têxtâcters ïn thìs têxt�cters ïn thìs têxtcters ïn thìs têxtters ïn thìs têxters ïn thìs têxtrs ïn thìs têxts ïn thìs têxt ïn thìs têxtïn thìs têxt�n thìs têxtn thìs têxt thìs têxtthìs têxthìs têxtìs têxt�s têxts têxt têxttêxtêxt�xtxtt

Expected result:

Th&#233;re &#192;re s&#244;me sp&#235;cial char&#226;cters &#239;n th&#236;s t&#234;xt

Could someone please advise what is wrong with my PHP function?

Thanks

UPDATE

function str_to_html_entity($str) {
    $result = null;
    for ($i = 0, $length = mb_strlen($str, 'UTF-8'); $i < $length; $i++) {
        $character = mb_substr($str, $i, 1, 'UTF-8');
        if (strlen($character) > 1) {  // the character consists of more than 1 byte
            $character = htmlentities($character, ENT_COMPAT, 'UTF-8');
        }
        $result .= $character;
    }

  return $result;
}

echo str_to_html_entity("Thére Àre"); // Th&eacute;re &Agrave;re
echo str_to_html_entity("中"); // 中

2 Answers

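Generally, JavaScript strings are Unicode-aware, so charCodeAt returns the code of a whole character (a UTF-16 code unit), whereas PHP strings and ord work on one byte at a time.

Because of this, you can't replicate the exact same algorithm in PHP. Also, in your loop you're passing the whole $str to ord and substr instead of a single string offset, which is your other primary problem. To make it Unicode-aware, this is probably the nicest way: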

$result = '';
// An empty pattern with the /u modifier splits between UTF-8 characters
foreach (preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY) as $character) {
    if (strlen($character) > 1) {  // the character consists of more than 1 byte
        $character = mb_convert_encoding($character, 'HTML-ENTITIES', 'UTF-8');
    }
    $result .= $character;
}

This expects the string to be UTF-8 encoded. As you can see though, there's a nice function called mb_convert_encoding, which can escape a whole block of text in one go, which you're essentially reinventing. Use it instead.

Alternative version for Unicode-impaired PCREs:

$result = null;
for ($i = 0, $length = mb_strlen($str, 'UTF-8'); $i < $length; $i++) {
    $character = mb_substr($str, $i, 1, 'UTF-8');
    if (strlen($character) > 1) {  // the character consists of more than 1 byte
        $character = mb_convert_encoding($character, 'HTML-ENTITIES', 'UTF-8');
    }
    $result .= $character;
}

But seriously, just use $str = mb_convert_encoding($str, 'HTML-ENTITIES', 'UTF-8') and be done with it. No looping required.
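
If the goal is the decimal numeric entities shown in the question's expected result, a rough sketch along the following lines may help. It is not from the original thread: the function name str_to_numeric_entities is made up for illustration, and it assumes UTF-8 input, the mbstring extension, and PHP 7.4 or later (mb_ord arrived in PHP 7.2 and mb_str_split in PHP 7.4). It may also be worth knowing that the 'HTML-ENTITIES' pseudo-encoding of mbstring is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper, not part of the original answers.
// Assumes UTF-8 input, the mbstring extension, and PHP 7.4+.
function str_to_numeric_entities(string $str): string
{
    $output = '';

    // mb_str_split() yields whole characters rather than single bytes.
    foreach (mb_str_split($str, 1, 'UTF-8') as $character) {
        // mb_ord() returns the Unicode code point of the character.
        $codePoint = mb_ord($character, 'UTF-8');

        // Anything outside the ASCII range becomes a decimal entity.
        $output .= $codePoint > 127 ? '&#' . $codePoint . ';' : $character;
    }

    return $output;
}

// Assumes this source file itself is saved as UTF-8.
echo str_to_numeric_entities("Thére Àre"), "\n"; // Th&#233;re &#192;re
echo str_to_numeric_entities("中"), "\n";         // &#20013;
```

Because each character is handled as a whole code point, a multibyte character such as "中" produces a single entity instead of one entity per byte.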

5 Comments

Thanks for the information, but I get a blank page when running your code. Please check the UPDATE on my question.
Probably a problem with your PCRE extension. See: codepad.org/Ueu55vu4 As an alternative, instead of splitting the string, go through the offsets one by one in a multibyte-aware fashion using mb_substr.
Sorry, the code produces an error: Warning: preg_split(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 on line 7.
Tested, please see my UPDATE. The result is still not what I expected, sorry.
My bad, htmlentities doesn't escape strings that don't need escaping. See update for mb_convert_encoding.

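You have several errors in your function. Here is a version with some fixes:

function str_to_html_entity($str) {
    $output = '';

    // Compute the length once instead of on every iteration
    $length = strlen($str);
    for ($i = 0; $i < $length; $i++) {
        // Read the single byte at offset $i, not the whole string
        if (ord($str[$i]) > 127) {
            // PHP concatenates strings with '.', not '+'
            $output .= '&#' . ord($str[$i]) . ';';
        } else {
            $output .= $str[$i];
        }
    }

    return $output;
}

EDIT 1

Also compute

    $length = strlen($str);

once before the loop, so that strlen is not called on every iteration.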
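
One caveat with this byte-wise approach: strlen and ord operate on bytes, so a multibyte UTF-8 character produces one entity per byte. A minimal sketch of the effect, assuming the source file is saved as UTF-8 (the variable names are only for illustration):

```php
<?php
// "中" is U+4E2D, which UTF-8 encodes as the three bytes E4 B8 AD.
$char = "中";

echo strlen($char), "\n"; // 3 (bytes, not characters)

// Collect the value of each byte, the way the loop above sees the string.
$bytes = [];
for ($i = 0; $i < strlen($char); $i++) {
    $bytes[] = ord($char[$i]);
}

echo implode(' ', $bytes), "\n"; // 228 184 173
```

Those three byte values are what end up as three separate entities, whereas a code-point-aware conversion would emit a single one.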

3 Comments

+1, thanks for the fix. However, when I enter a Chinese character, I get &#20013; from http://www.unicodetools.com/unicode/convert-to-html.php while I get &#228;&#184;&#173; from my PHP function. Do you know what is wrong? Thanks.
@Charles As I described, PHP strings and ord work on one byte at a time. The three-byte character "中" therefore ends up occupying three single-byte entities. PHP doesn't have a built-in multibyte ord.
Is there any other way to fix the issue?
