PHP - Do I need any UTF-8 encoding/decoding?

Question

I am writing PHP comments into a UTF-8 file as markers, and the function below reads the file back and removes the text between those comments. Do I need to do anything different for this to work with UTF-8 files, or will the code below work as is? In particular, do I need the utf8_decode and/or utf8_encode functions, or perhaps iconv?

// This holds the current file we are working on.
$lang_file = 'files/DreamTemplates.russian-utf8.php';

// Can't read from the file if it doesn't exist now can we?
if (!file_exists($lang_file))
    continue;

// This helps to remove the language strings for the template, since the comment is unique
$template_begin_comment = '// ' . ' Template - ' . $lang_file . ' BEGIN...';
$template_end_comment = '// ' . ' Template - ' . $lang_file . ' END!';

$fp = fopen($lang_file, 'rb');
$content = fread($fp, filesize($lang_file));
fclose($fp);

// Searching within the string, extracting only what we need.
$start = strpos($content, $template_begin_comment);
$end = strpos($content, $template_end_comment);

// We can't do this unless both are found.
if ($start !== false && $end !== false)
{
    $begin = substr($content, 0, $start);
    $finish = substr($content, $end + strlen($template_end_comment));

    $new_content = $begin . $finish;

    // Write it into the file.
    $fo = fopen($lang_file, 'wb');
    @fwrite($fo, $new_content);
    fclose($fo);
}

Thanks for your help on this concerning UTF-8 encoding and decoding on strings, even if they are commented strings.

When I write the PHP comments into the UTF-8 file I am not using any conversion. Should I be? The string definitions between the PHP comments are already encoded in UTF-8, however, and seem to work fine within the file. Any help is appreciated.

When you run the code, are you experiencing any problems with it? Are the Russian characters being mangled anywhere that the file is used? Can you open the files written by PHP in a text editor and do the characters appear as expected?
– curtisdf, Commented Jun 13, 2012 at 4:58
I'm unable to test this at the moment because I don't have a UTF-8 file with the actual content in my test environment. I am just wondering whether this approach will work without any UTF-8 encoding and/or decoding, for PHP comments ONLY. I write the PHP comments into the file earlier, and the above function should remove all of it. I just need someone to confirm whether this is the best way to do this for UTF-8 files, or whether it should be done a different way.
– Solomon Closson, Commented Jun 13, 2012 at 5:03
This will probably give you some more insight: What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text
– deceze ♦, Commented Jun 13, 2012 at 5:17

goat · Accepted Answer · 2012-06-13 05:15:23Z

1

No, you don't need to do any conversions.

Also, your extraction code is reliable in the sense that it won't mangle multibyte characters, although you might want to make sure the end position occurs after the start position.
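
The ordering check mentioned above could be sketched roughly as follows. This is only an illustration: `remove_marked_block` is a hypothetical helper name, not something from the question, and it assumes PHP 7 or later for the type declarations. It searches for the end marker starting just past the begin marker, so a stray end marker earlier in the file cannot produce a reversed range.

```php
<?php
// Hypothetical helper (not from the question): removes everything from the
// begin marker through the end marker, but only when the end marker is found
// after the begin marker. Otherwise the content is returned unchanged.
function remove_marked_block(string $content, string $begin, string $end): string
{
    $start = strpos($content, $begin);
    if ($start === false) {
        return $content;
    }

    // Search for the end marker only past the begin marker.
    $stop = strpos($content, $end, $start + strlen($begin));
    if ($stop === false) {
        return $content;
    }

    // Byte offsets are fine here: both cut points sit on the edges of
    // ASCII-only markers, so no multibyte character can be split.
    return substr($content, 0, $start) . substr($content, $stop + strlen($end));
}
```

With the variables from the question, the call would look like `remove_marked_block($content, $template_begin_comment, $template_end_comment)`.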


1 Comment

Solomon Closson Over a year ago

Ok, thanks for answering this question bro, and for letting me know that it will be a reliable extraction point also. Cheers :)

Ja͢ck · Accepted Answer · 2012-06-13 05:01:52Z

1

To do this I would use preg_replace instead:

$content = file_get_contents($lang_file);

$template_begin_comment = '// ' . ' Template - ' . $lang_file . ' BEGIN...';
$template_end_comment = '// ' . ' Template - ' . $lang_file . ' END!';

// find from begin comment to end comment
// replace with emptiness
// keep track of how many replacements have been made
$new_content = preg_replace('/' . 
      preg_quote($template_begin_comment, '/') . 
      '.*?' . 
      preg_quote($template_end_comment, '/') . '/s', 
    '', 
    $content, 
    -1, 
    $replace_count
);

if ($replace_count) {
  // if replacements have been made, write the file back again
  file_put_contents($lang_file, $new_content);
}

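Because the markers you match contain only ASCII, this approach is safe: UTF-8 never uses ASCII byte values inside a multibyte sequence, so a match cannot start or end in the middle of a character, and everything else is copied verbatim.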

Disclaimer

The code above is not tested; if there's anything wrong, just let me know.
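
One way to convince yourself that ASCII-only markers leave the surrounding UTF-8 intact is a small self-check like the sketch below. The sample content and the `demo` marker names are made up for illustration, and the script itself is assumed to be saved as UTF-8. It uses `preg_match('//u', ...)` as a validity check, since the `/u` modifier makes PCRE reject subjects that are not valid UTF-8.

```php
<?php
// Made-up sample content for illustration; the markers are ASCII only.
$begin = '// Template - demo BEGIN...';
$end   = '// Template - demo END!';
$content = "\$txt['hello'] = 'Привет';\n"
    . $begin . "\n\$txt['bye'] = 'Пока';\n" . $end . "\n"
    . "\$txt['thanks'] = 'Спасибо';\n";

// Same pattern construction as in the answer above.
$pattern = '/' . preg_quote($begin, '/') . '.*?' . preg_quote($end, '/') . '/s';
$new_content = preg_replace($pattern, '', $content);

// preg_match with /u returns 1 only when the subject is valid UTF-8.
$still_valid = preg_match('//u', $new_content) === 1;

var_dump($still_valid);
```

The strings outside the markers should come through byte for byte, and the string between the markers should be gone.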


4 Comments

Solomon Closson Over a year ago

Hello, I see your approach, but can you tell me what is wrong with my approach, if anything? Also, I will need to get rid of the comments as well, does this approach do that? Furthermore, the file being written to contains many many php comments and I don't want to remove anything other than what is in between the $template_begin_comment and $template_end_comment. That, and only that needs to be removed. The rest of the text in there should not be touched.

Ja͢ck Over a year ago

@SolomonClosson I didn't see anything wrong with your approach, it was just using more code :) String operations are binary safe in PHP ... that's not to say they will be aware of Unicode, though.
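
To illustrate the difference between "binary safe" and "Unicode aware", here is a minimal sketch with a made-up sample word, again assuming the script is saved as UTF-8. The byte-oriented functions count and cut bytes, so an arbitrary offset can land inside a character, which is not a concern when the offsets come from ASCII markers.

```php
<?php
$word = 'Привет'; // 6 Cyrillic letters, 2 bytes each in UTF-8

$byte_length = strlen($word);       // counts bytes, not characters
$first_bytes = substr($word, 0, 3); // cuts through the second letter

// The /u modifier rejects invalid UTF-8, so this check fails for the cut string.
$cut_is_valid = preg_match('//u', $first_bytes) === 1;

var_dump($byte_length, $cut_is_valid);
```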

Solomon Closson Over a year ago

Also, I will need to get rid of the comments as well, does this approach do that? Furthermore, the file being written to contains many many php comments and I don't want to remove anything other than what is in between the $template_begin_comment and $template_end_comment. That, and only that needs to be removed. The rest of the text in there should not be touched.

Ja͢ck Over a year ago

This approach takes care of all occurrences of text between the two comments (and the comments are also removed).
