I just started dabbling in PHP and I'm afraid I need some help figuring out how to manipulate UTF-8 strings.

I'm working on Ubuntu 11.10 x86 with PHP version 5.3.6-13ubuntu3.2. I have a UTF-8 encoded file (vim's :set encoding confirms this), which I read using

$file = fopen("file.txt", "r");
// fgets() returns false at end of file, so test its result directly
while (($line = fgets($file)) !== false) {
    //...
}
fclose($file);
  • Calling mb_detect_encoding($line) reports UTF-8
  • If I do echo $line, I can see the line properly (no mangled characters) in the browser
    • so I guess everything is fine with the browser and Apache, though I did search my Apache configuration for AddDefaultCharset and tried adding HTTP meta tags for the character encoding (just in case)

When i try to split the string using $arr = mb_split(';',$line) the fields of the resulting array contain mangled utf-8 characters (mb_detect_encoding($arr[0]) reports utf-8 as well).

So echo $arr[0] will result in something like this: ΑΘΗÎÎ.

I have tried setting mb_detect_order('utf-8'), mb_internal_encoding('utf-8'), but nothing changed. I also tried to manually detect utf-8 using this w3 perl regex because i read somewhere that mb_detect_encoding can sometimes fail (myth?), but results were the same as well.

So my question is how can i properly split the string? Is going down the mb_ path the wrong way? What am I missing?

Thank you for your help!

UPDATE: I'm adding sample strings and base64 equivalents (thanks to @chris' for his suggestion)

1. original string: "ΑΘΗΝΑ;ΑΙΓΑΛΕΩ;12242;37.99452;23.6889"
2. base64 encoded: "zpHOmM6Xzp3OkTvOkc6ZzpPOkc6bzpXOqTsxMjI0MjszNy45OTQ1MjsyMy42ODg5"
3. first part (the equivalent of "ΑΘΗΝΑ") base64 encoded before splitting: "zpHOmM6Xzp3OkQ=="
4. first part ($arr[0] after splitting): "ΑΘΗÎΑ"
5. first part after splitting base64 encoded: "77u/zpHOmM6Xzp3OkQ=="

Ok, so after doing this there seems to be a 77u/ difference between 3. and 5. which according to this is a utf-8 BOM mark. So how can i avoid it?

UPDATE 2: I woke up refreshed today and, with your tips in mind, tried again. It seems that $line = fgets($file) reads the first line correctly (no mangled chars) and fails for each subsequent line. So I base64-encoded the first and second lines, and the 77u/ BOM appeared in the base64'd string of the first line only. I then opened the offending file in vim and entered :set nobomb followed by :w to save the file without the BOM. Firing up PHP again showed that the first line was now mangled as well. Based on @hakre's remove_utf8_bom I added its complementary function

function add_utf8_bom($str){
    $bom= "\xEF\xBB\xBF";
    return substr($str,0,3)===$bom?$str:$bom.$str;
}

and voila each line is read correctly now.

I do not much like this solution, as it seems very very hackish (i can't believe that an entire framework/language does not provide for a way to deal with nobombed strings). So do you know of an alternate approach? Otherwise I'll proceed with the above.

Thanks to @chris, @hakre and @jacob for their time!

UPDATE 3 (solution): It turns out after all that it was a browser thing: it was not enough to add header('Content-type: text/html; charset=UTF-8') and meta tags like <meta http-equiv="Content-type" content="text/html; charset=UTF-8" />. The output also had to be properly enclosed inside an <html><body> section, or the browser would not detect the encoding correctly. Thanks to @jake for his suggestion.

Moral of the story: I should learn more about HTML before trying to code for the browser in the first place. Thanks for your help and patience, everyone.

  • I recommend you post sample strings (before and after the split) for people to inspect. To preserve them binary safe, base64_encode() them, otherwise the fine details won't be preserved through the web browsers and stackoverflow etc... Commented Dec 3, 2011 at 18:39
  • @chris +1 it seems that with base64 you might be on to something Commented Dec 3, 2011 at 19:36
  • Something is really odd here. I always use UTF8 strings without BOM in PHP and it works without any issues. How do you output the variables? do you just do echo $line? Are you outputting a whole webpage, i.e. with doctype, header, etc? Or are you using PHP on the command line? Commented Dec 4, 2011 at 11:07
  • @jakob i use a test.php file in a standalone website (ie no wordpress environment or the like is loaded) that is served with apache2, which i then browse to with firefox. I just do echo $line as you say, and then i progressively tried with meta tags and header() and whatnot to declare utf-8 encoding, in hopes that it was something like this, nothing though. I don't contest that the problem lies somewhere in what i do, i just can't tell what it is! Commented Dec 4, 2011 at 11:29
  • @bottlenecked: I don't know if you are doing it already, but try to output valid HTML in your test.php file, i.e. before you write echo $line, write something like echo '<!DOCTYPE html><html><head><meta charset=utf-8><title>Test Page</title></head><body>';. Commented Dec 4, 2011 at 15:54

4 Answers

UTF-8 has the very nice feature that it is ASCII-compatible. By this I mean that:

  • ASCII characters stay the same when encoded to UTF-8
  • no non-ASCII character is ever encoded using ASCII bytes (every byte of a multi-byte sequence has its high bit set)

This means that when you try to split a UTF-8 string by the semicolon character ;, which is an ASCII character, you can just use standard single byte string functions.

In your example, you can just use explode(';',$utf8encodedText) and everything should work as expected.

PS: Since UTF-8 is self-synchronizing (the byte sequence of one character never occurs inside, or straddling, the sequences of other characters), you can actually use explode() with any UTF-8 encoded separator.

PPS: It looks like you are trying to parse a CSV file. Have a look at the fgetcsv() function. It should work perfectly on UTF-8 encoded strings as long as you use ASCII characters for separators, quotes, etc.
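
As a minimal sketch of the point above (not code from the original thread), the snippet below splits the question's sample line with the byte-oriented explode(). The line is rebuilt from the base64 string posted in the question, so the result does not depend on how this snippet itself is saved; the variable names are illustrative.

```php
<?php
// Sample line from the question, decoded from its base64 form so the
// bytes are exactly the UTF-8 bytes of the original file line.
$line = base64_decode('zpHOmM6Xzp3OkTvOkc6ZzpPOkc6bzpXOqTsxMjI0MjszNy45OTQ1MjsyMy42ODg5');

// ';' is a single ASCII byte and never occurs inside a multi-byte UTF-8
// sequence, so explode() cannot cut a character in half.
$fields = explode(';', $line);

echo count($fields), "\n";       // number of fields found
echo bin2hex($fields[0]), "\n";  // raw bytes of the first field, browser-independent
echo $fields[2], "\n";           // a plain ASCII field
```

Printing bin2hex() of a field is a handy way to check the bytes without relying on how a browser or terminal decodes them.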


3 Comments

indeed, explode was what I used at first, and when i couldn't get it to work it later led me to read about mbstrings
Then your problem might be that the output encoding of the html page is not UTF-8. Check if you have <meta charset=utf-8> somewhere in the page header!
i tried that (it's mentioned somewhere in the overlong problem statement too) but again nada. I also updated the question with new findings again.

When you write debug/testing scripts in php, make sure you output a more or less valid HTML page.

I like to use a PHP file similar to the following:

<!DOCTYPE html>
<html>
  <head>
    <meta charset=utf-8>
    <title>Test page for project XY</title>
  </head>
  <body>
     <h1>Test Page</h1>
     <pre><?php
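        // Escape the dump so request data cannot inject markup into the page
        echo htmlspecialchars(print_r($_GET, true), ENT_QUOTES, 'UTF-8');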
     ?></pre>
  </body>
</html>

If you don't include any HTML tags, the browser might interpret the file as a text file and all kinds of weird things could happen. In your case, I assume the browser interpreted the file as a Latin1 encoded text file. I assume it worked with the BOM, because whenever the BOM was present, the browser recognized the file as a UTF-8 file.
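
One way to make this a habit is a small wrapper, sketched here under assumptions: debug_page() is a hypothetical helper name (not part of PHP or of the original answer) that wraps any debug text in a minimal HTML5 page declaring UTF-8, and the encoding is also declared in the HTTP header when headers can still be sent.

```php
<?php
// Hypothetical helper: wrap debug text in a minimal HTML5 page that
// declares UTF-8, escaping the text so it cannot break the markup.
function debug_page($title, $text)
{
    $esc = function ($s) {
        return htmlspecialchars($s, ENT_QUOTES, 'UTF-8');
    };

    return "<!DOCTYPE html>\n"
        . "<html>\n<head>\n<meta charset=\"utf-8\">\n"
        . "<title>" . $esc($title) . "</title>\n</head>\n<body>\n"
        . "<pre>" . $esc($text) . "</pre>\n"
        . "</body>\n</html>\n";
}

// Declare the encoding at the HTTP level too, if nothing was output yet.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

// "ΑΘΗΝΑ" from the question, decoded from its base64 form.
echo debug_page('Test page', base64_decode('zpHOmM6Xzp3OkQ=='));
```

With both the header and the meta tag in place, the browser no longer has to guess the encoding from the content.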


Edit: I just read your post more closely. You're suggesting that a BOM was introduced by mb_split(), in which case the following should output false:

header('content-type: text/plain;charset=utf-8');
$s = "zpHOmM6Xzp3OkTvOkc6ZzpPOkc6bzpXOqTsxMjI0MjszNy45OTQ1MjsyMy42ODg5";
$str = base64_decode($s);

$pieces = mb_split(';', $str);

// "ΑΘΗΝΑ" is 5 characters of 2 bytes each, hence the first 10 bytes
var_dump(substr($str, 0, 10) === $pieces[0]);
var_dump($pieces);

Does it? It works as expected for me (bool(true), and the strings in the array are correct).

3 Comments

yes, it is working just as you say. The problem seems to come up when reading the same line from the file itself
Are you sure you didn't goof when posting the base64_encoded strings? Because the orig base64 string doesn't have a BOM, and I assume it was supposed to be the value returned directly from fgets, first line too.
yep. goofed. It was a "manually copy line from editor then paste into php file as argument to base64_encode" kind of thing, since i didn't at the moment understand full implications of this. Sorry for the red herring :(

The mb_split function should be fine, but you should also define the charset it uses with mb_regex_encoding:

mb_regex_encoding('UTF-8');

About mb_detect_encoding: it can fail, simply because an encoding can never be detected with certainty. You either know the encoding or you can make an educated guess, but that's all. Encoding detection is mostly a gambling game; however, you can use the strict parameter with that function and specify the encoding(s) you're looking for.

How to remove the BOM:
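
A minimal sketch of the strict check, assuming the mbstring extension is loaded. The sample bytes come from the question's base64 string; the truncated sequence is a made-up example. mb_check_encoding() is shown alongside because, when you already know which encoding to expect, validating against it is more direct than detecting.

```php
<?php
// "ΑΘΗΝΑ" from the question, decoded from its base64 form (valid UTF-8).
$valid = base64_decode('zpHOmM6Xzp3OkQ==');
// Made-up invalid input: a UTF-8 lead byte with no continuation byte.
$broken = "\xCE";

// Strict detection against an explicit candidate list: the result is the
// name of the matching encoding, or false if none of the candidates fits.
var_dump(mb_detect_encoding($valid, 'UTF-8', true));
var_dump(mb_detect_encoding($broken, 'UTF-8', true));

// Validation instead of detection: is this byte string well-formed UTF-8?
var_dump(mb_check_encoding($valid, 'UTF-8'));
var_dump(mb_check_encoding($broken, 'UTF-8'));
```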

You can filter the string input and remove a UTF-8 bom with a small helper function:

/**
 * remove UTF-8 BOM if string has it at the beginning
 *
 * @param string $str
 * @return string
 */
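function remove_utf8_bom($str)
{
   // Compare the first three bytes with the UTF-8 BOM (EF BB BF).
   // Assigning inside the condition would fail here: && binds tighter than =.
   if (substr($str, 0, 3) === "\xEF\xBB\xBF")
   {
       $str = (string) substr($str, 3);
   }
   return $str;
}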

Usage:

$line = remove_utf8_bom($line);

There are probably better ways to do it, but this should work.
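
Since a BOM can only occur at the very start of a file, it is enough to check the first line. The sketch below is one possible way to combine that with the reading loop from the question. It writes its own temporary test file (made-up data modelled on the question's sample line) so that it runs standalone, and it inlines the BOM check instead of calling the helper above.

```php
<?php
// Build a BOM-prefixed test file with two identical rows.
$row  = base64_decode('zpHOmM6Xzp3OkTvOkc6ZzpPOkc6bzpXOqTsxMjI0MjszNy45OTQ1MjsyMy42ODg5');
$path = tempnam(sys_get_temp_dir(), 'bom');
file_put_contents($path, "\xEF\xBB\xBF" . $row . "\n" . $row . "\n");

$rows  = array();
$first = true;
$file  = fopen($path, 'r');
while (($line = fgets($file)) !== false) {
    if ($first) {
        // Strip the UTF-8 BOM (EF BB BF) if the file starts with one.
        if (substr($line, 0, 3) === "\xEF\xBB\xBF") {
            $line = (string) substr($line, 3);
        }
        $first = false;
    }
    // Drop the line ending, then split on the ASCII separator.
    $rows[] = explode(';', rtrim($line, "\r\n"));
}
fclose($file);
unlink($path);

// Both rows should now start with the same bytes.
var_dump($rows[0][0] === $rows[1][0]);
```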

3 Comments

I have no problems with your string, actually even a simple explode should work with an UTF-8 encoded string. See codepad.viper-7.com/eODqA5 - Looks like you view the result as ISO-8859-*.
using the add_utf8_bom, explode works as expected for each line. If a better (ie less hackish) solution does not come up i will accept this answer
The less hacky way is to save file.txt w/o BOM. That's what's suggested first for such problems, see unicode.org/faq/utf_bom.html#BOM . Also learn what you need to do in vim to remove the BOM if the file already contains one. mb_split works fine in my eyes, as it should preserve the BOM as it's a valid unicode code-point as well: fileformat.info/info/unicode/char/feff/index.htm - so you better give your application the string that's correctly encoded firsthand or you fix this before parsing or you just continue to use the hack ;)
