
Hi all, I've run into a tricky problem. I need to read some files and convert their content into XML files. I believe most lines are plain ASCII, so I can just read each line in PHP and write it to an XML file with the default XML encoding, 'UTF-8'. However, I noticed that the original files may also contain text in GBK or GB2312 (Chinese), SJIS (Japanese), and so on. PHP has no problem writing those strings into the XML as-is, but the XML parser then detects invalid UTF-8 sequences and crashes.

Now, I think the best PHP library function for my purpose is probably:

 $decode_str = mb_convert_encoding($str, 'UTF-8', 'auto');

I tried running this conversion function on each line before inserting it into the XML. However, after testing with some UTF-16 and GBK input, I don't think this function can correctly detect the encoding of the input string.
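To make the detection step concrete, here is a minimal sketch that passes an explicit candidate list to mb_detect_encoding() in strict mode instead of relying on 'auto'. The helper name to_utf8_line and the candidate list are assumptions for illustration. Detection is still a guess: short GBK byte sequences are often valid SJIS as well, so the order and choice of candidates matter.

```php
<?php
// Hypothetical helper: convert one line to UTF-8 using an explicit candidate list.
// The default candidates are an assumption; adjust them to the encodings you expect.
function to_utf8_line(string $line, array $candidates = ['UTF-8', 'SJIS', 'CP936']): ?string
{
    // Strict mode rejects candidates for which the bytes are not a valid sequence.
    $from = mb_detect_encoding($line, $candidates, true);
    if ($from === false) {
        return null; // no candidate fits; the caller decides what to do with the line
    }
    return mb_convert_encoding($line, 'UTF-8', $from);
}
```

If the encoding of each file is known in advance, passing it directly as the third argument of mb_convert_encoding() avoids the guessing altogether.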

In addition, I tried wrapping the string in CDATA, but oddly the XML parser still complains about invalid UTF-8 sequences. Of course, when I open the XML file in vim, the content inside the CDATA is a mess.
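CDATA only turns off markup parsing; the bytes inside it must still be valid in the document's declared encoding. A rough sketch of guarding the write step, assuming the dom and mbstring extensions are available (the function name lines_to_xml and the element names are made up for illustration):

```php
<?php
// Hypothetical helper: wrap lines in a small XML document, skipping non-UTF-8 input.
function lines_to_xml(array $lines): string
{
    $doc  = new DOMDocument('1.0', 'UTF-8');
    $root = $doc->appendChild($doc->createElement('lines'));

    foreach ($lines as $line) {
        if (!mb_check_encoding($line, 'UTF-8')) {
            // Not valid UTF-8: CDATA would not help, so skip (or convert) the line here.
            continue;
        }
        // Drop characters that XML 1.0 does not allow, such as most control characters.
        $clean = preg_replace(
            '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
            '',
            $line
        );
        $item = $doc->createElement('line');
        $item->appendChild($doc->createTextNode($clean)); // escapes & and < automatically
        $root->appendChild($item);
    }

    return $doc->saveXML();
}
```

In this sketch invalid lines are silently skipped; in practice you would probably convert them first or log them.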

Any suggestions?

  • Did you try the iconv() function? Commented Mar 4, 2011 at 8:00

2 Answers


I once spent a lot of time writing a safe UTF-8 encoding function:

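function _convert($content) {
    // Convert only if the string is not valid UTF-8, or does not survive
    // a UTF-8 -> UTF-32 -> UTF-8 round trip unchanged.
    if (!mb_check_encoding($content, 'UTF-8')
        || $content !== mb_convert_encoding(mb_convert_encoding($content, 'UTF-32', 'UTF-8'), 'UTF-8', 'UTF-32')) {

        // No source encoding is given, so mbstring's internal encoding is assumed.
        $content = mb_convert_encoding($content, 'UTF-8');

        if (mb_check_encoding($content, 'UTF-8')) {
            // log('Converted to UTF-8');
        } else {
            // log('Could not be converted to UTF-8');
        }
    }
    return $content;
}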

The main problem was figuring out which encoding the input string already uses. Please tell me if my solution works for you as well!
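For comparison, here is a minimal variant of the same idea that names the source encoding explicitly instead of relying on the internal encoding. The function name ensure_utf8 and the CP936 (GBK) fallback are assumptions for illustration, not part of the function above.

```php
<?php
// Hypothetical variant: leave valid UTF-8 untouched, otherwise convert from one assumed encoding.
function ensure_utf8(string $content, string $fallback = 'CP936'): string
{
    if (mb_check_encoding($content, 'UTF-8')) {
        return $content; // already valid UTF-8 (plain ASCII included)
    }
    // Assumption: anything that is not valid UTF-8 was written in $fallback.
    return mb_convert_encoding($content, 'UTF-8', $fallback);
}

// "\xD6\xD0\xCE\xC4" is "中文" encoded as GBK.
$converted = ensure_utf8("\xD6\xD0\xCE\xC4");
```

This only works when a single legacy encoding is expected per file; mixed input still needs some form of detection.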


I ran into this problem while using json_encode. I use the following function, which replaces multibyte UTF-8 sequences with numeric entities so that only ASCII remains. Source: https://www.php.net/manual/en/function.json-encode.php

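function ascii_to_entities($str)
{
    $count = 1;        // expected byte length of the current UTF-8 sequence
    $out   = '';
    $temp  = array();  // bytes collected for the current multibyte character

    for ($i = 0, $s = strlen($str); $i < $s; $i++)
    {
        $ordinal = ord($str[$i]);

        if ($ordinal < 128)
        {
            // A single pending high byte followed by ASCII is emitted as its own entity.
            if (count($temp) == 1)
            {
                $out  .= '&#'.array_shift($temp).';';
                $count = 1;
            }

            $out .= $str[$i];
        }
        else
        {
            // A lead byte below 224 starts a 2-byte sequence; otherwise 3 bytes are assumed.
            if (count($temp) == 0)
            {
                $count = ($ordinal < 224) ? 2 : 3;
            }

            $temp[] = $ordinal;

            if (count($temp) == $count)
            {
                // Reassemble the code point from the payload bits of each byte.
                $number = ($count == 3)
                    ? (($temp[0] % 16) * 4096) + (($temp[1] % 64) * 64) + ($temp[2] % 64)
                    : (($temp[0] % 32) * 64) + ($temp[1] % 64);

                $out  .= '&#'.$number.';';
                $count = 1;
                $temp  = array();
            }
        }
    }

    return $out;
}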
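If the mbstring extension is available, the built-in mb_encode_numericentity() can do a similar job. This is a sketch under that assumption; the wrapper name to_numeric_entities is made up for illustration. Unlike the loop above, which handles only 2- and 3-byte sequences, this conversion map also covers 4-byte characters. Like the loop above, it expects input that is already valid UTF-8.

```php
<?php
// Hypothetical wrapper: replace every non-ASCII character with a decimal entity.
function to_numeric_entities(string $str): string
{
    // Conversion map: code points U+0080..U+10FFFF, offset 0, mask 0x1FFFFF.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($str, $map, 'UTF-8');
}

$encoded = to_numeric_entities("café 中文");
// $encoded is "caf&#233; &#20013;&#25991;"
```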
