4

I have a web site that receives a CSV file by FTP once a month. For years it was an ASCII file. Now I receive UTF-8 one month, UTF-16BE the next, and UTF-16LE the month after that. Maybe I'll get UTF-32 next month. fgets returns the byte order mark (BOM) at the beginning of the UTF files. How can I get PHP to recognize the character encoding automatically? I tried mb_detect_encoding, and it returned ASCII regardless of the file type. I then changed my code to read the BOM and pass the character encoding explicitly to mb_convert_encoding. This worked until the latest file, which is UTF-16LE. For this file the first line is read correctly, but every subsequent line shows as question marks ("?"). What am I doing wrong?

$fhandle = fopen( $file_in, "r" );
if ( $fhandle === false )
    {
    echo "<p class=redbold>Error opening file $file_in.</p>";
    die();
    }

$i = 0;
while( ( $line = fgets( $fhandle ) ) !== false )
{
$i++;

// Detect encoding on first line. Actual text always begins with string "Document"
if ( $i == 1 )
    {
    $line_start = substr( $line, 0, 4 );
    $line_start_hex = bin2hex( $line_start );
    $utf16_start = 'fffe4400';
    $utf8_start = 'efbbbf44';
    if ( strcmp( $line_start, 'Docu' ) == 0 )
        { $char_encoding = 'ASCII'; }
    elseif ( strcmp( $line_start_hex, 'efbbbf44' ) == 0 )
        {
        $char_encoding = 'UTF-8';
        $line = substr( $line, 3 );
        }
    elseif ( strcmp( $line_start_hex, 'fffe4400' ) == 0 )
        {
        $char_encoding = 'UTF-16LE';
        $line = substr( $line, 2 );
        }
    elseif ( strcmp( $line_start_hex, 'feff4400' ) == 0 )
        {
        $char_encoding = 'UTF-16BE';
        $line = substr( $line, 2 );
        }
    else
        {
        echo "<p class=redbold>Error, unknown character encoding. Line =<br>", $line_start_hex, '</p>';
        require( '../footer.php' );
        die();
        }
    echo "<p>char_encoding = $char_encoding</p>";
    }

// Convert UTF
if ( $char_encoding != 'ASCII' )
    {
    $line = mb_convert_encoding( $line, 'ASCII', $char_encoding);
    }

echo '<p>'; var_dump( $line ); echo '</p>';
}

Output:

    char_encoding = UTF-16LE

string(101) "DocumentNumber,RecordedTS,Title,PageCount,City,TransTaxAccountCode,TotalTransferTax,Description,Name
"

string(83) "???????????????????????????????????????????????????????????????????????????????????"

string(88) "????????????????????????????????????????????????????????????????????????????????????????"

string(84) "????????????????????????????????????????????????????????????????????????????????????"

string(80) "????????????????????????????????????????????????????????????????????????????????"

2 Answers

5

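Pass mb_detect_encoding an explicit, ordered list of candidate encodings and enable its strict parameter. Also, please read the whole file with file_get_contents instead of fgets. fgets ends a line at the first 0x0A byte, but in UTF-16LE a newline is the two bytes 0A 00, so the trailing 00 lands at the start of the next line and every 16-bit code unit after it is misaligned. That is why only the first line decodes correctly.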

<?php
header( "Content-Type: text/html; charset=utf-8");
$input = file_get_contents( $file_in );

$encoding = mb_detect_encoding( $input, array(
    "UTF-8",
    "UTF-32",
    "UTF-32BE",
    "UTF-32LE",
    "UTF-16",
    "UTF-16BE",
    "UTF-16LE"
), TRUE );

if( $encoding !== "UTF-8" ) {
    $input = mb_convert_encoding( $input, "UTF-8", $encoding );
}
echo "<p>$encoding</p>";

// Split on any common line ending; PHP_EOL only matches the server's own convention.
foreach( preg_split( '/\r\n|\r|\n/', $input ) as $line ) {
    var_dump( $line );
}

The order matters because UTF-8 and UTF-32 are more restrictive, while UTF-16 is extremely permissive: almost any even-length sequence of bytes is valid UTF-16.
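
As a small illustration of how permissive UTF-16 is, the sketch below checks the same bytes against two encodings with mb_check_encoding. The variable names are only for this example and are not part of the code above.

```php
<?php
// Four plain ASCII bytes (an even length).
$bytes = "Docu";

// Read as UTF-16LE, the pairs 44 6F and 63 75 become the code units
// U+6F44 and U+7563, both ordinary characters, so the check passes.
$asUtf16 = mb_check_encoding($bytes, 'UTF-16LE');
var_dump($asUtf16); // bool(true)

// A UTF-16LE byte order mark is not valid UTF-8: the byte 0xFF never
// occurs in well-formed UTF-8.
$bomAsUtf8 = mb_check_encoding("\xFF\xFE", 'UTF-8');
var_dump($bomAsUtf8); // bool(false)
```

If UTF-16 came first in the candidate list, it would tend to claim text that is really ASCII or UTF-8.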

The only way to retain all the information is to convert to a Unicode encoding, not ASCII.
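
Since the files in the question start with a byte order mark, another option is to skip detection entirely and choose the encoding from the BOM. The following is only a sketch of that idea: decode_by_bom is a hypothetical helper, not something from the posts above, and it assumes that a file without a BOM is ASCII or UTF-8. It converts the whole buffer before splitting it into lines, so a newline is a single byte again by the time the split happens.

```php
<?php
/**
 * Hypothetical helper: decode a file's raw bytes to UTF-8 based on its
 * byte order mark, then split the result into lines.
 * Assumption: a missing BOM means the data is ASCII or UTF-8.
 *
 * @return array{0: string, 1: string[]} [encoding name, lines]
 */
function decode_by_bom(string $bytes): array
{
    // Longest BOMs first: the UTF-32LE BOM begins with the UTF-16LE BOM.
    // (A UTF-16LE file whose first character is NUL would be misread as
    // UTF-32LE; that edge case is ignored in this sketch.)
    $boms = [
        "\x00\x00\xFE\xFF" => 'UTF-32BE',
        "\xFF\xFE\x00\x00" => 'UTF-32LE',
        "\xEF\xBB\xBF"     => 'UTF-8',
        "\xFE\xFF"         => 'UTF-16BE',
        "\xFF\xFE"         => 'UTF-16LE',
    ];

    $encoding = 'UTF-8'; // assumed default when no BOM is present
    foreach ($boms as $bom => $name) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            $encoding = $name;
            $bytes = substr($bytes, strlen($bom)); // drop the BOM itself
            break;
        }
    }

    if ($encoding !== 'UTF-8') {
        $bytes = mb_convert_encoding($bytes, 'UTF-8', $encoding);
    }

    // Split only after conversion; strip the final line ending first so
    // the result has no empty trailing element.
    $lines = preg_split('/\r\n|\r|\n/', rtrim($bytes, "\r\n"));

    return [$encoding, $lines];
}
```

With the question's variable it could be called as [$encoding, $lines] = decode_by_bom( file_get_contents( $file_in ) ); and each element of $lines could then be passed to str_getcsv.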

1

My suggestion would be to convert everything to either UTF-8 or ASCII (I can't quite tell from the code you posted which of the two you're aiming for).

$utf8Line = iconv( mb_detect_encoding( $line ), 'UTF-8', $line );

or...

$asciiLine = iconv( mb_detect_encoding( $line ), 'ASCII', $line );

You can leverage mb_detect_encoding to do the heavy lifting for you.

7 Comments

Unfortunately mb_detect_encoding seems to return "ASCII" for some of the UTF files.
Whoops, missed that part of the question. Going back to the drawing board.
But ASCII is a subset of Unicode (the first 128 code points), so they should convert easily. Just convert to ASCII and don't use multi-byte strings. Oh, and have you thought about maybe yelling at the people supplying the FTP data?
I have tried yelling at the people supplying the file, but yelling at a county agency is like talking to a brick wall. They just do whatever they do!
By "just convert to ASCII", do you mean some technique other than the mb_convert_encoding() that I am using now?