2

I'm trying to create a function that removes all non-English characters (except spaces, dots and hyphens) from a string. For this I tried using preg_replace, but the function produces strange results.

I have a file called "example-נידדל.jpg"

Here is what I'm getting when trying to sanitize the file name:

echo preg_replace('/[^A-Za-z0-9\.]/','','example-נידדל.jpg');

The above produces: example.jpg as expected.

But when I try to pull the file name from a $_FILES array after uploading it to the server I get:

echo preg_replace('/[^A-Za-z0-9\.]/','',$_FILES['file_upload']["name"]);

The above produces example-15041497149114911500.jpg

The numbers I'm getting are in fact the decimal HTML character codes of the characters that were supposed to be removed (for example, 1504 for נ). See the following character reference: http://realdev1.realise.com/rossa/phoneme/listCharactors.asp?start=1488&stop=1785&rows=297&page=1

I can't figure out why preg_replace doesn't work with file names.

Can anyone help?

Thanks,

Roy

  • What's the character set on the page and on the form? Commented Jun 18, 2011 at 18:01
  • Just echo $_FILES['file_upload']["name"] and see the results. Commented Jun 18, 2011 at 18:07
  • I've renamed an image to the name above and tried uploading it. Regardless of whether or not I specify an accept-charset for the form or add a charset meta tag, I always got it to return example.jpg. Are you sure the file you're uploading isn't in fact named example-15041497149114911500.jpg before you upload it? Commented Jun 18, 2011 at 18:18
  • I checked now, and the numbers I'm getting correspond to the respective HTML character codes that were supposed to be replaced: realdev1.realise.com/rossa/phoneme/… Commented Jun 18, 2011 at 20:35
  • @Roy Peleg - Please see if my answer solves your problem. Commented Jun 18, 2011 at 20:45

2 Answers

2

What about using mb_convert_encoding to convert the HTML entities back into UTF-8 before the preg_replace?

echo preg_replace('/[^A-Za-z0-9\.]/', '', mb_convert_encoding($_FILES['file_upload']["name"], 'UTF-8', 'HTML-ENTITIES'));
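To see why the digits survive and why decoding first helps, here is a minimal sketch. It assumes the browser submitted the Hebrew letters as numeric HTML entities such as &#1504; (which is what the reported output suggests). The helper name sanitize_uploaded_name is hypothetical, and the pattern additionally allows hyphens, as the question's stated goal asks. It uses html_entity_decode instead of mb_convert_encoding because the 'HTML-ENTITIES' pseudo-encoding of mbstring is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper (not from the original post): decode HTML entities,
// then drop everything except ASCII letters, digits, dots and hyphens.
function sanitize_uploaded_name(string $name): string
{
    // Turn "&#1504;"-style references back into real UTF-8 characters.
    $decoded = html_entity_decode($name, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Without the /u flag the pattern works byte by byte, which is fine here:
    // every byte of a multibyte UTF-8 character is outside the allowed set.
    return preg_replace('/[^A-Za-z0-9.\-]/', '', $decoded) ?? '';
}

// Assumed input: the file name as it might arrive in $_FILES['file_upload']['name'].
$raw = 'example-&#1504;&#1497;&#1491;&#1491;&#1500;.jpg';

// Filtering directly removes only "&", "#" and ";", so the digits remain.
echo preg_replace('/[^A-Za-z0-9.\-]/', '', $raw), "\n"; // example-15041497149114911500.jpg

// Decoding first lets the filter remove the Hebrew characters themselves.
echo sanitize_uploaded_name($raw), "\n"; // example-.jpg
```

If the trailing hyphen in the result is unwanted, it could be trimmed from the base name in a further step; that is left out here to keep the sketch short.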

1

I would use a combination of regular expressions and iconv to transliterate it.

Update: Before transliterating/filtering, the filename may need to be URL-decoded:

$path = urldecode($path); // convert triplets to bytes.
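As a small illustration of what that decoding step does, the sketch below uses an assumed percent-encoded name (the sample string and the variable names are made up for this example). Note that urldecode also turns "+" into a space, while rawurldecode leaves "+" alone, which may matter for file names that legitimately contain a plus sign.

```php
<?php
// Assumed sample input: the Hebrew letter nun (U+05E0) percent-encoded as UTF-8 bytes.
$encoded = 'example-%D7%A0+copy.jpg';

// urldecode converts the %XX triplets to bytes and "+" to a space.
$viaUrldecode = urldecode($encoded);

// rawurldecode converts the %XX triplets to bytes but keeps "+" as is.
$viaRawurldecode = rawurldecode($encoded);

echo $viaUrldecode, "\n";
echo $viaRawurldecode, "\n";
```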

Here is a code example from here that does something very similar to your question:

function pathauto_cleanstring($string)
{
    $url = $string;
    $url = preg_replace('~[^\\pL0-9_]+~u', '-', $url); // substitutes anything but letters, numbers and '_' with separator
    $url = trim($url, "-");
    $url = iconv("utf-8", "us-ascii//TRANSLIT", $url); // TRANSLIT does the whole job
    $url = strtolower($url);
    $url = preg_replace('~[^-a-z0-9_]+~', '', $url); // keep only letters, numbers, '_' and separator
    return $url;
}

It expects your input to be UTF-8 encoded.

Reference

2 Comments

Tried it, still the same result :-(
@Roy Peleg: I suspect the filename is URL-encoded, so it needs to be URL-decoded first. I'll add another piece of code; maybe it helps.
