27

I need help sorting words that contain UTF-8 characters. For example, we have 5 cities from Belgium.

$array = array('Borgloon','Thuin','Lennik','Éghezée','Aubel');
sort($array); // Expected: Aubel, Borgloon, Éghezée, Lennik, Thuin
              // Actual: Aubel, Borgloon, Lennik, Thuin, Éghezée

The city Éghezée should be third. Is it possible to set some kind of UTF-8-aware collation, or to define my own character order?

3
  • I just wanted to point out for future reference that natcasesort doesn't work out of the box: codepad.org/QgdF5DUY Commented Oct 28, 2011 at 13:25
  • Looks like there was similar question before: stackoverflow.com/questions/120334/… Commented Oct 28, 2011 at 13:33
  • Added a comment to reduce confusion as to what you're looking for versus what you get. Commented Oct 28, 2011 at 14:18

7 Answers

50

The intl extension has been bundled with PHP since PHP 5.3, and it only supports UTF-8.

You can use a Collator in this case:

$array = array('Borgloon','Thuin','Lennik','Éghezée','Aubel');
$collator = new Collator('en_US');
$collator->sort($array);
print_r($array);

Output:

Array
(
    [0] => Aubel
    [1] => Borgloon
    [2] => Éghezée
    [3] => Lennik
    [4] => Thuin
)
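Reverse order and key-preserving sorts can be handled with the same Collator. The following is only a sketch, assuming the intl extension is loaded; the variable names $descending and $keyed are made up for illustration. It swaps the arguments to Collator::compare() for descending order and uses Collator::asort() to keep the original keys:

```php
<?php
// Requires the intl extension.
$array = array('Borgloon', 'Thuin', 'Lennik', 'Éghezée', 'Aubel');
$collator = new Collator('en_US');

// Descending order: swap the arguments passed to Collator::compare().
$descending = $array;
usort($descending, function ($a, $b) use ($collator) {
    return $collator->compare($b, $a);
});

// Ascending order while keeping the original keys attached to the values.
$keyed = $array;
$collator->asort($keyed);

print_r($descending);
print_r($keyed);
```

Swapping the arguments avoids a second pass with array_reverse, but either approach should give the same order here.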

4 Comments

How can I sort it in reverse order?
Just use the array_reverse function on the result: php.net/manual/fr/function.array-reverse.php
Sort by key: uksort($array, static fn($a, $b) => $collator->compare($a, $b));
@Radek the callback for uksort() can be written as an array containing the class object then the method name. How can I sort an array by accented keys in PHP? ... if you need DESC directions, then $a and $b are useful.
12

I think you can use strcoll:

setlocale(LC_COLLATE, 'nl_BE.utf8');
$array = array('Borgloon','Thuin','Lennik','Éghezée','Aubel');
usort($array, 'strcoll'); 
print_r($array);

Result:

Array
(
    [0] => Aubel
    [1] => Borgloon
    [2] => Éghezée
    [3] => Lennik
    [4] => Thuin
)

You need the nl_BE.utf8 locale on your system:

fy@Heisenberg:~$ locale -a | grep nl_BE.utf8
nl_BE.utf8

If you are using Debian, you can run dpkg-reconfigure locales to add locales.
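Because the locale may not be installed, it can help to check what setlocale() returns before relying on strcoll. A minimal sketch, assuming a hypothetical helper named sortWithLocale (not part of PHP) that falls back to plain byte-order sorting when none of the candidate locales is available:

```php
<?php
// Hypothetical helper: sort with strcoll() under one of the given locales,
// then restore the previous LC_COLLATE setting.
function sortWithLocale(array $values, array $candidates)
{
    // Passing "0" only queries the current setting without changing it.
    $previous = setlocale(LC_COLLATE, '0');

    // setlocale() tries each candidate in turn and returns false if none is installed.
    $active = setlocale(LC_COLLATE, $candidates);
    if ($active === false) {
        sort($values); // fallback: plain byte-order sort
        return $values;
    }

    usort($values, 'strcoll');
    setlocale(LC_COLLATE, $previous);
    return $values;
}

$sorted = sortWithLocale(
    array('Borgloon', 'Thuin', 'Lennik', 'Éghezée', 'Aubel'),
    array('nl_BE.utf8', 'nl_BE.UTF-8')
);
print_r($sorted);
```

Restoring the previous locale keeps the change local to the helper, but it does not make setlocale() safe in multithreaded setups.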

3 Comments

Thai's solution for PHP 5.3 seems clean too
strcoll doesn't work on Windows with UTF-8, due to the buggy CRT implementation.
Note that setlocale is not thread safe, so setting it back and forth might involve some risk of bad results.
8

This script should solve it with a custom character order. I hope it helps. Note the mb_strtolower function: you need it to make the comparison case-insensitive. I didn't use strtolower because it does not work well with special characters.

<?php

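function customSort($a, $b) {
    // Custom character order; extend it with any other characters you need.
    static $charOrder = array('a', 'b', 'c', 'd', 'e', 'é',
                              'f', 'g', 'h', 'i', 'j',
                              'k', 'l', 'm', 'n', 'o',
                              'p', 'q', 'r', 's', 't',
                              'u', 'v', 'w', 'x', 'y', 'z');

    $a = mb_strtolower($a);
    $b = mb_strtolower($b);
    // Compute the lengths once instead of on every iteration.
    $lenA = mb_strlen($a);
    $lenB = mb_strlen($b);

    for ($i = 0; $i < $lenA && $i < $lenB; $i++) {
        $chA = mb_substr($a, $i, 1);
        $chB = mb_substr($b, $i, 1);
        if ($chA === $chB) continue;
        // array_search() returns false for characters missing from $charOrder;
        // map that to -1 so unknown characters sort before 'a'.
        $valA = array_search($chA, $charOrder, true);
        $valB = array_search($chB, $charOrder, true);
        $valA = ($valA === false) ? -1 : $valA;
        $valB = ($valB === false) ? -1 : $valB;
        if ($valA === $valB) continue; // two different unknown characters
        return ($valA > $valB) ? 1 : -1;
    }

    // All shared characters are equal: the shorter string sorts first.
    if ($lenA == $lenB) return 0;
    return ($lenA > $lenB) ? 1 : -1;

}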
$array = array('Borgloon','Thuin','Lennik','Éghezée','Aubel');
usort($array, 'customSort');

EDIT: Sorry, I made many mistakes in the previous code. It is now tested.

EDIT {2}: Everything with multibyte functions.

4 Comments

Unfortunately this won't work, as $a[$i] will return a single byte from the string, not a single char.
Before, yes, you were right. I Changed the algorithm a few minutes ago. Using str_split will work.
str_split doesn't handle multibyte strings either. :) See php.net/manual/en/function.mb-split.php#99851
don't run the strlen function that often, you only need to run them once upfront and you can already obtain the min value of both.
7

If you want a native solution, I can propose this one:

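function compare($a, $b)
{
        $alphabet = 'aąbcćdeęfghijklłmnńoópqrstuvwxyzźż'; // I used Polish letters
        $a = mb_strtolower($a);
        $b = mb_strtolower($b);
        $lenA = mb_strlen($a);
        $lenB = mb_strlen($b);

        for ($i = 0; $i < $lenA && $i < $lenB; $i++) {
            $chA = mb_substr($a, $i, 1);
            $chB = mb_substr($b, $i, 1);
            if ($chA == $chB) {
                continue;
            }
            // mb_strpos() returns false for characters missing from $alphabet;
            // map that to -1 so they sort before 'a'.
            $posA = mb_strpos($alphabet, $chA);
            $posB = mb_strpos($alphabet, $chB);
            $posA = ($posA === false) ? -1 : $posA;
            $posB = ($posB === false) ? -1 : $posB;
            if ($posA == $posB) {
                continue;
            }
            return ($posA > $posB) ? 1 : -1;
        }

        // One string is a prefix of the other (or they are equal):
        // the shorter one sorts first.
        if ($lenA == $lenB) {
            return 0;
        }
        return ($lenA > $lenB) ? 1 : -1;
}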

usort($needed_array, 'compare');

I'm not sure that it is the best solution, but it works for me =)

2 Comments

A small update for PHP 7 and the new "spaceship" operator: you can use <=> to return 1 or -1 in the last condition.
You are missing the ś character, so it should be: $alphabet = 'aąbcćdeęfghijklłmnńoópqrsśtuvwxyzźż'; And if you want to keep array keys, just use the uasort function.
2

As for strcoll I guess it was a nice idea, but doesn't seem to work:

<?php

// Some 
$strings = array('Alpha', 'Älpha', 'Bravo');
// make it German: A, Ä, B
setlocale(LC_COLLATE, 'de_DE.UTF8', 'de.UTF8', 'de_DE.UTF-8', 'de.UTF-8');
usort($strings, 'strcoll');
var_dump($strings);
// as you can see, Ä is last, so this didn't work

A while back I wrote a UTF-8 to ASCII tool that would convert "älph#bla" to "aelph-bla". You could use this to "normalize" your input to make it sortable. It's basically a replacement similar to what @Nick said.

You should use a separate array for sorting, as calling urlify() in a usort() callback would waste a lot of resources. Try:

<?php
// data to sort
$array = array('Borgloon','Thuin','Lennik','Éghezée','Aubel');
// container for modified strings
$_array = array();
foreach ($array as $k => $v) {
    // "normalize" utf8 to ascii
    $_array[$k] = urlify($v);
}
// sort the ASCII stuff (while preserving indexes)
asort($_array);
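foreach ($_array as $key => &$v) {
    // copy the original value of the ASCIIfied element
    // (use $key from this loop, not the stale $k from the loop above)
    $v = $array[$key];
}
unset($v); // break the reference left over from the loop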
var_dump($_array);

If you have PHP5.3 or the intl PECL compiled, try @Thai's solution, seems sweet!

2

There are great answers here, but this is a dead simple solution for most situations.

// $array is taken by reference because usort() sorts in place and only returns a bool.
function globalsort(&$array, $in = 'UTF-8', $out = 'ASCII//TRANSLIT//IGNORE')
{
    return usort($array, function ($a, $b) use ($in, $out) {
        $a = @iconv($in, $out, $a);
        $b = @iconv($in, $out, $b);
        return strnatcasecmp($a, $b);
    });
}

And use it like so:

globalsort($array);

1 Comment

This does not work for me with letters like Ö, which should be sorted after O
1

I'd be tempted to loop through the array and convert the accented characters to their plain ASCII equivalents before sorting, e.g.:

<?php
  $array = array('Borgloon','Thuin','Lennik','Éghezée','Aubel');

  setlocale(LC_CTYPE, 'nl_BE.utf8');

  $newarray = array();
  foreach($array as $k => $v) {
    $newarray[$k] = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $v);
  }

  sort($newarray);
  print_r($newarray);
?>

It's probably not the best in terms of processing speed or resources used, but it does make the code easier to understand.

Edit:

Thinking about it now, you might be better using some kind of lookup table, something like this:

<?php
  $accentedCharacters = array ( 'à', 'á', 'â', 'ã', 'ä', 'å', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'î', 'ï', 'ñ', 'ò', 'ó', 'ô', 'õ', 'ö', 'ø', 'ù', 'ú', 'û', 'ü', 'ý', 'ÿ', 'Š', 'Ž', 'š', 'ž', 'Ÿ', 'À', 'Á', 'Â', 'Ã', 'Ä', 'Å', 'Ç', 'È', 'É', 'Ê', 'Ë', 'Ì', 'Í', 'Î', 'Ï', 'Ñ', 'Ò', 'Ó', 'Ô', 'Õ', 'Ö', 'Ø', 'Ù', 'Ú', 'Û', 'Ü', 'Ý' ); 

  $replacementCharacters = array ( 'a', 'a', 'a', 'a', 'a', 'a', 'c', 'e', 'e', 'e', 'e', 'i', 'i', 'i', 'i', 'n', 'o', 'o', 'o', 'o', 'o', 'o', 'u', 'u', 'u', 'u', 'y', 'y', 'S', 'Z', 's', 'z', 'Y', 'A', 'A', 'A', 'A', 'A', 'A', 'C', 'E', 'E', 'E', 'E', 'I', 'I', 'I', 'I', 'N', 'O', 'O', 'O', 'O', 'O', 'O', 'U', 'U', 'U', 'U', 'Y' );

  $array = array('Borgloon','Thuin','Lennik','Éghezée','Aubel');

  $newarray = array();
  foreach($array as $k => $v) {
    $newarray[$k] = str_replace($accentedCharacters,$replacementCharacters,$v);
  }

  sort($newarray);
  print_r($newarray);
?>
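Both snippets above print the converted names, so the accents are lost in the output. If the original spellings should survive, one option is to sort on the converted strings and then read the originals back in that order. This is only a sketch: the helper name sortByAsciiKey and the shortened lookup table are assumptions made for illustration.

```php
<?php
// Shortened lookup table for illustration; extend it as needed.
$accented    = array('É', 'é', 'è', 'ë');
$replacement = array('E', 'e', 'e', 'e');

// Hypothetical helper: sort by an accent-stripped, lowercased key,
// but return the original strings.
function sortByAsciiKey(array $values, array $accented, array $replacement)
{
    $keys = array();
    foreach ($values as $i => $value) {
        $keys[$i] = strtolower(str_replace($accented, $replacement, $value));
    }

    // asort() orders the keys while keeping each one tied to its original index.
    asort($keys);

    $sorted = array();
    foreach (array_keys($keys) as $i) {
        $sorted[] = $values[$i];
    }
    return $sorted;
}

$cities = array('Borgloon', 'Thuin', 'Lennik', 'Éghezée', 'Aubel');
$sorted = sortByAsciiKey($cities, $accented, $replacement);
print_r($sorted);
```

Each string is converted only once, rather than on every comparison inside a usort() callback.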

2 Comments

Why do you propose nl_BE? (Dutch as spoken/written in Belgium)
Honestly, it was the first locale that came to mind that would work given that dataset. Thinking about it now, he might be better using a conversion lookup table instead if the dataset is going to use other abnormal characters.
