I have a Delphi 6 program (single-byte characters) that sorts strings in a TStringList using the default case-insensitive AnsiCompareText function, which in turn calls the CompareStringA function in the Windows kernel32.dll. (The regional settings are Hungarian.)

I'd like to do the same sorting in a PostgreSQL database on a Kubuntu system (linux-image-3.2.0-65-generic-pae, 32-bit x86, KDE 4.8.5). The database is created by:

  CREATE DATABASE <...>
  WITH OWNER = postgres
       ENCODING = 'UTF8'
       TABLESPACE = pg_default
       LC_COLLATE = 'hu_HU.UTF-8'
       LC_CTYPE = 'hu_HU.UTF-8'
       CONNECTION LIMIT = -1;

If I sort by C or POSIX, the accented characters are not sorted into their alphabetical positions. If I sort by the default collation, spaces and some special characters are ignored, which is a problem when they occur at the beginning of the string. (Specifying the collation has been easy since PostgreSQL 9.1: see http://www.postgresql.org/docs/9.3/static/collation.html.)
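To make the difference concrete, here is a small sketch that can be run in the database created above. The sample values are made up for illustration, and the order returned by the second query depends on the locale data installed on the system, so no expected output is given.

```sql
-- Hypothetical sample values, for illustration only.
-- Byte order: space and '@' sort by their code, but 'á' lands after 'z'.
SELECT s
FROM (VALUES ('ab'), ('a b'), ('áb'), ('az'), ('@ab'), (' zz')) AS t(s)
ORDER BY s COLLATE "C";

-- Database default collation (hu_HU.UTF-8 in this setup): accented letters
-- are placed alphabetically, but spaces and punctuation carry little weight.
SELECT s
FROM (VALUES ('ab'), ('a b'), ('áb'), ('az'), ('@ab'), (' zz')) AS t(s)
ORDER BY s;
```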

Several questions have been asked on this topic, e.g. "PostgreSQL Sort". The answer there can't be generalized: it deals with the '@' in the first character position only.

My question is perhaps a duplicate of "Is there any way to have PostgreSQL not collapse punctuation and spaces when collating using a language?" The answer there points to the PostgreSQL TODO list: http://wiki.postgresql.org/wiki/Todo:ICU Has anything changed since then?

What I want is a collation which keeps spaces and special characters in their ASCII position, and sorts accented characters alphabetically - exactly as in Windows.

Do I have to write a custom locale (and if so, how)? Or a custom comparison function, perhaps written in Delphi (how do I add it to PostgreSQL)? Another option would be translating special characters to hexadecimal, for example, but then they would be sorted in among the ordinary text characters. Translating ALL characters to hexadecimal (and mapping case and accent differences to the same code) seems terrible: it would mean writing the complete collation myself. I'm sure there should be a solution for this.

  • What exactly is your problem? Only case-insensitive ordering? Commented Jun 28, 2014 at 23:57
  • No, case-insensitive ordering would be easy with ORDER BY lower(myCol). Commented Jun 30, 2014 at 15:10
  • The problem is that PostgreSQL ignores (almost?) all punctuation marks in the strings when it sorts. That might be useful in some cases, but there is no way to disable it. Several other similar questions (stackoverflow.com/questions/22534484/…, stackoverflow.com/questions/737447/…, postgresql.1045698.n5.nabble.com/…) show that this irritates a number of other developers too. Nick Barnes's solution is probably correct, but I didn't put in the effort to develop it. Commented Jun 30, 2014 at 15:29

2 Answers

Unless you can change your database's encoding/collation to match your Windows system, I think adding some custom comparison code might be your only option.

If ICU's sort order (as described in the question you linked) is what you're after, then take a look at pg_collkey (a Postgres ICU wrapper). With this installed, it should just be a matter of replacing ORDER BY foo with ORDER BY collkey(foo,'hu_HU') (and likewise for any explicit > / < comparisons, and in any indexes these comparisons rely on).
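As a rough sketch of what that replacement could look like: this assumes pg_collkey is installed and provides the two-argument collkey(text, text) form used above. The table name items is hypothetical, and the expression index only works if collkey is declared IMMUTABLE in your installation, which is worth checking first.

```sql
-- Hypothetical table name (items); foo is the column from the text above.
-- Sort by the ICU collation key instead of the database collation.
SELECT foo
FROM items
ORDER BY collkey(foo, 'hu_HU');

-- Expression index so the ORDER BY above does not need a full sort.
-- Assumption: collkey is declared IMMUTABLE, as index expressions require.
CREATE INDEX items_foo_collkey_idx
    ON items (collkey(foo, 'hu_HU'));

-- Explicit comparisons must use the same expression to stay consistent
-- with the sort order and to be able to use the index.
SELECT foo
FROM items
WHERE collkey(foo, 'hu_HU') > collkey('m', 'hu_HU');
```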

If you want this to work invisibly (i.e. if you want to change the behaviour of ORDER BY foo), I think that would mean building a custom type, with its own supporting functions and operator classes. The citext (case-insensitive text) extension included with Postgres would serve as a useful starting point, but there's a lot to consider here, and it will likely be far from straightforward.

Here is my solution, though it is not really an answer to the question: it does not use any collation, the result is not equivalent to Delphi's sorting, and it is PHP code, not PostgreSQL. However, the idea might help others port it to PostgreSQL or any other language.

include 'portable-utf8.php';

$cCharTab = array(
     124 => '00',   // | (field separator)
      32 => '01',   // space
      43 => '11',   // +
      45 => '12',   // -
      47 => '14',   // /
      92 => '15',   // \
      61 => '17',   // =
    9658 => '19',   // ►
      34 => '22',   // "
      39 => '27',   // '
      40 => '28',   // (
      41 => '29',   // )
      42 => '2A',   // *
      46 => '2E',   // .

      48 => '30',   // 0
      49 => '31',   // 1
      50 => '32',   // 2
      51 => '33',   // 3
      52 => '34',   // 4
      53 => '35',   // 5
      54 => '36',   // 6
      55 => '37',   // 7
      56 => '38',   // 8
      57 => '39',   // 9

     164 => '64',   // ¤
      44 => '71',   // ,
      59 => '72',   // ;
     247 => '73',   // ÷
      58 => '73',   // :
      33 => '74',   // !
      36 => '75',   // $
      63 => '75',   // ?
      95 => '95',   // _

      65 => 'a0',   // A
      66 => 'b0',   // B
      67 => 'c0',   // C
      68 => 'd0',   // D
      69 => 'e0',   // E
      70 => 'f0',   // F
      71 => 'g0',   // G
      72 => 'h0',   // H
      73 => 'i0',   // I
      74 => 'j0',   // J
      75 => 'k0',   // K
      76 => 'l0',   // L
      77 => 'm0',   // M
      78 => 'n0',   // N
      79 => 'o0',   // O
      80 => 'p0',   // P
      81 => 'q0',   // Q
      82 => 'r0',   // R
      83 => 's0',   // S
      84 => 't0',   // T
      85 => 'u0',   // U
      86 => 'v0',   // V
      87 => 'w0',   // W
      88 => 'x0',   // X
      89 => 'y0',   // Y
      90 => 'z0',   // Z

      97 => 'a0',   // a
      98 => 'b0',   // b
      99 => 'c0',   // c
     100 => 'd0',   // d
     101 => 'e0',   // e
     102 => 'f0',   // f
     103 => 'g0',   // g
     104 => 'h0',   // h
     105 => 'i0',   // i
     106 => 'j0',   // j
     107 => 'k0',   // k
     108 => 'l0',   // l
     109 => 'm0',   // m
     110 => 'n0',   // n
     111 => 'o0',   // o
     112 => 'p0',   // p
     113 => 'q0',   // q
     114 => 'r0',   // r
     115 => 's0',   // s
     116 => 't0',   // t
     117 => 'u0',   // u
     118 => 'v0',   // v
     119 => 'w0',   // w
     120 => 'x0',   // x
     121 => 'y0',   // y
     122 => 'z0',   // z

     193 => 'a0',   // Á
     196 => 'a0',   // Ä
     201 => 'e0',   // É
     205 => 'i0',   // Í
     211 => 'o0',   // Ó
     214 => 'o1',   // Ö
     218 => 'u0',   // Ú
     220 => 'u1',   // Ü
     225 => 'a0',   // á
     228 => 'a0',   // ä
     231 => 'c0',   // ç
     233 => 'e0',   // é
     235 => 'e0',   // ë
     237 => 'i0',   // í
     243 => 'o0',   // ó
     246 => 'o1',   // ö
     250 => 'u0',   // ú
     252 => 'u1',   // ü
     253 => 'y0',   // ý
     263 => 'c0',   // ć
     268 => 'c0',   // Č
     269 => 'c0',   // č
     281 => 'e0',   // ę
     322 => 'l0',   // ł
     324 => 'n0',   // ń
     336 => 'o1',   // Ő
     337 => 'o1',   // ő
     345 => 'r0',   // ř
     353 => 's0',   // š
     367 => 'u0',   // ů
     368 => 'u1',   // Ű
     369 => 'u1',   // ű
     380 => 'z0'    // ż
);

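// Sorter: convert $a_str to a locale-independent sortable string.
function Sorter( $a_str )
{
    $ct = $GLOBALS['cCharTab'];
    $result = '';
    // Split the UTF-8 string into single characters.
    $arr = preg_split('//u', $a_str, -1, PREG_SPLIT_NO_EMPTY);

    foreach ($arr as $c) {
        $code = utf8_ord($c);
        // Characters missing from $cCharTab are skipped explicitly,
        // instead of triggering an undefined-index notice.
        if (isset($ct[$code]))
            $result .= $ct[$code];
    }

    return $result;
}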

The Sorter function replaces each character of the values to be sorted with a two-character alphanumeric string, which is not affected by any locale. I have a separate column (f_sorter) for this in the table, filled by the INSERT statement from my PHP script which writes the table. (I have no UPDATEs, and I need only one ORDER BY in the application.)

It's something like this:

pg_query_params( $my_pg_connection, $sql, $params );

where

$sql = 'INSERT INTO my_table(f1, f2, f3, f_sorter)
        VALUES ($1, $2, $3, $4)';

and

$params = array( $f1, $f2, $f3, Sorter( $f1 . '|' . $f2 . '|' . $f3 ) );

(Insert and update triggers and a server-side function would be more elegant.)

So

SELECT ...
ORDER BY f_sorter

gives the desired result of

SELECT ...
ORDER BY f1, f2, f3

with my "collation".
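The server-side function and triggers mentioned earlier could look roughly like the sketch below. I have not run it against my data, and it rests on several assumptions: sort_map is a hypothetical lookup table holding the same pairs as $cCharTab (only a few rows are shown), the function and trigger names are made up, and the database encoding is UTF8 so that ascii() returns the Unicode code point.

```sql
-- Hypothetical lookup table with the same content as $cCharTab.
CREATE TABLE sort_map (
    codepoint integer PRIMARY KEY,
    sort_code text NOT NULL
);

-- A few sample rows; the remaining rows would be copied from $cCharTab.
INSERT INTO sort_map (codepoint, sort_code) VALUES
    (124, '00'), (32, '01'),
    (65, 'a0'), (97, 'a0'), (193, 'a0'), (225, 'a0'),
    (66, 'b0'), (98, 'b0');

-- Server-side counterpart of the PHP Sorter function.
-- Characters without a row in sort_map are dropped, because
-- string_agg ignores NULL values.
CREATE FUNCTION sorter(text) RETURNS text AS $$
    SELECT coalesce(string_agg(m.sort_code, '' ORDER BY g.pos), '')
    FROM generate_series(1, length($1)) AS g(pos)
    LEFT JOIN sort_map AS m
           ON m.codepoint = ascii(substr($1, g.pos, 1));
$$ LANGUAGE sql STABLE;

-- Trigger function that keeps f_sorter in sync with f1, f2 and f3.
CREATE FUNCTION my_table_set_sorter() RETURNS trigger AS $$
BEGIN
    NEW.f_sorter := sorter(coalesce(NEW.f1, '') || '|' ||
                           coalesce(NEW.f2, '') || '|' ||
                           coalesce(NEW.f3, ''));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER my_table_sorter
    BEFORE INSERT OR UPDATE ON my_table
    FOR EACH ROW EXECUTE PROCEDURE my_table_set_sorter();
```

With this in place the INSERT no longer has to pass f_sorter from PHP. Keeping the mapping in a table rather than in code also means the ordering can be tuned with plain UPDATE statements, although existing f_sorter values would then have to be recomputed.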

I use the '|' character as a field separator. It is sorted before any other character. As a result, shorter strings come before longer ones with the same prefix. (This is the opposite of the Delphi result, but I like it.)
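The effect of the separator can be checked directly on two hand-built keys. The field values here are hypothetical, and the keys are written out according to the table above ('|' becomes '00', letters become 'a0', 'b0' and so on); COLLATE "C" is used only to make the comparison independent of the database locale.

```sql
-- Keys built by hand from the mapping above, for illustration only.
SELECT sort_key
FROM (VALUES ('a0b0c000x0'),   -- f1 = 'abc', f2 = 'x'
             ('a0b000x0'))     -- f1 = 'ab',  f2 = 'x'
     AS t(sort_key)
ORDER BY sort_key COLLATE "C";
-- The 'ab' row comes first: after the common prefix 'a0b0', the separator
-- code '00' is compared with 'c0', and '0' sorts before 'c'.
```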

The $cCharTab array contains about 120 characters that were important to me. Feel free to fine-tune the list, change the ordering, or change the field separator to TAB, for example.

portable-utf8 is a very useful library for handling UTF-8 strings in PHP 5. Download from http://pageconfig.com/post/portable-utf8
