(Not a duplicate of 4079956)

I have an SQL_ASCII database, with LC_CTYPE=LC_COLLATE="C", which contains mostly ASCII data as well as some non-ASCII characters from some codepage, say LATIN1.

I want to transcode, in-place (no pg_dump/pg_restore), all non-ASCII codepoints from the LATIN1 codepage to UTF-8, then alter the database encoding to UTF-8, e.g.:

-- change encoding first, transcode data after
UPDATE pg_database SET encoding=pg_char_to_encoding('UTF8')
  WHERE datname='sqlasciidb';
UPDATE tbl SET str=convert_from(str::bytea, 'LATIN1')
  WHERE str::bytea<>convert_from(str::bytea, 'LATIN1')::bytea;

or

-- transcode data first, change encoding after
CREATE DOMAIN my_varlena AS bytea;
CREATE CAST (my_varlena AS text) WITHOUT FUNCTION;
UPDATE tbl SET str=convert(str::bytea, 'LATIN1','UTF8')::my_varlena::text
  WHERE str::bytea<>convert(str::bytea, 'LATIN1', 'UTF8');
DROP DOMAIN my_varlena CASCADE;
UPDATE pg_database SET encoding=pg_char_to_encoding('UTF8')
  WHERE datname='sqlasciidb';
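
For illustration, the byte-level effect that both variants aim for can be sketched outside the database. This is a minimal Python sketch, not PostgreSQL code: the helper names `latin1_to_utf8` and `needs_update` are hypothetical, and it assumes Python's `latin-1` codec matches what PostgreSQL calls LATIN1 (ISO 8859-1). It shows that pure-ASCII values come out byte-for-byte identical, which is why the WHERE clauses above only touch rows containing non-ASCII bytes.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Interpret raw bytes as LATIN1 and re-encode the text as UTF-8."""
    return raw.decode("latin-1").encode("utf-8")


def needs_update(raw: bytes) -> bool:
    """Mirror of the WHERE clause: true only if transcoding changes the bytes."""
    return raw != latin1_to_utf8(raw)


if __name__ == "__main__":
    # Sample values: one pure-ASCII string, two containing LATIN1 bytes.
    rows = [b"plain ascii", b"caf\xe9", b"na\xefve"]
    for raw in rows:
        print(raw, needs_update(raw), latin1_to_utf8(raw))
```

Every byte below 0x80 maps to itself, so only the second and third sample values would be rewritten.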

What, if anything, is wrong with the above approach?

Some problems I can see:

  • after pg_database is updated, all connections to the database should be closed and reopened for the backend to take into account the new encoding
  • all indexes based on the altered columns should be rebuilt

Anything else?

1 Answer

Looks like you've got the main gist of it. I assume you've already tried this with a test database? I gave it a quick test when suggesting it to someone and it seemed to work OK for me, although that was far from a thorough test.

My gut feel is to transcode first and change the encoding after, because while the database is still in SQL_ASCII you won't have to deal with errors from PostgreSQL trying to interpret not-yet-transcoded or improperly-transcoded data, and you can look at the data with relative impunity. On the other hand, changing the encoding first guarantees that every backend that connects afterwards will write data in UTF8 (backends that were already connected will not pick up the change)...
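
To make the two failure modes concrete, here is a rough Python sketch (again not PostgreSQL itself; `is_valid_utf8` is a hypothetical helper). It assumes a UTF8 database rejects byte sequences that are not valid UTF-8, and it shows both cases: untranscoded LATIN1 bytes fail validation, while data transcoded twice passes validation but is silently wrong.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "café" stored as LATIN1 bytes, i.e. not yet transcoded.
latin1_bytes = "caf\u00e9".encode("latin-1")

# Transcoded once (correct) and twice (improperly transcoded).
once = latin1_bytes.decode("latin-1").encode("utf-8")
twice = once.decode("latin-1").encode("utf-8")

if __name__ == "__main__":
    print(is_valid_utf8(latin1_bytes))  # not-yet-transcoded data
    print(is_valid_utf8(once), once.decode("utf-8"))
    print(is_valid_utf8(twice), twice.decode("utf-8"))
```

The double-transcoded value is the nastier case: nothing errors out, so it is worth making sure the UPDATE cannot be run twice over the same rows.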

Also check things like function bodies, view definitions, constraint definitions etc., which may need transcoding too (you'd hope not, but...).

1 Comment

Turns out casting via domains is not truly supported. Awarding anyway. :)