Converting a PostgreSQL database in place from SQL_ASCII to UTF8

Question

(Not a duplicate of 4079956)

I have an SQL_ASCII database with LC_CTYPE=LC_COLLATE="C". It contains mostly ASCII data, plus some non-ASCII characters from some codepage, say LATIN1.

I want to transcode in place (no pg_dump/pg_restore) all non-ASCII characters from the LATIN1 codepage to UTF-8, then alter the database encoding to UTF-8, e.g.:

-- change encoding first, transcode data after
UPDATE pg_database SET encoding=pg_char_to_encoding('UTF8')
  WHERE datname='sqlasciidb';
UPDATE tbl SET str=convert_from(str::bytea, 'LATIN1')
  WHERE str::bytea<>convert_from(str::bytea, 'LATIN1')::bytea;

or

-- transcode data first, change encoding after
CREATE DOMAIN my_varlena AS bytea;
CREATE CAST (my_varlena AS text) WITHOUT FUNCTION;
UPDATE tbl SET str=convert(str::bytea, 'LATIN1','UTF8')::my_varlena::text
  WHERE str::bytea<>convert(str::bytea, 'LATIN1', 'UTF8');
DROP DOMAIN my_varlena CASCADE;
UPDATE pg_database SET encoding=pg_char_to_encoding('UTF8')
  WHERE datname='sqlasciidb';

What, if anything, is wrong with the above approach?

Some problems I can see:

- After pg_database is updated, all connections to the database should be closed and reopened so that the backends pick up the new encoding.
- All indexes based on the altered columns should be rebuilt.
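
A minimal sketch of those two follow-up steps. It assumes PostgreSQL 9.2 or later, where pg_stat_activity exposes the backend PID as pid (older releases name the column procpid), and a superuser session connected to the database being converted. The names sqlasciidb and tbl are the placeholders used above.

```sql
-- Disconnect every other session so that new connections see the new encoding.
-- Assumption: PostgreSQL 9.2+ ("pid"); use "procpid" on older releases.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'sqlasciidb'
  AND pid <> pg_backend_pid();

-- Rebuild all indexes on the table whose columns were transcoded.
REINDEX TABLE tbl;
```

The current session is excluded from the first statement and still has to reconnect afterwards.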

Anything else?

araqnid · Accepted Answer · 2011-03-10 21:35:38Z

Looks like you've got the main gist of it. I assume you've already tried this with a test database? I did give it a quick test when suggesting it to someone and it seemed to work ok for me, although this was far from a thorough test.

My gut feeling is to transcode first and change the encoding after: while the database is still SQL_ASCII, you won't have to deal with errors from PostgreSQL trying to interpret not-yet-transcoded or improperly transcoded data, and you can look at the data with relative impunity. OTOH, changing the encoding first guarantees that only subsequently-connecting backends will write data in UTF8...
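
As an illustration of inspecting the data while the database is still SQL_ASCII, here is a read-only preview that reuses the predicate from the question to list rows the transcoding UPDATE would touch. This is a sketch under assumptions: tbl and str are the question's placeholder names, and the text-to-bytea cast is taken as-is from the question. As far as I know that cast parses the value with bytea input rules, so it presumably only behaves as intended when the column contains no backslashes.

```sql
-- Preview only; nothing is modified.
-- Shows each affected value next to the UTF-8 bytes it would become.
SELECT str,
       convert(str::bytea, 'LATIN1', 'UTF8') AS utf8_bytes
FROM tbl
WHERE str::bytea <> convert(str::bytea, 'LATIN1', 'UTF8')
LIMIT 20;
```

Pure-ASCII values are byte-identical in LATIN1 and UTF-8, so they are filtered out by the WHERE clause.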

Also check things like function bodies, view definitions, constraint definitions, etc., which may need transcoding too (you'd hope not, but...).
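
One possible way to run that check, sketched against the standard system catalogs. It is meant to be run while the database is still SQL_ASCII. The pattern is my own choice, not something from the question: it flags any character outside printable ASCII and whitespace, so it may also report stray control characters.

```sql
-- List catalog objects whose stored text contains non-ASCII characters.
SELECT 'function' AS kind, p.oid::regprocedure::text AS object
FROM pg_proc p
WHERE p.prosrc ~ '[^ -~[:space:]]'
UNION ALL
SELECT 'view', v.schemaname || '.' || v.viewname
FROM pg_views v
WHERE v.definition ~ '[^ -~[:space:]]'
UNION ALL
SELECT 'constraint', c.conname::text
FROM pg_constraint c
WHERE pg_get_constraintdef(c.oid) ~ '[^ -~[:space:]]';
```

An empty result suggests that only table data needs transcoding. Other catalog text, such as comments or column defaults, is not covered by this sketch.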

1 Comment

vladr Over a year ago

Turns out casting via domains is not truly supported. Awarding anyway. :)
