I'm reading data from a PostgreSQL 8.4 database into PySpark using the JDBC connector. The database's server_encoding is SQL_ASCII.

When I query the table directly in pgAdmin, names like SÉRGIO or AURÉLIO display correctly. However, when I load the same data in Spark, I get broken characters such as:

S�RGIO MARTINS DOS SANTOS

Here’s how I’m connecting:

conn = (spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/dbname")
    .option("dbtable", "public.my_table")
    .option("user", user)
    .option("password", pw)
    .option("driver", "org.postgresql.Driver")
    .load()
)

I’ve tried adding:

.option("sessionInitStatement", "SET client_encoding TO 'WIN1252'") # or Latin1
.option("url", "jdbc:postgresql://host:5432/dbname?charSet=WIN1252")

but the characters are still garbled.

Question:

How can I force Spark (via JDBC) to decode the text correctly when the PostgreSQL server_encoding is SQL_ASCII?

  • What do you assume SQL_ASCII means and why aren't you using Unicode if you want to store non-US text? Commented Nov 12 at 13:58
  • From the PostgreSQL docs: "When the server character set is SQL_ASCII, the server interprets byte values 0–127 according to the ASCII standard, while byte values 128–255 are taken as uninterpreted characters. No encoding conversion will be done when the setting is SQL_ASCII." This is a critical bug that must be fixed. The question "Postgres change Encoding SQL_ASCII to UTF8" addresses your real problem. Commented Nov 12 at 14:00
  • A possible hack could be to tell the database driver (i.e. the JDBC PostgreSQL driver) to assume the text encoding is Latin1 and translate it to Unicode (Java strings are Unicode), but that may not be possible. This is 1000% a database design problem. Commented Nov 12 at 14:13
  • And confirmation that you'll have to fix the database, you can't force the encoding. Laurenz Albe is one of the developers of PostgreSQL, so if he says you can't .... Try doing what he mentions. Dump the database, create a new one with WIN1252 and load the dump into the new database. Commented Nov 12 at 14:28
  • Since this PostgreSQL instance is version 8.4, I can’t connect to it using pgAdmin 4. Because of that, I’m accessing it through phpPgAdmin instead. However, when I try to generate a dump using phpPgAdmin, the resulting .sql file comes out completely empty. Commented Nov 13 at 17:17

1 Answer

The database encoding SQL_ASCII means “garbage in - garbage out”. PostgreSQL won't attempt any character set conversion, no matter how you set client_encoding. So the characters will arrive in your application just like they were sent to the database in the first place.

Now the JDBC driver expects data in UTF-8 encoding (and sets client_encoding appropriately), so it will gag if you stored the data in the database in some other encoding. Given the age of your database, perhaps the data were stored in a single-byte encoding (Windows?).
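
The mismatch can be reproduced without a database. The sketch below assumes, purely for illustration, that the name was originally written as WIN1252 bytes (the real encoding still has to be determined) and that the client decodes what it receives as UTF-8, replacing invalid bytes:

```python
# Assumption: the stored bytes are WIN1252 (cp1252); the actual encoding is unknown.
original = "S\u00c9RGIO"            # "SÉRGIO"
raw = original.encode("cp1252")     # bytes a WIN1252 client would have sent

# A SQL_ASCII server hands these bytes back untouched. Read as UTF-8, the
# byte 0xC9 starts an invalid sequence and is replaced by U+FFFD.
as_utf8 = raw.decode("utf-8", errors="replace")

# Decoding with the encoding the data was written in recovers the name.
as_cp1252 = raw.decode("cp1252")

print(raw)
print(as_utf8)
print(as_cp1252)
```

The second value is the same broken string shown in the question, which suggests the stored bytes are in a single-byte encoding rather than UTF-8.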

There is no way you can make this work. All you can do is pg_dump the database with a -E option that corresponds to the actual encoding of the data, which you have to figure out first. Then you can restore that dump to a database (with a recent version!) that was created with a reasonable server encoding, that is UTF8. If you are lucky, all data are in the same encoding, and your dump will restore without error. Otherwise, recovering your data will become more complicated.
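
One way to approach the "figure out the actual encoding" step is to take a few raw byte strings from the database and decode them with several candidate codecs. This is only a rough sketch: `try_decodings`, the sample bytes, and the candidate list are all hypothetical and not taken from the question.

```python
from typing import Dict, Iterable, Optional


def try_decodings(raw: bytes, candidates: Iterable[str]) -> Dict[str, Optional[str]]:
    """Decode raw with each candidate codec; None marks a codec that rejects the bytes."""
    results: Dict[str, Optional[str]] = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Hypothetical sample: bytes as they might sit in the SQL_ASCII database.
sample = b"AUR\xc9LIO"
results = try_decodings(sample, ["utf-8", "cp1252", "latin-1", "cp850"])
for codec, text in results.items():
    print(codec, repr(text))
```

A clean decode proves little for single-byte codecs, since most of them accept almost any byte. The output has to be compared by eye against names you know to be correct, and ideally checked on rows from different periods in case the data are in mixed encodings.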

Once your data are correctly stored in a UTF8 database, everything should work.


5 Comments

In a comment on the question, the OP mentions they got an empty dump. It seems there are tool incompatibilities involved too ... reminds me of the 2017 GitLab data loss
Yes, a plain pg_dump is the way forward.
Assuming the DBA has proper dumps, could they use the latest without specifying -E ? I'm asking because it seems there are administration issues behind this question. If the OP can't take a new dump, could they ask the DBA for the latest one?
The alternative is to edit the (plain format) dump and change the SET client_encoding line to the actual encoding of the data.
I will try this solution and bring news about the results soon. Thank you.
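
The dump edit suggested in the comments above could be scripted roughly as follows. This is a sketch under assumptions: it expects the plain-format dump to contain a line of the form `SET client_encoding = '...';`, the helper name `set_dump_client_encoding` is made up, and WIN1252 is only a guess at the real encoding. The dump is handled as bytes so the data themselves are never re-encoded.

```python
import re


def set_dump_client_encoding(dump: bytes, encoding: str) -> bytes:
    """Rewrite the first SET client_encoding line of a plain-format dump."""
    # Assumed line format: SET client_encoding = 'SQL_ASCII';
    pattern = rb"^SET client_encoding = '[^']*';"
    replacement = b"SET client_encoding = '" + encoding.encode("ascii") + b"';"
    new_dump, count = re.subn(pattern, replacement, dump, count=1, flags=re.MULTILINE)
    if count == 0:
        raise ValueError("no SET client_encoding line found")
    return new_dump


# Tiny stand-in for a real dump file, including one non-ASCII data byte.
sample_dump = (
    b"--\n-- PostgreSQL database dump\n--\n\n"
    b"SET client_encoding = 'SQL_ASCII';\n"
    b"INSERT INTO t VALUES ('S\xc9RGIO');\n"
)
fixed_dump = set_dump_client_encoding(sample_dump, "WIN1252")
print(fixed_dump.decode("cp1252"))
```

For a real dump, read and write the file in binary mode (`"rb"` / `"wb"`) and keep the original file until the restore into the UTF8 database has been verified.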
