Spark JDBC reading wrong character encoding from PostgreSQL with server_encoding = SQL_ASCII

Question

I'm reading data from a PostgreSQL 8.4 database into PySpark using the JDBC connector. The database's server_encoding is SQL_ASCII.

When I query the table directly in pgAdmin, names like SÉRGIO or AURÉLIO display correctly. However, when I load the same data in Spark, I get broken characters such as:

S�RGIO MARTINS DOS SANTOS

Here’s how I’m connecting:

conn = (spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/dbname")
    .option("dbtable", "public.my_table")
    .option("user", user)
    .option("password", pw)
    .option("driver", "org.postgresql.Driver")
    .load()
)

I’ve tried adding:

.option("sessionInitStatement", "SET client_encoding TO 'WIN1252'") # or Latin1
.option("url", "jdbc:postgresql://host:5432/dbname?charSet=WIN1252")

but the characters are still garbled.

Question:

How can I force Spark (via JDBC) to decode the text correctly when the PostgreSQL server_encoding is SQL_ASCII?

What do you assume SQL_ASCII means, and why aren't you using Unicode if you want to store non-US text?
– Panagiotis Kanavos, Commented Nov 12 at 13:58
From the PostgreSQL docs: "When the server character set is SQL_ASCII, the server interprets byte values 0–127 according to the ASCII standard, while byte values 128–255 are taken as uninterpreted characters. No encoding conversion will be done when the setting is SQL_ASCII." This is a critical bug that must be fixed. The question "Postgres change Encoding SQL_ASCII to UTF8" addresses your real problem.
– Panagiotis Kanavos, Commented Nov 12 at 14:00
A possible hack could be to tell the database driver (i.e. the JDBC PostgreSQL driver) to assume the text encoding is Latin1 and translate it to Unicode (Java strings are Unicode), but that may not be possible. This is 1000% a database design problem.
– Panagiotis Kanavos, Commented Nov 12 at 14:13
And confirmation that you'll have to fix the database; you can't force the encoding. Laurenz Albe is one of the developers of PostgreSQL, so if he says you can't ... Try doing what he mentions: dump the database, create a new one with WIN1252 and load the dump into the new database.
– Panagiotis Kanavos, Commented Nov 12 at 14:28
Since this PostgreSQL instance is version 8.4, I can’t connect to it using pgAdmin 4. Because of that, I’m accessing it through phpPgAdmin instead. However, when I try to generate a dump using phpPgAdmin, the resulting .sql file comes out completely empty.
– Thiago Luan, Commented Nov 13 at 17:17

Laurenz Albe · Accepted Answer · 2025-11-13 20:36:08Z

The database encoding SQL_ASCII means “garbage in - garbage out”. PostgreSQL won't attempt any character set conversion, no matter how you set client_encoding. So the characters will arrive in your application just like they were sent to the database in the first place.

Now the JDBC driver expects data in UTF-8 encoding (and sets client_encoding appropriately), so it will gag if you stored the data in the database in some other encoding. Given the age of your database, perhaps the data were stored in a single-byte encoding (Windows?).
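
To make the mismatch concrete, here is a small Python sketch of what happens to the bytes. It assumes, purely for illustration, that the name was originally written by a WIN1252 client; the real encoding of the data is not known from the question. The replacement character shown is Python's behavior with errors="replace", and the JDBC driver may react differently to the same bytes.

```python
# Assumption for illustration: the name was stored by a client using WIN1252 (cp1252).
original = "SÉRGIO"

# A SQL_ASCII database keeps whatever bytes the client sent, unchanged.
stored_bytes = original.encode("cp1252")

# A consumer that assumes UTF-8 cannot interpret the lone byte 0xC9,
# so with errors="replace" it substitutes U+FFFD, the replacement character.
as_utf8 = stored_bytes.decode("utf-8", errors="replace")

# Decoding with the encoding the writer actually used recovers the text.
as_cp1252 = stored_bytes.decode("cp1252")

print(stored_bytes)
print(as_utf8)
print(as_cp1252)
```

The same bytes are readable or garbled depending only on which encoding the reader assumes, which is why changing client_encoding on a SQL_ASCII database has no effect: the server never converts anything.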

There is no way you can make this work. All you can do is pg_dump the database with a -E option that corresponds to the actual encoding of the data, which you have to figure out first. Then you can restore that dump to a database (with a recent version!) that was created with a reasonable server encoding, that is UTF8. If you are lucky, all data are in the same encoding, and your dump will restore without error. Otherwise, recovering your data will become more complicated.
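
As a rough aid for the "figure out the encoding first" step, the following Python sketch surveys raw lines and counts how many are pure ASCII, valid UTF-8, or neither. The function names classify_line and survey are hypothetical, and the sample data is made up. The sketch only separates UTF-8 from everything else; it cannot tell LATIN1 from WIN1252, so lines reported as "other" still need to be inspected by eye.

```python
from collections import Counter
from typing import Iterable


def classify_line(raw: bytes) -> str:
    """Classify one line of raw bytes as 'ascii', 'utf-8' or 'other'."""
    try:
        raw.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: probably some single-byte encoding.
        return "other"


def survey(lines: Iterable[bytes]) -> Counter:
    """Count how many lines fall into each category."""
    return Counter(classify_line(line) for line in lines)


# Made-up sample: one ASCII line, one cp1252 line, one UTF-8 line.
sample = [
    b"plain ascii\n",
    "SÉRGIO\n".encode("cp1252"),
    "AURÉLIO\n".encode("utf-8"),
]
counts = survey(sample)
print(counts)
```

To run this against real data, one option is to open a plain-format dump in binary mode and pass the file object to survey. If both "utf-8" and "other" counts are non-zero, the data are probably in mixed encodings, which is the more complicated case described above.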

Once your data are correctly stored in a UTF8 database, everything should work.

5 Comments

Panagiotis Kanavos Nov 14 at 7:38

In a comment on the question, the OP mentions they got an empty dump. It seems there are tool incompatibilities involved too... reminds me of the 2017 GitLab data loss.

Laurenz Albe Nov 14 at 8:45

Yes, a plain pg_dump is the way forward.

Panagiotis Kanavos Nov 14 at 8:52

Assuming the DBA has proper dumps, could they use the latest one without specifying -E? I'm asking because it seems there are administration issues behind this question. If the OP can't take a new dump, could they ask the DBA for the latest one?

Laurenz Albe Nov 14 at 9:03

The alternative is to edit the (plain format) dump and change the SET client_encoding line to the actual encoding of the data.

Thiago Luan Nov 14 at 11:15

I will try this solution and bring news about the results soon. Thank you.
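
The dump-editing alternative mentioned in the comments could be sketched in Python roughly as follows. This is an illustration, not a tested migration tool: the function name set_dump_encoding is hypothetical, WIN1252 is only an assumed target, and the sample dump is made up. It works on bytes so that the data themselves are never re-encoded; only the SET client_encoding line changes.

```python
import re

# Matches the line pg_dump writes near the top of a plain-format dump,
# for example: SET client_encoding = 'SQL_ASCII';
_CLIENT_ENCODING = re.compile(rb"^SET client_encoding = '[^']*';", re.MULTILINE)


def set_dump_encoding(dump: bytes, encoding: str) -> bytes:
    """Return the dump with its first SET client_encoding line replaced."""
    replacement = b"SET client_encoding = '" + encoding.encode("ascii") + b"';"
    # A function is used as the replacement so no escape processing happens.
    new_dump, n = _CLIENT_ENCODING.subn(lambda m: replacement, dump, count=1)
    if n == 0:
        raise ValueError("no SET client_encoding line found")
    return new_dump


# Made-up sample dump containing one cp1252-encoded name.
dump = (
    b"SET statement_timeout = 0;\n"
    b"SET client_encoding = 'SQL_ASCII';\n"
    b"INSERT INTO my_table VALUES ('S\xc9RGIO');\n"
)
fixed = set_dump_encoding(dump, "WIN1252")
print(fixed.decode("cp1252"))
```

When the edited dump is restored into a UTF8 database, the server converts from the declared client encoding, so the declared value has to match what the bytes really are.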
