I'm reading data from a PostgreSQL 8.4 database into PySpark using the JDBC connector. The database's server_encoding is SQL_ASCII.
When I query the table directly in pgAdmin, names like SÉRGIO or AURÉLIO display correctly. However, when I load the same data in Spark, I get broken characters such as:
S�RGIO MARTINS DOS SANTOS
Here’s how I’m connecting:
conn = (spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/dbname")
    .option("dbtable", "public.my_table")
    .option("user", user)
    .option("password", pw)
    .option("driver", "org.postgresql.Driver")
    .load()
)
I’ve tried adding:
.option("sessionInitStatement", "SET client_encoding TO 'WIN1252'") # or Latin1
.option("url", "jdbc:postgresql://host:5432/dbname?charSet=WIN1252")
but the characters are still garbled.
Question:
How can I force Spark (via JDBC) to decode the text correctly when the PostgreSQL server_encoding is SQL_ASCII?
Answer:
Read what SQL_ASCII means and ask yourself why you aren't using Unicode if you want to store non-US text. From the PostgreSQL documentation:

"When the server character set is SQL_ASCII, the server interprets byte values 0-127 according to the ASCII standard, while byte values 128-255 are taken as uninterpreted characters. No encoding conversion will be done when the setting is SQL_ASCII."

In other words, the server stores and returns whatever bytes it was given and never converts them, which is why setting client_encoding (your sessionInitStatement attempt) has no effect. This is a critical bug that must be fixed in the database itself. The question "Postgres change Encoding SQL_ASCII to UTF8" addresses your real problem.
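To make the mechanics concrete, here is a minimal sketch in plain Python (outside Spark, using a made-up name value) of what happens to the bytes. It assumes the data was originally written by a Windows-1252 or Latin-1 client, as you already suspect: the server stores those bytes untouched, the JDBC driver then decodes them as UTF-8, and that is where the replacement characters come from.

# Bytes as they sit in the SQL_ASCII database: written by a Windows-1252
# (or Latin-1) client and stored verbatim, since the server does no conversion.
raw = b"S\xc9RGIO MARTINS DOS SANTOS"   # 0xC9 is 'É' in Windows-1252/Latin-1

# What effectively happens when the driver assumes the bytes are UTF-8:
print(raw.decode("utf-8", errors="replace"))   # S�RGIO MARTINS DOS SANTOS

# Decoding with the encoding the data was actually written in recovers the text:
print(raw.decode("cp1252"))                    # SÉRGIO MARTINS DOS SANTOS

Once the database is converted to UTF8 (as described in the linked question), the server knows what its bytes mean, the JDBC driver receives valid UTF-8, and no Spark-side options are needed.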