
I am using a standalone cluster on my local Windows machine and am trying to load data from one of our servers using the following code:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.load(source="jdbc", url="jdbc:postgresql://host/dbname", dbtable="schema.tablename")

I have set SPARK_CLASSPATH as follows:

os.environ['SPARK_CLASSPATH'] = "C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\postgresql-9.2-1002.jdbc3.jar"

While executing sqlContext.load, it throws an error saying "No suitable driver found for jdbc:postgresql". I have tried searching the web but have not been able to find a solution.

  • It's "No suitable driver found for jdbc:postgresql" only; I've updated the question. Commented Apr 16, 2015 at 9:22
  • Well, in that case the required jar file with the driver is not available. Commented Apr 16, 2015 at 9:23
  • The required jar file is present, but somehow Spark is not able to recognize it. There is some issue with SPARK_CLASSPATH; I am not sure how to set it. Commented Apr 16, 2015 at 9:26
  • "..\postgresql-9.2-1002.jdbc3" doesn't sound like the name of a jar file, as they usually end in .jar. You need to add the jar file to the classpath, not the folder containing the jar file. Commented Apr 18, 2015 at 16:04
  • Added that, Mark, but it's still not working... Commented Apr 20, 2015 at 6:15

2 Answers


Maybe this will be helpful.

In my environment, SPARK_CLASSPATH contains the path to the PostgreSQL connector:

from pyspark import SparkContext, SparkConf
from pyspark.sql import DataFrameReader, SQLContext
import os

sparkClassPath = os.getenv('SPARK_CLASSPATH', '/path/to/connector/postgresql-42.1.4.jar')

# Populate configuration
conf = SparkConf()
conf.setAppName('application')
conf.set('spark.jars', 'file:%s' % sparkClassPath)
conf.set('spark.executor.extraClassPath', sparkClassPath)
conf.set('spark.driver.extraClassPath', sparkClassPath)
# Uncomment the line below and change the IP address if you need to use a cluster at a different address
#conf.set('spark.master', 'spark://127.0.0.1:7077')

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

url = 'postgresql://127.0.0.1:5432/postgresql'
properties = {'user':'username', 'password':'password'}

df = DataFrameReader(sqlContext).jdbc(url='jdbc:%s' % url, table='tablename', properties=properties)

df.printSchema()
df.show()

This piece of code lets you use PySpark wherever you need it. For example, I've used it in a Django project.
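If you prefer the options-based reader API (available from Spark 1.4 on), the same read can also be written as below. This is only a sketch; the url, dbtable, user and password values are placeholders to replace with your own connection details.

# Same read via sqlContext.read (Spark 1.4+); all connection values below are placeholders
df = sqlContext.read \
    .format('jdbc') \
    .option('url', 'jdbc:postgresql://127.0.0.1:5432/postgresql') \
    .option('dbtable', 'tablename') \
    .option('user', 'username') \
    .option('password', 'password') \
    .load()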




I had the same problem with MySQL and was never able to get it to work with the SPARK_CLASSPATH approach. However, I did get it to work with extra command-line arguments; see the answer to this question.

To avoid having to click through to get it working, here's what you have to do:

pyspark --conf spark.executor.extraClassPath=<jdbc.jar> --driver-class-path <jdbc.jar> --jars <jdbc.jar> --master <master-URL>
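For example (the jar path and master URL here are placeholders; substitute the path to your own driver jar):

pyspark --conf spark.executor.extraClassPath=/path/to/postgresql-42.1.4.jar --driver-class-path /path/to/postgresql-42.1.4.jar --jars /path/to/postgresql-42.1.4.jar --master local[*]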

3 Comments

  • Uhm, the line of code is incomplete, right? Which flags need a value after them?
  • Yep, the markup was lost somehow; I edited it back in.
  • Any idea how to do this in PyCharm?
