I am trying to connect to an Oracle DB using PySpark.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

spark_config = SparkConf().setMaster(config['cluster']).setAppName('sim_transactions_test').set("jars", r"..\Lib\ojdbc7.jar")
sc = SparkContext(conf=spark_config)
sqlContext = SQLContext(sc)
df_sim_input = sqlContext.read\
.format("jdbc")\
.option("driver", "oracle.jdbc.driver.OracleDriver")\
.option("url", config["db.url"])\
.option("dbtable", query)\
.option("user", config["db.user"])\
.option("password", config["db.password"])\
.load()
This gives me the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o31.load.
: java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver
So it seems Spark cannot find the driver class, i.e. the jar file never ends up on the classpath of the SparkContext. It is possible to start a PySpark shell with external jars (for example with the --jars option), but I want to load them from the Python code.
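For reference, here is a sketch of what I am aiming for. It assumes the jar has to be registered under the `spark.jars` property (rather than the bare `jars` key I used above) and that this must happen before the JVM is started. The helper names `spark_settings`, `jdbc_options` and `load_query` are made up for illustration, and I have not confirmed that this is the recommended approach.

```python
import os


def spark_settings(jar_path):
    """Spark properties that ask Spark to put the JDBC driver jar on the classpath.

    Assumption: "spark.jars" is the property to use; the path is made absolute
    so it does not depend on the working directory of the JVM.
    """
    return {"spark.jars": os.path.abspath(jar_path)}


def jdbc_options(config, query):
    """Options for the JDBC reader, taken from the same config keys as above."""
    return {
        "driver": "oracle.jdbc.driver.OracleDriver",
        "url": config["db.url"],
        "dbtable": query,
        "user": config["db.user"],
        "password": config["db.password"],
    }


def load_query(config, query, jar_path="../Lib/ojdbc7.jar"):
    """Create (or reuse) a session with the jar configured and run the query."""
    # Imported lazily so the two helpers above can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = (
        SparkSession.builder
        .master(config["cluster"])
        .appName("sim_transactions_test")
    )
    # The settings only take effect if no SparkContext is running yet.
    for key, value in spark_settings(jar_path).items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()

    return spark.read.format("jdbc").options(**jdbc_options(config, query)).load()
```

With this, the call would be `df_sim_input = load_query(config, query)`, where `config` and `query` are the same objects as in my code above.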
Can someone explain how to add this external jar from Python and run a query against an Oracle DB?
Extra question: why does the same code work for a Postgres DB without adding an external JDBC jar? Is it because the driver is already installed on my system, so Spark finds it automatically?