
I am using the JDBC URL below in PySpark to write a data frame to an Azure SQL Database. However, I feel that the performance of this write operation is not up to the mark and could be improved by setting a few extra properties. Are there any workarounds, or any parameters I can add, to improve the JDBC write performance?

jdbcUrl = "jdbc:sqlserver://server.database.windows.net:1433;databaseName=test;enablePrepareOnFirstPreparedStatementCall=false"

Below is the actual data frame write statement.

data_frame.write \
    .mode('overwrite') \
    .format('jdbc') \
    .option('driver', jdbc_driver) \
    .option('user', user) \
    .option('password', password) \
    .option('url', jdbcUrl) \
    .option('dbtable', table + '_STG') \
    .save()
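For context, the generic Spark JDBC sink itself exposes a documented `batchsize` option (rows per JDBC batch, default 1000) and a `numPartitions` option that caps write parallelism, since each DataFrame partition writes over its own connection. A minimal sketch of assembling such options; the helper name and the numeric values are illustrative starting points, not tuned recommendations:

```python
# Sketch: build an options mapping for DataFrame.write.format('jdbc').
# "batchsize" and "numPartitions" are documented Spark JDBC options;
# the default values below are illustrative guesses, not benchmarks.
def jdbc_write_options(user, password, url, table,
                       batchsize=10000, num_partitions=8):
    """Return an options dict for the generic Spark JDBC writer."""
    return {
        "user": user,
        "password": password,
        "url": url,
        "dbtable": table,
        "batchsize": str(batchsize),           # rows per JDBC batch (default 1000)
        "numPartitions": str(num_partitions),  # max parallel write connections
    }

# Applying it would look like this (requires a live SparkSession and database):
# (data_frame.write.mode('overwrite').format('jdbc')
#            .options(**jdbc_write_options(user, password, jdbcUrl, table + '_STG'))
#            .save())
```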
  • Try setting the job to use the full resources by specifying the dynamic allocation property as true. Commented Jan 13, 2020 at 18:07

2 Answers


You can try the Spark to SQL DB connector to write data to the SQL database using bulk insert in Scala; please refer to the section Write data to Azure SQL database or SQL Server using Bulk Insert of the official Azure document Accelerate real-time big data analytics with Spark connector for Azure SQL Database and SQL Server.


So the problem now is how to pass a PySpark dataframe data_frame from Python to the Scala code. You can use the dataframe's registerTempTable function (or its non-deprecated equivalent, createOrReplaceTempView) with a table name such as temp_table, as in the code below, in a Databricks Python notebook.

# register a temp table for a dataframe in Python
data_frame.registerTempTable("temp_table")

%scala
val scalaDF = table("temp_table")


Then run the bulk insert code in Scala after %scala:

%scala
import com.microsoft.azure.sqldb.spark.bulkcopy.BulkCopyMetadata
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._

/**
  Add column metadata.
  If not specified, metadata is read automatically
  from the destination table, which can hurt performance.
*/
var bulkCopyMetadata = new BulkCopyMetadata
bulkCopyMetadata.addColumnMetadata(1, "Title", java.sql.Types.NVARCHAR, 128, 0)
bulkCopyMetadata.addColumnMetadata(2, "FirstName", java.sql.Types.NVARCHAR, 50, 0)
bulkCopyMetadata.addColumnMetadata(3, "LastName", java.sql.Types.NVARCHAR, 50, 0)

val bulkCopyConfig = Config(Map(
  "url"               -> "mysqlserver.database.windows.net",
  "databaseName"      -> "MyDatabase",
  "user"              -> "username",
  "password"          -> "*********",
  "dbTable"           -> "dbo.Clients",
  "bulkCopyBatchSize" -> "2500",
  "bulkCopyTableLock" -> "true",
  "bulkCopyTimeout"   -> "600"
))

scalaDF.bulkCopyToSqlDB(bulkCopyConfig, bulkCopyMetadata)



Performance can be optimized using the Apache Spark connector for SQL Server & Azure SQL.

First install the com.microsoft.sqlserver.jdbc.spark library using its Maven coordinate on the Databricks cluster, then use the code below.

https://learn.microsoft.com/en-us/sql/connect/spark/connector?view=sql-server-ver15

df.write \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .mode("overwrite") \
    .option("url", url) \
    .option("dbtable", table_name) \
    .option("user", username) \
    .option("password", password) \
    .option("mssqlIsolationLevel", "READ_UNCOMMITTED") \
    .option("batchsize", "10000") \
    .save()

Use mode("append") instead of "overwrite" if you want to keep existing rows, and tune the batchsize value (10000 here is only a starting point) to your workload.
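One rough way to pick batchsize and write parallelism together is to repartition the dataframe so each partition holds a few JDBC batches, so every connection does meaningful work without oversized transactions. The helper below is a hypothetical heuristic, not part of the connector, and the batches-per-partition default is an assumption:

```python
import math

def partitions_for(total_rows, batch_size, batches_per_partition=4):
    """Illustrative heuristic: target a few JDBC batches per partition.

    Returns a partition count >= 1 for df.repartition(n) before the write.
    batches_per_partition=4 is an arbitrary illustrative default.
    """
    rows_per_partition = batch_size * batches_per_partition
    return max(1, math.ceil(total_rows / rows_per_partition))

# e.g. 1,000,000 rows with a batchsize of 10,000:
# partitions_for(1_000_000, 10_000) -> 25
# then: df.repartition(25).write...
```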
