
I am using the JDBC URL below in PySpark to write a data frame to an Azure SQL Database. However, I feel that the performance of this write operation is not up to the mark and could be improved by setting a few extra properties. Are there any workarounds, or any parameters I can add, to improve the JDBC write performance?

jdbcUrl = "jdbc:sqlserver://server.database.windows.net:1433;databaseName=test;enablePrepareOnFirstPreparedStatementCall=false"

Below is the actual data frame write statement.

data_frame.write \
    .mode('overwrite') \
    .format('jdbc') \
    .option('driver', jdbc_driver) \
    .option('user', user) \
    .option('password', password) \
    .option('url', jdbcUrl) \
    .option('dbtable', table + '_STG') \
    .save()
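For context, the generic Spark JDBC sink itself exposes a documented `batchsize` option (rows per JDBC batch, default 1000) and a `numPartitions` option that caps write parallelism, since each DataFrame partition writes over its own connection. A minimal sketch of assembling such options; the helper name and the numeric values are illustrative starting points, not tuned recommendations:

```python
# Sketch: build an options mapping for DataFrame.write.format('jdbc').
# "batchsize" and "numPartitions" are documented Spark JDBC options;
# the default values below are illustrative guesses, not benchmarks.
def jdbc_write_options(user, password, url, table,
                       batchsize=10000, num_partitions=8):
    """Return an options dict for the generic Spark JDBC writer."""
    return {
        "user": user,
        "password": password,
        "url": url,
        "dbtable": table,
        "batchsize": str(batchsize),           # rows per JDBC batch (default 1000)
        "numPartitions": str(num_partitions),  # max parallel write connections
    }

# Applying it would look like this (requires a live SparkSession and database):
# (data_frame.write.mode('overwrite').format('jdbc')
#            .options(**jdbc_write_options(user, password, jdbcUrl, table + '_STG'))
#            .save())
```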
  • Try setting the job to use the full resources by specifying the dynamic allocation property as true. Commented Jan 13, 2020 at 18:07

2 Answers


You can try the Spark to SQL DB connector to write data to the SQL database using bulk insert in Scala; please refer to the section Write data to Azure SQL database or SQL Server using Bulk Insert of the official Azure document Accelerate real-time big data analytics with Spark connector for Azure SQL Database and SQL Server.


So the problem now is how to pass a PySpark dataframe data_frame from Python to the Scala code. You can use the dataframe's registerTempTable function (or its non-deprecated equivalent, createOrReplaceTempView) with a table name such as temp_table, as in the code below, in a Databricks Python notebook.

# register a temp table for a dataframe in Python
data_frame.registerTempTable("temp_table")

%scala
val scalaDF = table("temp_table")


Then run the bulk insert code in Scala after %scala:

%scala
import com.microsoft.azure.sqldb.spark.bulkcopy.BulkCopyMetadata
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._

/**
  Add column metadata.
  If not specified, metadata is read automatically
  from the destination table, which can hurt performance.
*/
var bulkCopyMetadata = new BulkCopyMetadata
bulkCopyMetadata.addColumnMetadata(1, "Title", java.sql.Types.NVARCHAR, 128, 0)
bulkCopyMetadata.addColumnMetadata(2, "FirstName", java.sql.Types.NVARCHAR, 50, 0)
bulkCopyMetadata.addColumnMetadata(3, "LastName", java.sql.Types.NVARCHAR, 50, 0)

val bulkCopyConfig = Config(Map(
  "url"               -> "mysqlserver.database.windows.net",
  "databaseName"      -> "MyDatabase",
  "user"              -> "username",
  "password"          -> "*********",
  "dbTable"           -> "dbo.Clients",
  "bulkCopyBatchSize" -> "2500",
  "bulkCopyTableLock" -> "true",
  "bulkCopyTimeout"   -> "600"
))

scalaDF.bulkCopyToSqlDB(bulkCopyConfig, bulkCopyMetadata)



Performance can be optimized using the Apache Spark connector for SQL Server & Azure SQL.

First install the com.microsoft.sqlserver.jdbc.spark library using its Maven coordinate on the Databricks cluster, then use the code below.

https://learn.microsoft.com/en-us/sql/connect/spark/connector?view=sql-server-ver15

df.write \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .mode("overwrite") \
    .option("url", url) \
    .option("dbtable", table_name) \
    .option("user", username) \
    .option("password", password) \
    .option("mssqlIsolationLevel", "READ_UNCOMMITTED") \
    .option("batchsize", "10000") \
    .save()

Use mode("append") instead of "overwrite" if you want to keep existing rows, and tune the batchsize value (10000 here is only a starting point) to your workload.
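One rough way to pick batchsize and write parallelism together is to repartition the dataframe so each partition holds a few JDBC batches, so every connection does meaningful work without oversized transactions. The helper below is a hypothetical heuristic, not part of the connector, and the batches-per-partition default is an assumption:

```python
import math

def partitions_for(total_rows, batch_size, batches_per_partition=4):
    """Illustrative heuristic: target a few JDBC batches per partition.

    Returns a partition count >= 1 for df.repartition(n) before the write.
    batches_per_partition=4 is an arbitrary illustrative default.
    """
    rows_per_partition = batch_size * batches_per_partition
    return max(1, math.ceil(total_rows / rows_per_partition))

# e.g. 1,000,000 rows with a batchsize of 10,000:
# partitions_for(1_000_000, 10_000) -> 25
# then: df.repartition(25).write...
```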
