
I am trying to load 1 million records from a Delta table in Databricks into an Azure SQL database using the recently released Microsoft connector that supports the Python API and Spark 3.0.

Performance does not look great to me: it takes 19 minutes to load 1 million records. Below is the code I am using. Do you think I am missing something here?

Configuration: 8 worker nodes with 28 GB memory and 8 cores each. The Azure SQL database is a 4-vCore Gen5.

try:
    df.write \
      .format("com.microsoft.sqlserver.jdbc.spark") \
      .mode("overwrite") \
      .option("url", url) \
      .option("dbtable", "lending_club_acc_loans") \
      .option("user", username) \
      .option("password", password) \
      .option("tableLock", "true") \
      .option("batchsize", "200000") \
      .option("reliabilityLevel", "BEST_EFFORT") \
      .save()
except ValueError as error:
    print("Connector write failed", error)

Is there something I can do to boost the performance?

[screenshot of SQL wait statistics, added later — see comments]

  • Have you identified whether this is on the Databricks or the Azure SQL side? What do you see as wait types in SQL when you run this? You could simply be throttled by the log throughput on the SQL side, in which case the resolution is to scale up your SQL DB for the duration of the load and then scale down again. Commented Feb 24, 2021 at 22:49
  • I am not sure where the problem is, Martin. I don't think scaling the DB would make any difference, as the database is already a really powerful one. The I/O on the DB is really low and under control, so I am assuming the problem is on the Databricks end, something to do with the options here. Commented Feb 24, 2021 at 22:58
  • Are you certain about the database? Have a look at the following: youtube.com/… — start from 1:39:00 to get the info on log rate governance, which can still occur even if the CPU consumption and data I/O look low. Commented Feb 24, 2021 at 23:18
  • OK, so I did what you said. It kinda helps... The CPU and memory wait times are very low and acceptable, but the Network IO wait time is pretty high. Commented Feb 25, 2021 at 17:44
  • I just added the screenshot to the question. Commented Feb 25, 2021 at 17:45

1 Answer


Repartition the data frame. My source data frame originally had a single partition; repartitioning it to 8 improved the performance.
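A minimal sketch of that change applied to the write from the question. This assumes `df`, `url`, `username`, and `password` are defined as in the question; the partition count of 8 is an assumption that simply matches the 8 worker nodes, so tune it to your own cluster:

```python
# Sketch only: spread the write across executors by repartitioning first.
# You can inspect the current partition count with df.rdd.getNumPartitions().
df.repartition(8) \
  .write \
  .format("com.microsoft.sqlserver.jdbc.spark") \
  .mode("overwrite") \
  .option("url", url) \
  .option("dbtable", "lending_club_acc_loans") \
  .option("user", username) \
  .option("password", password) \
  .option("tableLock", "true") \
  .option("batchsize", "200000") \
  .option("reliabilityLevel", "BEST_EFFORT") \
  .save()
```

With a single partition, only one executor does the JDBC write regardless of cluster size; repartitioning lets each partition be written in parallel.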
