6

I'm writing data (approx. 83M records) from a dataframe into PostgreSQL and it's quite slow: it takes 2.7 hours to finish writing to the db.

Looking at the executors, there is only one active task running on just one executor. Is there any way I could parallelize the writes into db using all executors in Spark?

...
val prop = new Properties()
prop.setProperty("user", DB_USER)
prop.setProperty("password", DB_PASSWORD)
prop.setProperty("driver", "org.postgresql.Driver")



salesReportsDf.write
              .mode(SaveMode.Append)
              .jdbc(s"jdbc:postgresql://$DB_HOST:$DB_PORT/$DATABASE", REPORTS_TABLE, prop)

Thanks

2
  • Can you add the part of the code is writing to the PostGres? Commented Sep 8, 2016 at 17:46
  • @ThiagoBaldim just posted the code snippet for that, thanks Commented Sep 8, 2016 at 17:53

2 Answers

5

So I figured out the problem. Basically, repartitioning my dataframe increased the database write throughput by 100%.

def srcTable(config: Config): Map[String, String] = {

  val SERVER             = config.getString("db_host")
  val PORT               = config.getInt("db_port")
  val DATABASE           = config.getString("database")
  val USER               = config.getString("db_user")
  val PASSWORD           = config.getString("db_password")
  val TABLE              = config.getString("table")
  val PARTITION_COL      = config.getString("partition_column")
  val LOWER_BOUND        = config.getString("lowerBound")
  val UPPER_BOUND        = config.getString("upperBound")
  val NUM_PARTITION      = config.getString("numPartitions")

  Map(
    "url"     -> s"jdbc:postgresql://$SERVER:$PORT/$DATABASE",
    "driver"  -> "org.postgresql.Driver",
    "dbtable" -> TABLE,
    "user"    -> USER,
    "password"-> PASSWORD,
    "partitionColumn" -> PARTITION_COL,
    "lowerBound" -> LOWER_BOUND,
    "upperBound" -> UPPER_BOUND,
    "numPartitions" -> NUM_PARTITION
  )

}
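For the write side specifically, a minimal sketch of the repartition-before-write idea (reusing the `salesReportsDf`/`prop` names from the question; the partition count of 16 is an illustrative assumption, not from the original post):

```scala
import java.util.Properties
import org.apache.spark.sql.SaveMode

// Connection properties as in the question.
val prop = new Properties()
prop.setProperty("user", DB_USER)
prop.setProperty("password", DB_PASSWORD)
prop.setProperty("driver", "org.postgresql.Driver")

// Each DataFrame partition is written by a separate task over its own
// JDBC connection, so the write parallelism equals the partition count.
// 16 is an assumed value; tune it to your cluster and the DB's
// connection limits.
salesReportsDf
  .repartition(16)
  .write
  .mode(SaveMode.Append)
  .jdbc(s"jdbc:postgresql://$DB_HOST:$DB_PORT/$DATABASE", REPORTS_TABLE, prop)
```

Note that the `partitionColumn`/`lowerBound`/`upperBound` options in the map above only control parallel *reads*; for writes it is the DataFrame's own partitioning that matters.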

4 Comments

could you please provide more details (like the partition counts before and after repartitioning) in your answer? I'm also facing a similar issue; your help in this regard will be greatly appreciated - Thanks
@user2359997 updated my answer, depending on the size of your table - you can specify the number of partitions so each executor can parallelize the ingestion of the data.
@AdetiloyePhilipKehinde lowerBound,upperBound, and partitionColumn shouldn't make any difference for writing to your database. These options "...applies only to reading" according to the docs: spark.apache.org/docs/latest/…
@AdetiloyePhilipKehinde could you specify the whole write query for better understanding that you used to speed up the process?
4

Spark also has an option called "batchsize" when writing via JDBC. The default value is pretty low (1000).

connectionProperties.put("batchsize", "100000")

Setting it to a much higher value should speed up writes to external databases.
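Combining the larger batch size with multiple write partitions, a minimal sketch (the `salesReportsDf`/`DB_*` names are carried over from the question; the batch size of 100000 and partition count of 16 are illustrative values):

```scala
import java.util.Properties
import org.apache.spark.sql.SaveMode

val prop = new Properties()
prop.setProperty("user", DB_USER)
prop.setProperty("password", DB_PASSWORD)
prop.setProperty("driver", "org.postgresql.Driver")
// Rows buffered per JDBC batch INSERT; the default is 1000.
prop.setProperty("batchsize", "100000")

salesReportsDf
  .repartition(16) // parallel write tasks, one JDBC connection each
  .write
  .mode(SaveMode.Append)
  .jdbc(s"jdbc:postgresql://$DB_HOST:$DB_PORT/$DATABASE", REPORTS_TABLE, prop)
```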

Comments
