6

I'm writing data (approx. 83M records) from a dataframe into PostgreSQL and it's quite slow: it takes 2.7 hours to finish writing to the db.

Looking at the executors, there is only one active task running on just one executor. Is there any way I could parallelize the writes into db using all executors in Spark?

...
val prop = new Properties()
prop.setProperty("user", DB_USER)
prop.setProperty("password", DB_PASSWORD)
prop.setProperty("driver", "org.postgresql.Driver")



salesReportsDf.write
              .mode(SaveMode.Append)
              .jdbc(s"jdbc:postgresql://$DB_HOST:$DB_PORT/$DATABASE", REPORTS_TABLE, prop)

Thanks

2
  • Can you add the part of the code is writing to the PostGres? Commented Sep 8, 2016 at 17:46
  • @ThiagoBaldim just posted the code snippet for that, thanks Commented Sep 8, 2016 at 17:53

2 Answers

5

So I figured out the problem. Basically, repartitioning my dataframe increased the database write throughput by 100%.

def srcTable(config: Config): Map[String, String] = {

  val SERVER             = config.getString("db_host")
  val PORT               = config.getInt("db_port")
  val DATABASE           = config.getString("database")
  val USER               = config.getString("db_user")
  val PASSWORD           = config.getString("db_password")
  val TABLE              = config.getString("table")
  val PARTITION_COL      = config.getString("partition_column")
  val LOWER_BOUND        = config.getString("lowerBound")
  val UPPER_BOUND        = config.getString("upperBound")
  val NUM_PARTITION      = config.getString("numPartitions")

  Map(
    "url"     -> s"jdbc:postgresql://$SERVER:$PORT/$DATABASE",
    "driver"  -> "org.postgresql.Driver",
    "dbtable" -> TABLE,
    "user"    -> USER,
    "password"-> PASSWORD,
    "partitionColumn" -> PARTITION_COL,
    "lowerBound" -> LOWER_BOUND,
    "upperBound" -> UPPER_BOUND,
    "numPartitions" -> NUM_PARTITION
  )

}
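For the write side specifically, a minimal sketch of the repartition-before-write idea (reusing the `salesReportsDf`/`prop` names from the question; the partition count of 16 is an illustrative assumption, not from the original post):

```scala
import java.util.Properties
import org.apache.spark.sql.SaveMode

// Connection properties as in the question.
val prop = new Properties()
prop.setProperty("user", DB_USER)
prop.setProperty("password", DB_PASSWORD)
prop.setProperty("driver", "org.postgresql.Driver")

// Each DataFrame partition is written by a separate task over its own
// JDBC connection, so the write parallelism equals the partition count.
// 16 is an assumed value; tune it to your cluster and the DB's
// connection limits.
salesReportsDf
  .repartition(16)
  .write
  .mode(SaveMode.Append)
  .jdbc(s"jdbc:postgresql://$DB_HOST:$DB_PORT/$DATABASE", REPORTS_TABLE, prop)
```

Note that the `partitionColumn`/`lowerBound`/`upperBound` options in the map above only control parallel *reads*; for writes it is the DataFrame's own partitioning that matters.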

4 Comments

could you please provide more details (like the partition counts before and after repartitioning) in your answer? I'm also facing a similar issue; your help in this regard will be greatly appreciated - Thanks
@user2359997 updated my answer, depending on the size of your table - you can specify the number of partitions so each executor can parallelize the ingestion of the data.
@AdetiloyePhilipKehinde lowerBound,upperBound, and partitionColumn shouldn't make any difference for writing to your database. These options "...applies only to reading" according to the docs: spark.apache.org/docs/latest/…
@AdetiloyePhilipKehinde could you specify the whole write query for better understanding that you used to speed up the process?
4

Spark also has an option called "batchsize" when writing via JDBC. The default value is pretty low (1000).

connectionProperties.put("batchsize", "100000")

Setting it to a much higher value should speed up writes to external databases.
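Combining the larger batch size with multiple write partitions, a minimal sketch (the `salesReportsDf`/`DB_*` names are carried over from the question; the batch size of 100000 and partition count of 16 are illustrative values):

```scala
import java.util.Properties
import org.apache.spark.sql.SaveMode

val prop = new Properties()
prop.setProperty("user", DB_USER)
prop.setProperty("password", DB_PASSWORD)
prop.setProperty("driver", "org.postgresql.Driver")
// Rows buffered per JDBC batch INSERT; the default is 1000.
prop.setProperty("batchsize", "100000")

salesReportsDf
  .repartition(16) // parallel write tasks, one JDBC connection each
  .write
  .mode(SaveMode.Append)
  .jdbc(s"jdbc:postgresql://$DB_HOST:$DB_PORT/$DATABASE", REPORTS_TABLE, prop)
```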

Comments
