
I need to convert a Spark DataFrame (large dataset) into a pandas DataFrame.

Code: spark_df = Example_df.toPandas()

I am getting this error:

/databricks/spark/python/pyspark/sql/pandas/conversion.py:145: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the error below and can not continue. Note that 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect on failures in the middle of computation.

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 30 tasks (31.0 GiB) is bigger than local result size limit 30.0 GiB, to address it, set spark.driver.maxResultSize bigger than your dataset result size.

1 Answer

TL;DR: The error is clear: you need to set spark.driver.maxResultSize to something bigger than 31 GiB.

Longer answer: when you call toPandas, you're asking Spark to have every executor send its data back to a single driver, so the driver's result-size limit (and its memory) has to be big enough to hold all of that data.
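As a minimal sketch of one way to do this: spark.driver.maxResultSize is a static setting, so it has to be in place before the driver starts. On Databricks that usually means the cluster's Spark config (or --conf spark.driver.maxResultSize=40g with spark-submit); the builder-based example below assumes you control session creation, and the 40g value is just an illustrative size larger than the ~31 GiB reported in the error.

from pyspark.sql import SparkSession

# Raise the cap on total serialized results that can be collected to the driver.
# Note: this is a static property; calling getOrCreate() against an already
# running session (as on a Databricks cluster) will NOT apply it retroactively.
spark = (
    SparkSession.builder
    .config("spark.driver.maxResultSize", "40g")  # must exceed the result size from the error
    .getOrCreate()
)

# Example_df is the DataFrame from the question; with the higher limit (and
# enough driver memory, see spark.driver.memory) toPandas can now collect it.
pandas_df = Example_df.toPandas()

Keep in mind that raising the limit only moves the ceiling: the driver node still has to physically hold the whole dataset in memory, so for very large data it's often better to aggregate, sample, or filter in Spark before converting to pandas.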
