
I'm running Spark via PyCharm and via the pyspark shell, respectively. I'm stuck on this error:

: java.lang.OutOfMemoryError: Java heap space
    at org.apache.spark.api.python.PythonRDD$.readRDDFromFile(PythonRDD.scala:416)
    at org.apache.spark.api.python.PythonRDD.readRDDFromFile(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:748)

My code is:

from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
import time

if __name__ == '__main__':

    print("Started at " + time.strftime("%H:%M:%S"))

    conf = (SparkConf()
            .setAppName("TestRdd")
            .set('spark.driver.cores', '1')
            .set('spark.executor.cores', '1')
            .set('spark.driver.memory', '16G')
            .set('spark.executor.memory', '9G'))
    sc = SparkContext(conf=conf)

    rdd = sc.parallelize(range(1000000000),100)

    print(rdd.take(10))

    print("Finished at " + time.strftime("%H:%M:%S"))

These are the maximum memory settings I can set on the cluster. I tried to allocate all the memory to 1 core for creating the RDD, but it seems to me that the application fails before distributing the dataset; I assume it fails at the creation step. I also tried various numbers of partitions, from 100 to 10000. I've calculated how much memory it should take: 1 billion ints is approximately 4.5-4.7 GB in memory, less than I have, but no luck.
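
A rough way to sanity-check that estimate on the driver is a sketch like the one below (assuming Python 2.7, where range returns a real list; the sample size is arbitrary and just gets extrapolated):

import sys

# Measure a small list of distinct ints and extrapolate to 10**9 elements.
sample = range(1000000)
per_element = (sys.getsizeof(sample)
               + sum(sys.getsizeof(x) for x in sample)) / float(len(sample))
print("Estimated size of 10**9 ints: %.1f GB" % (per_element * 10**9 / 1024**3))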

How can I optimize my code and get it to run?


1 Answer


TL;DR Don't use parallelize outside tests and simple experiments. Because you use Python 2.7, range is not lazy, so you'll materialize the full range of values multiple times:

  • A Python list after the call.
  • A serialized version, which is later written to disk.
  • A serialized copy loaded on the JVM.

Using xrange would help, but you shouldn't use parallelize in the first place (or Python 2 in 2018).

If you want to create a series of values, just use SparkContext.range:

range(start, end=None, step=1, numSlices=None)

Create a new RDD of int containing elements from start to end (exclusive), increased by step every element. Can be called the same way as python’s built-in range() function. If called with a single argument, the argument is interpreted as end, and start is set to 0.

so in your case:

rdd = sc.range(1000000000, numSlices=100)
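
Dropped into the original script, that looks roughly like this (a sketch; only the RDD creation changes, and the oversized memory settings are omitted since the driver no longer has to hold the data):

from pyspark import SparkContext, SparkConf
import time

if __name__ == '__main__':

    print("Started at " + time.strftime("%H:%M:%S"))

    conf = SparkConf().setAppName("TestRdd")
    sc = SparkContext(conf=conf)

    # The values are generated on the executors; nothing large leaves the driver.
    rdd = sc.range(1000000000, numSlices=100)

    print(rdd.take(10))

    print("Finished at " + time.strftime("%H:%M:%S"))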

With DataFrame:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1000000000, numPartitions=100)
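
If you need a plain RDD of Python ints rather than a DataFrame of Rows, you can map out the single id column that spark.range produces (a short sketch):

rdd = df.rdd.map(lambda row: row.id)
print(rdd.take(10))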

1 Comment

sc.range needs to store only the beginning and end of each partition.
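
A rough hand-rolled equivalent of that idea (a sketch, assuming Python 2.7 and an existing SparkContext named sc): ship only the partition indices and let each executor expand its own chunk lazily with xrange.

n, parts = 1000000000, 100
chunk = n // parts

# Only 100 small ints leave the driver; each partition generates its values lazily.
rdd = (sc.parallelize(range(parts), parts)
         .flatMap(lambda i: xrange(i * chunk, (i + 1) * chunk)))

print(rdd.take(10))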
