
I run my Python k-means program on Spark with the following command:

./bin/spark-submit --master spark://master_ip:7077 my_kmeans.py

The main part of the Python k-means program looks like this:

import joblib as jl
from pyspark.sql import SparkSession
from pyspark.mllib.clustering import KMeans

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
# data: load the array and distribute it as an RDD
X = jl.load('X.jl.z')
data_x = sc.parallelize(X)
# kmeans: train a model with 10000 clusters
model = KMeans.train(data_x, 10000, maxIterations=5)

The file 'X.jl.z' is about 100 MB.

But I get this Spark error:

  File "/home/xxx/tmp/spark-2.0.2-bin-hadoop2.7/my_kmeans.py", line 24, in <module>
    data_x = sc.parallelize(X)
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.readRDDFromFile.
  : java.lang.OutOfMemoryError: Java heap space

I know how to modify the JVM heap size for a Java program, but how can I increase the heap size for my Python program?

1 Answer

Try increasing the number of partitions:

data_x = sc.parallelize(X, n)
# n = 2-4 partitions for each CPU in your cluster
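As a minimal sketch of picking n, assuming `sc` is already created and that `sc.defaultParallelism` reflects the total cores available to your cluster (the factor of 3 is just one point in the suggested 2-4 range):

# rough heuristic: about 3 partitions per available core;
# sc.defaultParallelism is typically the total core count
n = sc.defaultParallelism * 3
data_x = sc.parallelize(X, n)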

or:

Maximum heap size settings can be set with spark.driver.memory in cluster mode and through the --driver-memory command-line option in client mode.
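For example, taking the submit command from the question, the driver heap could be raised like this (the 4g value is illustrative; size it to your data):

./bin/spark-submit --master spark://master_ip:7077 \
  --driver-memory 4g \
  my_kmeans.py

This matters here because sc.parallelize ships the whole local array through the driver JVM, so the driver heap, not the executors, is what overflows.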


1 Comment

What if it is running on a local computer?
