I run my Python k-means program on Spark with the following command:
./bin/spark-submit --master spark://master_ip:7077 my_kmeans.py
The main part of the Python k-means program looks like this:
from pyspark.sql import SparkSession
from pyspark.mllib.clustering import KMeans
import joblib as jl  # assuming jl is joblib, given jl.load and the .jl.z dump

spark = SparkSession.builder.appName("my_kmeans").getOrCreate()
sc = spark.sparkContext
# data: load the whole feature matrix into the driver process
X = jl.load('X.jl.z')
data_x = sc.parallelize(X)
# kmeans: 10000 clusters, 5 iterations
model = KMeans.train(data_x, 10000, maxIterations=5)
The file 'X.jl.z' is about 100 MB.
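For reference, parallelize does accept a numSlices argument, e.g. (the value 100 is an arbitrary guess):

# same load as above, but with an explicit partition count
data_x = sc.parallelize(X, numSlices=100)

But as far as I can tell that only changes how the RDD is split, not how much data is pushed through the driver JVM in the first place.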
But Spark fails with this error:
File "/home/xxx/tmp/spark-2.0.2-bin-hadoop2.7/my_kmeans.py", line 24, in <module>
data_x = sc.parallelize(X)
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.readRDDFromFile.
: java.lang.OutOfMemoryError: Java heap space
I know how to modify the JVM heap size for a Java program, but how can I increase the heap size for my Python program?
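For a standalone Java program I would just pass heap options straight to the JVM, something like this (the jar and class names and the 4g value are only placeholders):

java -Xmx4g -cp my_app.jar SomeMainClass

But spark-submit starts the JVM for me, so I don't see where the equivalent setting goes when the job itself is written in Python.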