
I have a Docker Spark cluster running on my laptop, with a master and three workers. I can launch the typical WordCount example by pointing spark-submit at the master's address, running a command like this from inside the master container:

bash-4.3# spark/bin/spark-submit --class com.oreilly.learningsparkexamples.mini.scala.WordCount --master spark://spark-master:7077 /opt/spark-apps/learning-spark-mini-example_2.11-0.0.1.jar /opt/spark-data/README.md /opt/spark-data/output-5

I can see that the output files are generated inside output-5.

But when I try to launch the process from outside the cluster, using this command:

docker run --network docker-spark-cluster_spark-network -v /tmp/spark-apps:/opt/spark-apps --env SPARK_APPLICATION_JAR_LOCATION=$SPARK_APPLICATION_JAR_LOCATION --env SPARK_APPLICATION_MAIN_CLASS=$SPARK_APPLICATION_MAIN_CLASS -e APP_ARGS="/opt/spark-data/README.md /opt/spark-data/output-5" spark-submit:2.4.0

Where

echo $SPARK_APPLICATION_JAR_LOCATION
/opt/spark-apps/learning-spark-mini-example_2.11-0.0.1.jar

echo $SPARK_APPLICATION_MAIN_CLASS
com.oreilly.learningsparkexamples.mini.scala.WordCount

And when I open the web UI page of the worker where the task is attempted, I can see an error like this at line 11 of WordCount.scala, the very first line, where the path of the first argument is read:

Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
    at com.oreilly.learningsparkexamples.mini.scala.WordCount$.main(WordCount.scala:11)

Clearly, position zero of the arguments array does not contain the path of the first parameter, the input file I want to run the word count on.
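For reference, the main method of the mini WordCount example presumably reads both paths straight from args, roughly like the sketch below (the exact source may differ slightly); with an empty argument list, args(0) is the first call to fail:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // With no application arguments, args is empty and this first line
    // throws java.lang.ArrayIndexOutOfBoundsException: 0, matching the
    // stack trace above.
    val inputFile = args(0) // e.g. /opt/spark-data/README.md
    val outputDir = args(1) // e.g. /opt/spark-data/output-5

    val sc = new SparkContext(new SparkConf().setAppName("wordCount"))
    val counts = sc.textFile(inputFile)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile(outputDir)
  }
}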

The question is: why is Docker not using the arguments passed through -e APP_ARGS="/opt/spark-data/README.md /opt/spark-data/output-5"?

I have already run the job the traditional way, logging into the driver (spark-master) and running the spark-submit command there; it is only when I try to run the task with docker run that it doesn't work.

It must be something trivial, but I still don't have a clue. Can anybody help me?

SOLVED

I had to use a command like this:

docker run --network docker-spark-cluster_spark-network -v /tmp/spark-apps:/opt/spark-apps --env SPARK_APPLICATION_JAR_LOCATION=$SPARK_APPLICATION_JAR_LOCATION --env SPARK_APPLICATION_MAIN_CLASS=$SPARK_APPLICATION_MAIN_CLASS --env SPARK_APPLICATION_ARGS="/opt/spark-data/README.md /opt/spark-data/output-6" spark-submit:2.4.0

Summing up, I had to change -e APP_ARGS to --env SPARK_APPLICATION_ARGS.

-e APP_ARGS is just the generic Docker way of passing an environment variable; the spark-submit:2.4.0 image's submit script apparently only reads SPARK_APPLICATION_ARGS when building the spark-submit command, so APP_ARGS was silently ignored and the application received an empty argument list.

1 Answer


This is the command that solves my problem:

docker run --network docker-spark-cluster_spark-network -v /tmp/spark-apps:/opt/spark-apps --env SPARK_APPLICATION_JAR_LOCATION=$SPARK_APPLICATION_JAR_LOCATION --env SPARK_APPLICATION_MAIN_CLASS=$SPARK_APPLICATION_MAIN_CLASS --env SPARK_APPLICATION_ARGS="/opt/spark-data/README.md /opt/spark-data/output-6" spark-submit:2.4.0

I had to use --env SPARK_APPLICATION_ARGS="args1 args2 argsN" instead of -e APP_ARGS="args1 args2 argsN".
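As a side note (not part of the original example), a small guard at the top of main would have turned this misconfiguration into a readable usage message instead of an ArrayIndexOutOfBoundsException. A minimal sketch:

object WordCount {
  def main(args: Array[String]): Unit = {
    // Fail fast with a clear message if the submit wrapper passes
    // no application arguments.
    if (args.length < 2) {
      System.err.println("Usage: WordCount <inputFile> <outputDir>")
      sys.exit(1)
    }
    val Array(inputFile, outputDir) = args.take(2)
    // ... rest of the job unchanged ...
  }
}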
