So I'm building a Python extension for another application. I want to use pyspark in my extension to do some streaming, but I'm having trouble because the parent application calls my extension with plain old Python.
I will not be able to change how the parent application calls my extension, so how can I launch pyspark or spark-submit from within my python code?
I actually haven't written my pyspark code yet; I want to get the SparkContext up and running first. For this question, let's use the word count example from the Spark website:
from __future__ import print_function

import sys
from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: wordcount <file>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonWordCount")
    lines = sc.textFile(sys.argv[1], 1)
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)
    output = counts.collect()
    for (word, count) in output:
        print("%s: %i" % (word, count))
    sc.stop()
How can I run this application from another Python program?
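The only idea I've got so far is to shell out to spark-submit with subprocess, roughly like the sketch below (wordcount.py, input.txt, and the local[*] master are just placeholders I made up), but I don't know whether that's the right approach or whether I can create the SparkContext directly in-process:

import subprocess

# Rough sketch: launch spark-submit as an external process from my extension.
# "wordcount.py" and "input.txt" are placeholder paths; local[*] is a guess at the master URL.
exit_code = subprocess.call(
    ["spark-submit", "--master", "local[*]", "wordcount.py", "input.txt"]
)
print("spark-submit exited with code", exit_code)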