
So I'm building a Python extension for another application. I want to use pyspark in my extension to do some streaming, but I'm having trouble since the parent application calls my extension with a plain old Python interpreter.

I will not be able to change how the parent application calls my extension, so how can I launch pyspark or spark-submit from within my python code?

I actually haven't written my pyspark code yet. I want to get the SparkContext up and running first. But for this question, let's use the word count example from the Spark website:

import sys
from operator import add

from pyspark import SparkContext


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: wordcount <file>", file=sys.stderr)
        sys.exit(-1)
    sc = SparkContext(appName="PythonWordCount")
    lines = sc.textFile(sys.argv[1], 1)
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)
    output = counts.collect()
    for (word, count) in output:
        print("%s: %i" % (word, count))

    sc.stop()

How can I run this application from another python program?

1 Answer

Okay, so I finally figured it out. This seems to work for me:

import sys, os

# Make the PySpark libraries importable by adding them to the module search path
sys.path.append(os.environ['SPARK_HOME'] + '/python')
sys.path.append(os.environ['SPARK_HOME'] + '/python/build')
sys.path.append(os.environ['SPARK_HOME'] + '/python/pyspark')

# This part was necessary for me, since I have a weird setup
sys.path.insert(0, '/usr/bin')

import pyspark
sc = pyspark.SparkContext()
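Once pyspark imports cleanly like this, the word count logic from the question can run in the same process against the sc created above.

If you would rather keep Spark in a separate process instead (the question also mentions spark-submit), a minimal sketch is to shell out to the spark-submit binary under SPARK_HOME using the standard subprocess module. This assumes Python 3.7+ for capture_output, and the script name wordcount.py and input path below are hypothetical placeholders:

import os
import subprocess

# Hypothetical paths: wordcount.py is the question's script saved to disk,
# and input.txt is the file whose words you want to count.
spark_submit = os.path.join(os.environ['SPARK_HOME'], 'bin', 'spark-submit')
result = subprocess.run([spark_submit, 'wordcount.py', 'input.txt'],
                        capture_output=True, text=True)
print(result.stdout)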