
I have a Python script that submits Spark jobs using the spark-submit tool. I want to execute the command and write its output both to STDOUT and to a logfile in real time. I'm using Python 2.7 on an Ubuntu server.

This is what I have so far in my SubmitJob.py script:

#!/usr/bin/python
import subprocess

# Submit the command and stream its output to the screen and a log file
def submitJob(cmd, log_file):
    with open(log_file, 'w') as fh:
        process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        while True:
            output = process.stdout.readline()
            if output == '' and process.poll() is not None:
                break
            if output:
                print output.strip()
                fh.write(output)
        rc = process.poll()
        return rc

if __name__ == "__main__":
    cmdList = ["dse", "spark-submit", "--spark-master", "spark://127.0.0.1:7077", "--class", "com.spark.myapp", "./myapp.jar"]
    log_file = "/tmp/out.log"
    exit_status = submitJob(cmdList, log_file)
    print "job finished with status ", exit_status

The strange thing is, when I execute the same command directly in the shell, it works fine and produces output on screen as the program proceeds.

So it looks like something is wrong with the way I'm using subprocess.PIPE for stdout and writing to the file.

What's the currently recommended way to use the subprocess module to write to stdout and a log file in real time, line by line? I see a bunch of options on the internet, but I'm not sure which is correct or current.

Thanks.

  • Your loop could be a bit thinner, but otherwise this should do it. I don't know Spark or what it does with stdout, but that may be the better place to look. I think you should add a spark tag, and probably remove the bash tag. Commented Oct 13, 2016 at 17:50

2 Answers


Figured out what the problem was. I was trying to redirect both stdout and stderr to the pipe so they'd display on screen. This seems to block stdout when stderr is present. If I remove the stderr=subprocess.STDOUT argument from Popen, it works fine. So for spark-submit it looks like you don't need to redirect stderr explicitly, as it already does this implicitly.
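
For reference, a minimal sketch of the working version (the only change from the code in the question is dropping stderr=subprocess.STDOUT, so stderr goes straight to the terminal; the fh.flush() call is an extra assumption to keep the log file current, not something spark-submit requires):

import subprocess

def submitJob(cmd, log_file):
    with open(log_file, 'w') as fh:
        # Only stdout is piped; stderr is inherited and reaches the
        # terminal on its own, which avoids the blocking described above.
        process = subprocess.Popen(cmd, stdout=subprocess.PIPE)
        while True:
            output = process.stdout.readline()
            if output == '' and process.poll() is not None:
                break
            if output:
                print output.strip()
                fh.write(output)
                fh.flush()  # write through so the log keeps pace with the job
        return process.poll()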


2 Comments

Does anyone have an idea whether this is a bug in spark-submit or in the Python subprocess module?
I believe this is because spark-submit redirects a lot of its output to stderr, so printing stdout will not get you the script's actual output.

To print the Spark log, one can call the cmdList given by user330612:

  cmdList = ["spark-submit", "--spark-master", "spark://127.0.0.1:7077", "--class", "com.spark.myapp", "./myapp.jar"]

Then it can be printed using subprocess. Remember to use communicate() to prevent deadlocks; the docs (https://docs.python.org/2/library/subprocess.html) warn that a deadlock can occur "when using stdout=PIPE and/or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use communicate() to avoid that." Note that communicate() waits for the process to exit, so the output appears only after the job finishes. Here below is the code to print the log.

import subprocess

p = subprocess.Popen(cmdList, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = p.communicate()  # blocks until the job exits; avoids pipe deadlocks
stderr = stderr.splitlines()
stdout = stdout.splitlines()
for line in stderr:
    print line  # the Spark log, which spark-submit sends to stderr; write it to a file or elsewhere
for line in stdout:
    print line  # the application's actual output

More information about subprocess and printing lines can be found at https://pymotw.com/2/subprocess/.
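
If you need the log in real time rather than after the job exits, a variation worth sketching (untested, and it assumes, per the comments above, that spark-submit writes its log to stderr) is to read stderr line by line and leave stdout alone; reading a single pipe this way cannot deadlock:

import subprocess

p = subprocess.Popen(cmdList, stderr=subprocess.PIPE)
with open("/tmp/out.log", "w") as fh:
    # spark-submit's log arrives on stderr; the application's own stdout
    # still goes straight to the terminal.
    for line in iter(p.stderr.readline, ''):
        print line.rstrip()  # echo each log line as it arrives
        fh.write(line)       # and mirror it into the log file
rc = p.wait()
print "job finished with status", rc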

