
I am having an issue running a simple Python script inside Hadoop streaming. I have tried all the suggestions from previous posts with a similar error, but I am still having the issue:

  1. added #!/usr/bin/env python as the first line of both scripts
  2. ran chmod a+x on the mapper and reducer Python files
  3. put quotes around the command: -mapper "python mapper.py -n 1 -r 0.4"

I have run the code outside Hadoop and it worked well.

UPDATE: I ran the code outside of Hadoop streaming using the following command:

cat file | python mapper.py -n 5 -r 0.4 | sort | python reducer.py -f 3618

This works fine. But now I need to run it through Hadoop streaming:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D mapreduce.job.reduces=5  \
-files lr \
-mapper "python lr/mapper.py -n 5 -r 0.4"  \
-reducer "python lr/reducer.py -f 3618"  \
-input training \
-output models 

The Hadoop streaming job is the one that fails. I looked at the logs and did not see anything that told me why it is happening.
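
One way to surface the real error in streaming jobs is to route the Python traceback to stderr, which Hadoop keeps in each task attempt's log. This is only a generic sketch (the identity `mapper` body is a placeholder, not my real code):

```python
import io
import sys
import traceback

def mapper(lines, out):
    # Placeholder body: a real mapper would transform each line here.
    for line in lines:
        out.write(line)

def run(stdin, stdout, stderr):
    """Run the mapper; on any exception, print the traceback to stderr
    (visible in the task attempt's log) and return a nonzero status."""
    try:
        mapper(stdin, stdout)
        return 0
    except Exception:
        traceback.print_exc(file=stderr)
        return 2

# Demo on in-memory streams instead of sys.stdin/sys.stdout:
out, err = io.StringIO(), io.StringIO()
status = run(["a\n", "b\n"], out, err)
print(status, out.getvalue())
```

In the real script the last lines would be `sys.exit(run(sys.stdin, sys.stdout, sys.stderr))`.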

I have the following mapper.py:

#!/usr/bin/env python

import sys
import random

from optparse import OptionParser

parser = OptionParser()
parser.add_option("-n", "--model-num", action="store", dest="n_model",
                  help="number of models to train", type="int")
parser.add_option("-r", "--sample-ratio", action="store", dest="ratio",
                  help="ratio to sample for each ensemble", type="float")

options, args = parser.parse_args(sys.argv)

random.seed(8803)
r = options.ratio
for line in sys.stdin:
    # TODO
    # Note: The following lines are only there to help 
    #       you get started (and to have a 'runnable' program). 
    #       You may need to change some or all of the lines below.
    #       Follow the pseudocode given in the PDF.
    key = random.randint(0, options.n_model-1)
    value = line.strip()
    for i in range(1, options.n_model+1):
        m = random.random()
        if m < r:
            print "%d\t%s" % (i, value)
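
For context, the sampling the mapper does boils down to this standalone rehearsal (the `sample_models` name is mine, only for illustration):

```python
import random

def sample_models(value, n_model, ratio, rng):
    # Each of the n_model ensemble members independently keeps this
    # record with probability `ratio`, mirroring the loop in mapper.py.
    return [(i, value) for i in range(1, n_model + 1) if rng.random() < ratio]

rng = random.Random(8803)  # same fixed seed the mapper uses
pairs = sample_models("1 2:0.5 7:1.0", n_model=5, ratio=0.4, rng=rng)
for key, value in pairs:
    print("%d\t%s" % (key, value))  # tab-separated key/value, as streaming expects
```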

and my reducer.py:

#!/usr/bin/env python
import sys
import pickle
from optparse import OptionParser
from lrsgd import LogisticRegressionSGD
from utils import parse_svm_light_line

parser = OptionParser()
parser.add_option("-e", "--eta", action="store", dest="eta",
                  default=0.01, help="step size", type="float")
parser.add_option("-c", "--Regularization-Constant", action="store", dest="C",
                  default=0.0, help="regularization strength", type="float")
parser.add_option("-f", "--feature-num", action="store", dest="n_feature",
                  help="number of features", type="int")
options, args = parser.parse_args(sys.argv)

classifier = LogisticRegressionSGD(options.eta, options.C, options.n_feature)

for line in sys.stdin:
    key, value = line.split("\t", 1)
    value = value.strip()
    X, y = parse_svm_light_line(value)
    classifier.fit(X, y)

pickle.dump(classifier, sys.stdout)
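
(Aside on that last line: pickling straight to sys.stdout is fine under Python 2, but if the cluster's `python` happens to resolve to Python 3, `pickle.dump` to a text stream raises TypeError. A version-safe variant, my own sketch and not part of the assignment code:)

```python
import io
import pickle
import sys

def dump_model(obj, stream=None):
    # Pickle needs a binary stream; on Python 3 that is sys.stdout.buffer,
    # while on Python 2 sys.stdout itself is already byte-oriented.
    if stream is None:
        stream = getattr(sys.stdout, "buffer", sys.stdout)
    pickle.dump(obj, stream)

# Round-trip check against an in-memory binary stream:
buf = io.BytesIO()
dump_model({"weights": [0.1, 0.2]}, buf)
restored = pickle.loads(buf.getvalue())
print(restored)
```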

When I run the code outside Hadoop it runs fine, but when I run it inside Hadoop streaming it gives me the error:

17/02/07 07:44:34 INFO mapreduce.Job: Task Id : attempt_1486438814591_0038_m_000001_2, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
  • Is this a pseudo-distributed setup? If not, do you have the module lrsgd installed on all nodes? Also post the command you are using to submit the job. Commented Feb 7, 2017 at 8:26
  • @franklinsijo I updated the post to show how I submit this. Commented Feb 10, 2017 at 0:52
  • @franklinsijo, where do I check the number of datanodes of the test cluster I am running? Commented Feb 10, 2017 at 5:01
  • hdfs dfsadmin -report should give you the details of live Datanodes. Commented Feb 10, 2017 at 5:02
  • @franklinsijo I only have 1 datanode and it is alive in the report, so I am confused about where the Python issue is. To add, I already have lrsgd.py, mapper.py, and reducer.py all with permissions 775 too. Commented Feb 11, 2017 at 20:10

4 Answers


Use the answer by Harishanker in the post - How to resolve java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2?

Make sure that both the mapper and the reducer files are executable using chmod (e.g. chmod 744 mapper.py).

Then write the streaming command like this:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D mapreduce.job.reduces=5  \
-files lr \
-mapper lr/mapper.py -n 5 -r 0.4  \
-reducer lr/reducer.py -f 3618  \
-input training \
-output models 

Now it should work. Please comment if it doesn't.
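
For what it's worth, the exit status in the error is itself a hint: optparse calls sys.exit(2) when it cannot parse the command line, so if the -n/-r options never reach the script intact you get exactly "subprocess failed with code 2". A standalone check (my own snippet, not taken from the job):

```python
import subprocess
import sys

# A tiny optparse script fed an option it does not know about:
script = (
    "from optparse import OptionParser\n"
    "p = OptionParser()\n"
    "p.add_option('-n', dest='n', type='int')\n"
    "p.parse_args(['-x'])\n"  # unknown option -> parser.error() -> sys.exit(2)
)
rc = subprocess.run([sys.executable, "-c", script],
                    capture_output=True).returncode
print(rc)  # → 2
```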



hadoop jar /home/maria_dev/hadoop-streaming-2.7.3.jar \
-file ./mapper.py -mapper 'python mapper.py' \
-file ./reducer.py -reducer 'python reducer.py' \
-input /user/maria_dev/wordcount/worddata.txt \
-output /user/maria_dev/output 

This works for me. Make sure each file path is correct. Originally I forgot to specify -file for both Python scripts, and it did not work.

1 Comment

This works for me with Hadoop 3.3.6.

For complete noobs (like me!), make sure you have this as the first line of your .py files:

#!/usr/bin/env python

It's not just a comment, so don't accidentally delete it!
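
To see why: when Hadoop launches the file directly, the kernel reads that first line to find the interpreter. A quick demonstration (the paths are temporary files created just for the test, and I use sys.executable in the shebang so the demo does not depend on a `python` binary being on PATH):

```python
import os
import subprocess
import sys
import tempfile

def make_script(text):
    # Write an executable script to a temp file and return its path.
    f = tempfile.NamedTemporaryFile("w", suffix=".py", delete=False)
    f.write(text)
    f.close()
    os.chmod(f.name, 0o755)
    return f.name

no_shebang = make_script("print('hello')\n")
with_shebang = make_script("#!%s\nprint('hello')\n" % sys.executable)

try:
    subprocess.run([no_shebang], capture_output=True)
    ran_without = True
except OSError:
    ran_without = False  # "Exec format error": the kernel can't tell it's Python

result = subprocess.run([with_shebang], capture_output=True, text=True)
print(ran_without, result.stdout.strip())
```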



The problem is that Hadoop needs to know the file is a Python executable. I added #!/usr/bin/env python at the beginning of both files, i.e. mapper.py and reducer.py, and it works!

