python - PipeMapRed.waitOutputThreads(): subprocess failed with code 1

Question

I want to parse websites, use BeautifulSoup to filter out the parts I need, and write the result to a CSV file in HDFS.

Right now I am at the step of filtering the website code with BeautifulSoup.

I want to run it as a MapReduce job:

hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.3.0-mr1-cdh5.0.2.jar \
    -mapper /pytemp/filter.py \
    -input /user/root/py/input/ \
    -output /user/root/py/output40/

The input file holds one key-value pair per line: (key, value) = (url, content)

By content, I mean:

<html><head><title>...</title></head><body>...</body></html>

filter.py file:

#!/usr/bin/env python
#!/usr/bin/python
#coding:utf-8
from bs4 import BeautifulSoup
import sys

for line in sys.stdin:
    line = line.strip()
    key, content = line.split(",")

    #if the following two lines do not exist, the program will execute successfully
    soup = BeautifulSoup(content)
    output = soup.find()         

    print("Start-----------------")
    print("End------------------")

BTW, I do not think I need a reduce.py for this job.

However, I got this error message:

Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

A reply here says it is a memory issue, but my input file is only 3MB: http://grokbase.com/t/gg/rhadoop/13924fs4as/972-getting-error-pipemapred-waitoutputthreads-while-running-mapreduce-program-for-40mb-of-sizedataset

I have no idea what is causing the problem. I have searched a lot, but it still does not work.

My environment is:

CentOS6
Python2.7
Cloudera CDH5

I would appreciate any help with this situation.

EDIT on 2016/06/24

First of all, I checked the error log and found that the problem was "too many values to unpack" (thanks also to @kynan's answer).

Here is an example of why it happened:

<font color="#0000FF">
  SomeText1
  <font color="#0000FF">
    SomeText2
  </font>
</font>

If part of the content looks like the above and I call soup.find("font", color="#0000FF") and assign the result to output, two font elements end up assigned to a single output, which is why I got the "too many values to unpack" error.

Solution

Just change output = soup.find() to (Var1, Var2, ...) = soup.find_all("font", color="#0000FF", limit=AmountOfVar) and it works well :)
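
For reference, this kind of ValueError can be reproduced without Hadoop or BeautifulSoup. The sketch below rests on an assumption the question does not confirm: that the HTML content itself may contain commas, in which case the line.split(",") call in filter.py would raise the same error. The helper names split_record and naive_split and the sample line are made up for illustration.

```python
def naive_split(line):
    """Mirror the unpacking used in filter.py; fails if content has commas."""
    key, content = line.strip().split(",")
    return key, content


def split_record(line):
    """Split one 'url,content' line on the first comma only.

    Hypothetical helper: assumes the key (URL) itself contains no comma.
    """
    key, content = line.strip().split(",", 1)
    return key, content


# Made-up sample record whose content contains a comma.
sample = "http://example.com,<html><body><p>a, b</p></body></html>\n"

try:
    naive_split(sample)
    naive_error = None
except ValueError as exc:
    naive_error = str(exc)

key, content = split_record(sample)
```

Splitting with a maximum of one split keeps everything after the first comma together as the content, so the two-name unpacking always receives exactly two values when the line contains at least one comma.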

I fixed it! I found a further error log which said "too many values to unpack".
– Danny, Commented Aug 19, 2014 at 7:14

kynan · Accepted Answer · 2015-12-31 17:10:02Z

2

This error usually means that the mapper process died. To find out why, check the user logs in $HADOOP_PREFIX/logs/userlogs: there is one directory per job and, inside it, one directory per container. Each container directory contains a file named stderr, which holds the output sent to stderr, i.e. the error messages.
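
Since the container's stderr file is where a Python traceback ends up, one option is to have the mapper report the failing record there itself. This is a minimal sketch, assuming the same "url,content" line format as the question; run_mapper and process_line are hypothetical names, and skipping bad records rather than exiting is a design choice, not something the question prescribes.

```python
import io
import traceback


def process_line(line):
    """Hypothetical per-record work: split key and content, emit a summary."""
    key, content = line.strip().split(",", 1)
    return "%s\t%d" % (key, len(content))


def run_mapper(stdin, stdout, stderr):
    """Process each input line; log failures to stderr and keep going."""
    failures = 0
    for number, line in enumerate(stdin, 1):
        try:
            stdout.write(process_line(line) + "\n")
        except Exception:
            failures += 1
            # The offending record and the traceback both go to stderr,
            # which is the stream captured in the task's stderr file.
            stderr.write("record %d failed: %r\n" % (number, line[:80]))
            traceback.print_exc(file=stderr)
    return failures


# Local dry run with in-memory streams; the second record has no comma.
demo_in = io.StringIO(u"http://a,<p>x</p>\nno-comma-here\n")
demo_out, demo_err = io.StringIO(), io.StringIO()
failed = run_mapper(demo_in, demo_out, demo_err)
```

In an actual streaming mapper the call would be run_mapper(sys.stdin, sys.stdout, sys.stderr). Running the same function locally on a few input lines first often surfaces the exception before the job is ever submitted.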


2 Comments

chucknor Over a year ago

Hi. I am having the same issue described above. How do I access the user logs?

gae123 Over a year ago

For EMR/YARN you can find your logs from the web UI or on the cluster master shell as shown below (your application ID will differ; it is printed when the job starts). There is a lot of output, so redirect it into a file as shown and look for Python stack traces.

$ yarn logs -applicationId application_1503951120983_0031 > /tmp/log
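
Once the log output has been saved to a file, the Python stack traces can be pulled out with a short script instead of reading everything by eye. This is a rough sketch, assuming standard CPython traceback formatting (a "Traceback (most recent call last):" line, indented frame lines, then one unindented exception line) and no extra prefix on those lines; extract_tracebacks is a hypothetical name and the sample log lines are made up.

```python
def extract_tracebacks(lines):
    """Return the traceback blocks found in an iterable of log lines."""
    blocks, current = [], None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("Traceback (most recent call last):"):
            current = [line]
        elif current is not None:
            current.append(line)
            # Frame lines are indented; the first unindented line is the
            # exception message that closes the block.
            if not line.startswith(" "):
                blocks.append("\n".join(current))
                current = None
    return blocks


# Made-up sample of what a saved log might contain.
sample_log = [
    "INFO some hadoop output\n",
    "Traceback (most recent call last):\n",
    '  File "filter.py", line 9, in <module>\n',
    '    key, content = line.split(",")\n',
    "ValueError: too many values to unpack\n",
    "INFO more output\n",
]
found = extract_tracebacks(sample_log)
```

To run it against a real file, open the file saved by the redirect and pass the file object to extract_tracebacks, then print each returned block.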
