
I am trying to parse websites, use BeautifulSoup to filter out the parts I want, and write the result to a CSV file in HDFS.

Right now I am at the step of filtering the website code with BeautifulSoup.

I want to run it as a MapReduce job with Hadoop streaming:

hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.3.0-mr1-cdh5.0.2.jar \
    -mapper /pytemp/filter.py \
    -input /user/root/py/input/ \
    -output /user/root/py/output40/

The input file contains one key-value pair per line: (key, value) = (url, content)

By content, I mean:

<html><head><title>...</title></head><body>...</body></html>

filter.py file:

#!/usr/bin/env python
#!/usr/bin/python
#coding:utf-8
from bs4 import BeautifulSoup
import sys

for line in sys.stdin:
    line = line.strip()
    key, content = line.split(",")

    #if the following two lines do not exist, the program will execute successfully
    soup = BeautifulSoup(content)
    output = soup.find()         

    print("Start-----------------")
    print("End------------------")

BTW, I don't think I need a reduce.py for this job.

However, I got this error message:

Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

A reply here says it is a memory issue, but my input file is just 3 MB: http://grokbase.com/t/gg/rhadoop/13924fs4as/972-getting-error-pipemapred-waitoutputthreads-while-running-mapreduce-program-for-40mb-of-sizedataset

I have no idea what the problem is. I have searched a lot, but nothing I tried has worked.

My environment is:

  1. CentOS 6
  2. Python 2.7
  3. Cloudera CDH5

I would appreciate any help with this situation.

EDIT on 2016/06/24

First of all, I checked the error log and found that the problem was "too many values to unpack" (thanks also to @kynan's answer).

Here is an example of why it happened:

<font color="#0000FF">
  SomeText1
  <font color="#0000FF">
    SomeText2
  </font>
</font>

If part of the content looks like the above and I call soup.find("font", color="#0000FF") and assign the result to output, two font tags end up assigned to a single output, which is why I got the "too many values to unpack" error.

Solution

Change output = soup.find() to (Var1, Var2, ...) = soup.find_all("font", color="#0000FF", limit=AmountOfVar) and it works well :)
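A related spot where "too many values to unpack" can appear in this mapper is line.split(","), because HTML content often contains commas of its own. The sketch below is not taken from the original post: parse_record is a hypothetical helper, the sample line is made up, and it assumes the URL itself contains no comma. It splits at the first comma only, so the rest of the line stays together as the content.

```python
def parse_record(line):
    """Split one input line into (url, content) at the first comma only."""
    # partition() cuts at the first comma, so later commas stay in the content.
    key, sep, content = line.strip().partition(",")
    if not sep:
        raise ValueError("no comma separator in line: %r" % line[:50])
    return key, content


# Made-up sample record whose content contains extra commas.
sample = "http://example.com/a,<html><body><p>one, two, three</p></body></html>"

# The naive split yields four pieces, so unpacking into two names fails.
try:
    key, content = sample.split(",")
    naive_error = None
except ValueError as exc:
    naive_error = str(exc)

# Splitting at the first comma keeps the whole HTML string together.
key, content = parse_record(sample)
```

The same reasoning applies to unpacking find_all(...) into a fixed tuple of names: it only works when exactly that many tags come back, whereas iterating over the returned list does not depend on the count.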

  • I fixed it! I found a further error log which said "too many values to unpack". Commented Aug 19, 2014 at 7:14
  • I think you should answer this question and accept it. Commented Dec 12, 2014 at 3:33
  • Can you answer the question? Commented Dec 6, 2015 at 15:25
  • @Danny can you explain how you solved it? Commented Jun 22, 2016 at 8:16

1 Answer

This error usually means that the mapper process died. To find out why, check the user logs in $HADOOP_PREFIX/logs/userlogs: there is one directory per job and, inside it, one directory per container. Each container directory contains a file named stderr that holds the output sent to stderr, i.e. the error messages.
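Since the container's stderr file is where the real error ends up, it can help to make the mapper report failures there explicitly. The following is a minimal sketch under stated assumptions, not the asker's actual script: process is a hypothetical stand-in for the BeautifulSoup filtering, and run_mapper is an illustrative name. It catches per-line exceptions, writes the offending line number and the traceback to stderr, and keeps going.

```python
import io
import traceback


def process(line):
    """Hypothetical per-record work; stands in for the real filtering step."""
    key, content = line.split(",", 1)
    return "%s\t%d" % (key, len(content))


def run_mapper(stdin, stdout, stderr):
    """Process every line and report failures on stderr; return failure count."""
    failed = 0
    for lineno, line in enumerate(stdin, 1):
        line = line.strip()
        if not line:
            continue
        try:
            stdout.write(process(line) + "\n")
        except Exception:
            failed += 1
            # These messages land in the container's stderr log file.
            stderr.write("line %d failed: %r\n" % (lineno, line[:80]))
            traceback.print_exc(file=stderr)
    return failed


# In a real mapper: run_mapper(sys.stdin, sys.stdout, sys.stderr)
# Local dry run with in-memory streams and made-up input:
fake_in = io.StringIO(u"http://a,<html>x</html>\nbadline\n")
fake_out, fake_err = io.StringIO(), io.StringIO()
failures = run_mapper(fake_in, fake_out, fake_err)
```

Skipping bad records lets the job finish, which may hide data problems; re-raising after logging is the alternative when a failed record should fail the task.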


2 Comments

Hi. I am having the same issue described above. How do I access the user logs?
For EMR/YARN you can find your logs from the web UI or on the cluster master shell as shown below (your application id will differ; it is printed when the job starts). There is a lot of output, so redirect it into a file and look for Python stack traces.
$ yarn logs -applicationId application_1503951120983_0031 > /tmp/log
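Looking for Python stack traces in a large log dump can be scripted. This is a rough sketch, assuming each traceback starts with the standard "Traceback (most recent call last):" header at the beginning of a line and ends at the first non-indented line after it; extract_tracebacks is a hypothetical helper, the sample log text is invented for illustration, and chained exceptions or log-line prefixes are not handled.

```python
def extract_tracebacks(log_text):
    """Return each Python traceback found in log_text as one string."""
    blocks, current = [], None
    for line in log_text.splitlines():
        if line.startswith("Traceback (most recent call last):"):
            current = [line]
            blocks.append(current)
        elif current is not None:
            current.append(line)
            # The first non-indented line is the exception message; stop there.
            if not line.startswith(" "):
                current = None
    return ["\n".join(block) for block in blocks]


# Invented sample; a real run would read the redirected log file instead, e.g.
#   with open("/tmp/log") as fh: found = extract_tracebacks(fh.read())
sample_log = "\n".join([
    "INFO starting task",
    "Traceback (most recent call last):",
    '  File "filter.py", line 9, in <module>',
    '    key, content = line.split(",")',
    "ValueError: too many values to unpack",
    "INFO task done",
])
found = extract_tracebacks(sample_log)
```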
