I wrote a simple web crawler using Mechanize, and now I'm stuck on how to get the next page recursively. The code is below.

def self.generate_page  # generate a Mechanize page object for the first page
    agent = Mechanize.new
    url = "http://www.baidu.com/s?wd=intitle:#{URI.encode(WORD)}%20site:sina.com.cn&rn=50&gpc=stf#{URI.encode(TIME)}"
    page = agent.get(url)
    page
end

def self.next_page(n_page)  # get the next page recursively by clicking the "next" link shown on each page
    puts n_page
# If I don't use puts, I get nothing; when using puts, I get
#<Mechanize::Page:0x007fd341c70fd0>
#<Mechanize::Page:0x007fd342f2ce08>
#<Mechanize::Page:0x007fd341d0cf70>
#<Mechanize::Page:0x007fd3424ff5c0>
#<Mechanize::Page:0x007fd341e1f660>
#<Mechanize::Page:0x007fd3425ec618>
#<Mechanize::Page:0x007fd3433f3e28>
#<Mechanize::Page:0x007fd3433a2410>
#<Mechanize::Page:0x007fd342446ca0>
#<Mechanize::Page:0x007fd343462490>
#<Mechanize::Page:0x007fd341c2fe18>
#<Mechanize::Page:0x007fd342d18040>
#<Mechanize::Page:0x007fd3432c76a8>  
# which are the results I want

    np = Mechanize.new.click(n_page.link_with(:text=>/next/)) unless n_page.link_with(:text=>/next/).nil?
    result = next_page(np) unless np.nil?
    result    # here the value is empty, I don't know what is wrong
end

def self.get_page  # trying to pass on the result of the next_page() method
    puts next_page(generate_page)
    # it seems the result is never passed here
end

I followed these two links, "What is recursion and how does it work?" and "Ruby recursive function", but I still can't figure out what's wrong. I hope someone can help me out. Thanks.

1 Answer


There are a few issues with your code:

  1. You shouldn't be calling Mechanize.new more than once. Every new agent starts with empty cookies and history, so create one agent and reuse it.
  2. From a stylistic perspective, you are doing too many nil checks.

Unless you have a preference for recursion, it'll probably be easier to do it iteratively.
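If you do want to keep the recursion, the reason next_page in the question comes back empty is that the last call finds no "next" link, so np is nil, result is never assigned, and nil is returned back up through every level. A rough sketch of a recursive version that carries the collected pages along is below. FakeLink, FakePage and FakeAgent are made-up stand-ins (not part of Mechanize) so the example runs without a network; with real Mechanize you would pass your agent and a real page instead. collect_next_pages is also just a name I picked.

```ruby
# Hypothetical stand-ins for Mechanize links, pages and the agent,
# so the sketch can run offline. They only mimic link_with and click.
FakeLink = Struct.new(:target)
FakePage = Struct.new(:name, :next_link) do
  def link_with(_criteria)
    next_link
  end
end

class FakeAgent
  def click(link)
    link.target
  end
end

# Recursively collect every page reachable through "next" links.
# seen carries the pages found so far down the call chain.
def collect_next_pages(agent, page, seen = [])
  link = page.link_with(:text => /next/)
  return seen if link.nil?   # base case: hand back the accumulated pages, not nil
  np = agent.click(link)
  collect_next_pages(agent, np, seen + [np])
end

page3 = FakePage.new("page 3", nil)
page2 = FakePage.new("page 2", FakeLink.new(page3))
page1 = FakePage.new("page 1", FakeLink.new(page2))

names = collect_next_pages(FakeAgent.new, page1).map(&:name)
puts names.inspect
```

The important part is the base case: it returns the array built so far, so every caller above it gets a real value back.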

To have your next_page method return an array containing every page in the chain, you could write this:

require 'mechanize'

# store a single Mechanize agent in a constant so every request shares it
Agent = Mechanize.new

# a helper method to DRY up the code
# returns the next page, or nil when the page has no "next" link
def click_to_next_page(page)
  link = page.link_with(:text=>/next/)
  link && Agent.click(link)
end

# repeatedly visits the next page until none exists
# returns all pages seen after n_page as an array
def get_all_next_pages(n_page)
   results = []
   np = click_to_next_page(n_page)
   np && results.push(np)
   until !np
     np = click_to_next_page(np)
     np && results.push(np)
   end
   results
end

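# testing it out (I haven't actually run this)
# WORD and TIME are the constants from the question
# note: URI.encode was removed in Ruby 3.0; URI.encode_www_form_component is one possible substitute there
base_url = "http://www.baidu.com/s?wd=intitle:#{URI.encode(WORD)}%20site:sina.com.cn&rn=50&gpc=stf#{URI.encode(TIME)}"
root_page = Agent.get(base_url)
next_pages = get_all_next_pages(root_page)
puts next_pages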
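Calling puts on the array only prints lines like #<Mechanize::Page:0x...>. If you want something more readable, you could map each page to its uri and title, which are both readers on Mechanize::Page. A small sketch, assuming that is all you need: StubPage is a hypothetical stand-in so it runs without fetching anything, and summarize is a name I made up.

```ruby
require 'uri'

# Hypothetical stand-in exposing the two Mechanize::Page readers used below.
StubPage = Struct.new(:uri, :title)

# Turn each page into a "url title" line instead of an object id.
def summarize(pages)
  pages.map { |page| "#{page.uri} #{page.title}" }
end

pages = [
  StubPage.new(URI("http://example.com/s?pn=50"), "Results 2"),
  StubPage.new(URI("http://example.com/s?pn=100"), "Results 3")
]

summary = summarize(pages)
puts summary
```

With the real array you would call summarize(next_pages) instead.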

1 Comment

Thanks! At first I was going to do it iteratively, but I failed at these lines: results.push(np) / until !np / np = click_to_next_page(np) / np && results.push(np). Your code really helped me a lot!
