ruby fetching url content is always empty

Question

I am frustrated trying to use Ruby to fetch the content of a specific URL.

I've tried several approaches, such as open-uri and a standard Net::HTTP request; none has worked so far. I always get empty HTML. I also tried fetching the same URL with Python, which always returned the correct HTML content. I am really not sure why. Please help, as I am a newbie to both Ruby and Python. I want to use Ruby (I prefer the tidy syntax and human-friendly function names, and installing libraries with gem and Homebrew on a Mac is easier than Python's easy_install), but I am now considering Python because it just works (though I am still trying to get my head around the 2.x and 3.x issue). I may be doing something really stupid, but I think that is very unlikely.

ruby 1.9.2p136 (2010-12-25 revision 30365) [i386-darwin10.6.0]

Implementation 1:

url = URI.parse('http//:www.stackoverflow.com/') req = Net::HTTP::Get.new(url.path)
res = Net::HTTP.start(url.host, url.port) {|http|   http.request(req) }    
puts res.body #empty

Implementation 2:

doc = Nokogiri::HTML(open("http//:www.stackoverflow.com/", "User-Agent" => "Safari"))
#empty
#I tried to use without user agent, without Nokogiri none worked.

Python Implementation which worked every time perfectly

f = urllib.urlopen("http//:www.stackoverflow.com/")
# Read from the object, storing the page's contents in 's'.
s = f.read()
f.close()

print s

"http:www.url.com" is probably an example, OK, but what happened to the "//" part? Anyway, you should post the real URL you are trying to download; otherwise there is nothing to test, only to guess.
– tokland, Commented Jan 31, 2011 at 21:33
It's interesting that you say your Python works. I get an error saying there's an HTTP error, "no host given".
– the Tin Man, Commented Jan 31, 2011 at 22:06
For example: www.yellowpages.com.au/search/listings?clue=plumber&locationClue=Australia
– Jeff, Commented Jan 31, 2011 at 22:34
Thanks very much for the responses. I tested with the code given in the answers below; none has worked so far. With the Python code above, if you update the URL to the Yellow Pages one, it shows the actual HTML.
– Jeff, Commented Jan 31, 2011 at 22:40

Michael Papile · Accepted Answer · 2011-02-01 00:03:06Z

5

If that is your exact code, it is invalid for several reasons:

1. http: should be http://
2. The URL needs a path. If you want the root page of example.com, it needs to be http://example.com/ (the trailing slash is significant).
3. If you put two lines of code on one line, you need to use ; to mark the end of the first statement.
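The first two points can be checked before any request is sent. The sketch below is illustrative only: fetchable_http_url? is a hypothetical helper name, not part of Ruby's standard library. It assumes that URI.parse yields a URI::HTTP object with a host only when the string carries a proper http:// or https:// prefix.

```ruby
require 'uri'

# Hypothetical helper (not a standard-library method): returns true only
# when the string parses as an http/https URL that names a host.
def fetchable_http_url?(string)
  uri = URI.parse(string)
  uri.is_a?(URI::HTTP) && !uri.host.to_s.empty?
rescue URI::Error
  # Strings that URI cannot parse at all are not fetchable either.
  false
end

puts fetchable_http_url?('http//:www.stackoverflow.com/')  # misplaced colon
puts fetchable_http_url?('http:www.stackoverflow.com/')    # missing "//"
puts fetchable_http_url?('http://www.stackoverflow.com/')  # well formed
```

Running a check like this first separates a malformed URL from a server that genuinely returns nothing.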

So:

require 'net/http'

url = URI.parse('http://www.yellowpages.com.au/search/listings?clue=plumber&locationClue=Australia')
req = Net::HTTP::Get.new(url.path)
res = Net::HTTP.start(url.host, url.port) { |http| http.request(req) }
puts res.body
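
One detail worth checking in the snippet above, offered as an observation rather than a confirmed diagnosis of the empty responses: url.path does not include the query string, so the request goes to /search/listings without clue or locationClue. URI::HTTP#request_uri returns the path together with the query. The sketch below only builds request objects and sends nothing; build_get is a name made up for illustration.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: builds (but does not send) a GET request that keeps
# the query string, by using request_uri instead of path.
def build_get(url_string)
  url = URI.parse(url_string)
  Net::HTTP::Get.new(url.request_uri)
end

url = URI.parse('http://www.yellowpages.com.au/search/listings?clue=plumber&locationClue=Australia')
puts url.path                  # path only, query string dropped
puts url.request_uri           # path plus the "?clue=...&locationClue=..." part
puts build_get(url.to_s).path  # what the request would actually ask for
```

Whether the missing query string explains the empty bodies for this particular site is not established in this thread; the curl test with the full URL suggests the site is also flaky on its own.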

The same applies when using open with Nokogiri.

EDIT: that site frequently returns empty responses:

require 'net/http'

counter = 0

# Request the same page 20 times and count the non-empty responses.
20.times do
  url = URI.parse('http://www.yellowpages.com.au/search/listings?clue=plumber&locationClue=Australia')
  req = Net::HTTP::Get.new(url.path)
  res = Net::HTTP.start(url.host, url.port) { |http| http.request(req) }
  sleep 1
  counter += 1 unless res.body.empty?
end

puts counter

For me this returned a non-empty body only once. If you substitute another site, it works every time.

curl "http://www.yellowpages.com.au/search/listings?clue=plumber&locationClue=Australia"

Yields the same inconsistent results.
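
If the server really does return an empty body intermittently, one workaround is to retry until a non-empty body arrives. The following is a minimal sketch under that assumption, not a fix confirmed in this thread: first_non_empty is a hypothetical helper, the attempt count and pause are arbitrary defaults, and the keyword arguments assume Ruby 2.0 or later. The retry logic takes a block so it can wrap Net::HTTP, open-uri, or anything else that returns a string.

```ruby
# Hypothetical helper: calls the block up to `attempts` times and returns
# the first non-empty string it produces, or nil if every attempt is empty.
def first_non_empty(attempts: 5, pause: 1)
  attempts.times do |i|
    body = yield
    return body unless body.nil? || body.empty?
    sleep pause if i < attempts - 1  # no need to wait after the last try
  end
  nil
end

# Self-contained demonstration with a stand-in for the flaky site:
# the first two "responses" are empty, the third has content.
responses = ['', '', '<html>ok</html>']
body = first_non_empty(attempts: 5, pause: 0) { responses.shift }
puts body
```

In real use the block would perform the request, for example by returning the body of the Net::HTTP.start call shown earlier.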

edited Feb 1, 2011 at 0:03

answered Jan 31, 2011 at 21:32

Michael Papile


7 Comments

Jeff Over a year ago

I've given the URL I was testing with in a comment above. I tested your code again and still got an empty result.

Michael Papile Over a year ago

I am getting intermittent results with that site. I think it is returning empty body most of the time. Run that code a bunch of times and you will see. If I run it with yahoo.com it works every time.

Jeff Over a year ago

What I am curious about is why the Python code returns the correct HTML every single time, whereas the Ruby code returns an empty result most of the time. I am still trying to suss it out, because if the Ruby library isn't "reliable" then I should consider using Python for my particular case.

Michael Papile Over a year ago

The Ruby library is reliable; the site is not. I have no idea why it works in Python all the time. If I run curl in my shell (nothing to do with Ruby), I get blank results half the time too. I do not think curl and Net::HTTP are broken; I think the site is. Try running an example similar to mine in Python (i.e. a loop of 20 hits). I do not think you will get 100% results.

Jeff Over a year ago

Is there a way to increase the chance of getting a more consistent result in Ruby, such as a longer timeout? I also think the site has issues, as I've just tested with a few other sites and most of them worked as expected.


steenslag · Accepted Answer · 2011-01-31 22:26:30Z

2

Two examples with open-uri (standard library), a wrapper for (among others) the rather cumbersome Net::HTTP:

require 'open-uri'

open("http://www.stackoverflow.com/"){|f| puts f.read}

puts URI::parse("http://www.google.com/").read
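
A note for readers on newer Rubies, beyond what this thread covers: calling Kernel#open with a URL was deprecated in Ruby 2.7 and removed in 3.0, and open-uri provides URI.open instead (available since Ruby 2.5). A sketch of the first example in that style follows; read_page is a hypothetical helper name and the User-Agent value is an arbitrary placeholder.

```ruby
require 'open-uri'

# Hypothetical helper: fetches a page with URI.open and returns its body.
# String keys in the options hash are sent as HTTP request headers.
def read_page(url, user_agent: 'Ruby open-uri')
  URI.open(url, 'User-Agent' => user_agent) { |f| f.read }
end

# Usage (performs a real HTTP request, so it is left commented out):
# puts read_page('http://www.stackoverflow.com/')
```

On Ruby 1.9.2, as used in the question, the plain open call shown above is the form that applies.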

edited Jan 31, 2011 at 22:26

answered Jan 31, 2011 at 21:54

steenslag


1 Comment

Jeff Over a year ago

I've given the url I was testing with in a comment above. With the open() method I get an error, with URI::parse, I get empty result as I would normally get.
