Extract data from URL with Ruby

Question

I'm new to Ruby and I'm trying to return a list of ASINs and their corresponding prices. I got pretty close to what I need, but I need help with two questions:

1. How can I get rid of the [[" and \n"]] around the ASIN (see the result below)?
2. Is there a simpler way to extract the ASIN from the URL than using this regex?

Thanks so much for your help!

Here is what I get in the Terminal from the current code:

[["B00EJDIG8M\n"]] - $7.00
[["B00KJ07SEM\n"]] - $26.99
[["B000FAR33M\n"]] - $119.00
[["B00LLMKPVK\n"]] - $22.99
[["B007NXPAQG\n"]] - $9.47
[["B004W5WAMU\n"]] - $22.43
[["B00LFUNGU0\n"]] - $17.99
[["B0052G14E8\n"]] - $54.99
[["B002MPLYEW\n"]] - $212.99
[["B00009W3G7\n"]] - $6.61
[["B000NCTOUM\n"]] - $3.04
[["B009SANIDO\n"]] - $12.29
[["B0052G51AQ\n"]] - $67.99
[["B003XEUEPQ\n"]] - $26.74
[["B00CYH9HRO\n"]] - $25.75
[["B00KV0SKQK\n"]] - $21.99
[["B009PCI2JU\n"]] - $56.66
[["B00LLM6ZFK\n"]] - $24.99
[["B004RQDY60\n"]] - $18.40
[["B000JLNBW4\n"]] - $49.14

Here is the code:

require 'rubygems'
require 'nokogiri'
require 'open-uri'
PAGE_URL = "http://www.amazon.com/Best-Sellers-Appliances/zgbs/appliances/ref=zg_bs_nav_0"

# On Ruby 3.0+ a bare open() no longer fetches URLs; URI.open is the open-uri entry point.
page = Nokogiri::HTML(URI.open(PAGE_URL))
page.css(".zg_itemWrapper").each do |item|
  price = item.at_css(".zg_price .price").text
  # scan returns a nested array here, which is where the [["...\n"]] in the output comes from
  asin = item.at_css(".zg_title a")[:href].scan(/http:\/\/(?:www\.|)amazon\.com\/(?:gp\/product|[^\/]+\/dp|dp)\/([^\/]+)/)
  puts "#{asin} - #{price}"
end

What you're seeing is a doubly nested array with a string in it. The brackets show up in your output because of how you extracted the value from the href: scan returns an array of arrays when the pattern contains a capture group.
– tadman, Commented Oct 29, 2014 at 5:42
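
To make the comment concrete, here is a minimal sketch of what scan returns when the pattern has one capture group. The href and the shortened pattern are made-up stand-ins (assumptions for illustration), not values taken from the scraped page.

```ruby
# Hypothetical href; the trailing newline mimics the scraped attribute value.
SAMPLE_HREF = "http://www.amazon.com/Example-Product/dp/B00EJDIG8M\n"

# Simplified stand-in for the question's pattern, with a single capture group.
ASIN_PATTERN = %r{/dp/([^/]+)}

# scan returns one inner array per match, each holding that match's captures.
def scan_asin(href)
  href.scan(ASIN_PATTERN)
end

matches = scan_asin(SAMPLE_HREF)
p matches              # the full nested array
p matches.first        # the captures of the first match
p matches.first.first  # the captured string, newline still attached

# Interpolating the array calls Array#to_s, which is why the brackets,
# quotes and the escaped \n all end up in the printed line.
puts "#{matches} - $7.00"
```

The outer array is the list of matches and the inner array is the list of capture groups, so a single match with a single group still comes back two levels deep.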

Todd A. Jacobs · Accepted Answer · 2014-10-29 03:43:52Z

3

Rather than cleaning up your Nokogiri search, the easiest thing to do at this point is just clean up your current asin values during interpolation. For example:

puts "#{asin.flatten.pop.chomp} - #{price}"

tadman Over a year ago

asin.join.chomp could do it as well.
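
As a rough sketch, the two cleanups suggested above can be compared side by side on a value shaped like the scan result from the question. The method names and the sample value are hypothetical, chosen only for this illustration.

```ruby
# Sample value shaped like the scan result in the question (hypothetical).
NESTED_ASIN = [["B00EJDIG8M\n"]]

# flatten returns a new one-level array, pop takes its last element,
# and chomp strips the trailing newline. The original array is not modified.
def clean_with_flatten(asin)
  asin.flatten.pop.chomp
end

# join flattens nested arrays recursively while concatenating,
# so the result is already a plain string before chomp runs.
def clean_with_join(asin)
  asin.join.chomp
end

puts "#{clean_with_flatten(NESTED_ASIN)} - $7.00"
puts "#{clean_with_join(NESTED_ASIN)} - $7.00"

# When scan finds nothing it returns [], and the two versions diverge:
# join gives an empty string, while flatten.pop gives nil and chomp then raises.
p clean_with_join([])
```

Both versions give the same string when there is exactly one match. If an item might have no matching link, the join variant degrades more quietly, though an empty ASIN may still need handling.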

Sylvain · Accepted Answer · 2014-11-16 01:58:59Z

0

Regarding question 2, I realized I don't really need a regex and found a way to get the same result with a much shorter line of code

replacing

asin = item.at_css(".zg_title a")[:href].scan(/http:\/\/(?:www\.|)amazon\.com\/(?:gp\/product|[^\/]+\/dp|dp)\/([^\/]+)/)

with

asin = item.at_css(".zg_title a")[:href].split("/")[5].chomp
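
One caveat worth sketching: the fixed index 5 only works when every href has the same number of path segments before the ASIN. The snippet below compares it with two alternatives that key off the "dp" segment instead. The URLs, the method names, and the 10-character uppercase alphanumeric ASIN pattern are assumptions for illustration (the pattern is inferred from the sample output in the question, not from any specification).

```ruby
# Hypothetical hrefs: one with a product-name segment, one without.
LONG_HREF  = "http://www.amazon.com/Example-Product/dp/B00EJDIG8M\n"
SHORT_HREF = "http://www.amazon.com/dp/B00EJDIG8M\n"

# The approach from the answer: take the sixth "/"-separated piece.
def asin_by_position(href)
  href.split("/")[5].chomp
end

# Alternative: find the "dp" segment and take the piece right after it.
# Returns nil when there is no "dp" segment.
def asin_after_dp(href)
  parts = href.strip.split("/")
  index = parts.index("dp")
  index && parts[index + 1]
end

# Alternative: String#[] with a regex and a group number returns the
# captured text directly (or nil), so there is no nested array to unwrap.
def asin_by_capture(href)
  href[%r{/dp/([A-Z0-9]{10})}, 1]
end

puts asin_by_position(LONG_HREF)
puts asin_after_dp(SHORT_HREF)
puts asin_by_capture(SHORT_HREF)

# asin_by_position(SHORT_HREF) would raise NoMethodError, because index 5
# is nil for the shorter URL and nil does not respond to chomp.
```

If all links on the page share one shape, the positional split is the shortest option; the other two trade a little length for tolerance of URL variations such as the bare /dp/ form that the original regex also allowed.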
