
I'm new to Ruby and I'm trying to print a list of ASINs and their corresponding prices. I got pretty close to what I need, but I have two questions:

  1. How can I get rid of the [[" and \n"]] around the ASIN (see the result below)?
  2. Is there a simpler way to extract the ASIN from the URL than using this regex?

Thanks so much for your help!

Here is what I get in the Terminal from the current code:

[["B00EJDIG8M\n"]] - $7.00
[["B00KJ07SEM\n"]] - $26.99
[["B000FAR33M\n"]] - $119.00
[["B00LLMKPVK\n"]] - $22.99
[["B007NXPAQG\n"]] - $9.47
[["B004W5WAMU\n"]] - $22.43
[["B00LFUNGU0\n"]] - $17.99
[["B0052G14E8\n"]] - $54.99
[["B002MPLYEW\n"]] - $212.99
[["B00009W3G7\n"]] - $6.61
[["B000NCTOUM\n"]] - $3.04
[["B009SANIDO\n"]] - $12.29
[["B0052G51AQ\n"]] - $67.99
[["B003XEUEPQ\n"]] - $26.74
[["B00CYH9HRO\n"]] - $25.75
[["B00KV0SKQK\n"]] - $21.99
[["B009PCI2JU\n"]] - $56.66
[["B00LLM6ZFK\n"]] - $24.99
[["B004RQDY60\n"]] - $18.40
[["B000JLNBW4\n"]] - $49.14

Here is the code:

require 'rubygems'
require 'nokogiri'
require 'open-uri'
PAGE_URL = "http://www.amazon.com/Best-Sellers-Appliances/zgbs/appliances/ref=zg_bs_nav_0"

# URI.open is the open-uri entry point on current Ruby; the bare open(url) form stopped working in Ruby 3.0.
page = Nokogiri::HTML(URI.open(PAGE_URL))
page.css(".zg_itemWrapper").each do |item|
  price = item.at_css(".zg_price .price").text
  # scan with a capture group returns an array of arrays, hence the [[...]] in the output above.
  asin = item.at_css(".zg_title a")[:href].scan(/http:\/\/(?:www\.|)amazon\.com\/(?:gp\/product|[^\/]+\/dp|dp)\/([^\/]+)/)
  puts "#{asin} - #{price}"
end
What you're seeing is a doubly nested array with a string in it. It shows up in your output because of how you extracted the value from the results: scan with a capture group returns an array of arrays like this. Commented Oct 29, 2014 at 5:42

2 Answers


Rather than reworking your Nokogiri search, the easiest fix at this point is to clean up the asin value during interpolation: flatten the nested array, take its only element, and strip the trailing newline. For example:

puts "#{asin.flatten.pop.chomp} - #{price}"
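
To see why this works, here is a minimal sketch that runs without hitting the network. The href value is hypothetical, modeled on the output shown in the question (including the trailing newline); the regex is the one from the question. It also shows String#[] with a capture index, which may be a tidier option because it returns the captured string directly instead of a nested array.

```ruby
# Hypothetical href, shaped like the values the question's output implies.
href = "http://www.amazon.com/Some-Product/dp/B00EJDIG8M\n"

# The regex from the question, unchanged.
ASIN_PATTERN = /http:\/\/(?:www\.|)amazon\.com\/(?:gp\/product|[^\/]+\/dp|dp)\/([^\/]+)/

# scan with a capture group returns one inner array per match,
# with one element per capture group.
nested = href.scan(ASIN_PATTERN)
p nested # => [["B00EJDIG8M\n"]]

# flatten removes the nesting, pop takes the only element,
# and chomp strips the trailing newline.
asin = nested.flatten.pop.chomp
puts asin # => B00EJDIG8M

# Alternative: String#[] with a regex and a capture index returns
# the first match's capture as a plain string (or nil if nothing matches).
direct = href[ASIN_PATTERN, 1].chomp
puts direct # => B00EJDIG8M
```

Note that href[ASIN_PATTERN, 1] returns nil when the pattern does not match, so real code would probably want to guard against that before calling chomp.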

1 Comment

asin.join.chomp could do it as well.

Regarding question 2, I realized I don't really need a regex. I got the same result with a much shorter line of code by

replacing

asin = item.at_css(".zg_title a")[:href].scan(/http:\/\/(?:www\.|)amazon\.com\/(?:gp\/product|[^\/]+\/dp|dp)\/([^\/]+)/)

with

asin = item.at_css(".zg_title a")[:href].split("/")[5].chomp
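
As a rough illustration of what the split does, the sketch below uses a hypothetical href shaped like the ones this page appears to produce. The fixed index 5 only holds while every URL has exactly that shape, so the sketch also includes asin_after_marker, a hypothetical helper (not part of the original answer) that looks for the segment following "dp" or "product" instead.

```ruby
# Hypothetical href, shaped like the values the question's output implies.
href = "http://www.amazon.com/Some-Product/dp/B00EJDIG8M\n"

parts = href.split("/")
p parts
# => ["http:", "", "www.amazon.com", "Some-Product", "dp", "B00EJDIG8M\n"]

# Index 5 is the ASIN only because the URL has exactly this layout.
asin = parts[5].chomp
puts asin # => B00EJDIG8M

# Hypothetical helper: return the path segment that follows "dp" or
# "product", or nil if neither marker is present.
def asin_after_marker(href)
  segments = href.strip.split("/")
  marker = segments.index { |s| s == "dp" || s == "product" }
  marker && segments[marker + 1]
end

puts asin_after_marker(href)                                         # => B00EJDIG8M
puts asin_after_marker("http://www.amazon.com/dp/B00EJDIG8M/ref=x")  # => B00EJDIG8M
```

The helper trades a little length for not depending on how many segments precede the ASIN; whether that matters depends on how uniform the links on the page are.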
