I'm trying to do something that seems like it should be very simple: check whether a specific string, e.g. 'out of stock', appears in a page's source code. However, I don't care about matches inside an HTML comment or JavaScript, so before searching I'd like to remove both using regular expressions. This is the code I'm using:
urls.each do |url|
  response = HTTP.get(url)
  if response.status.success?
    source_code = response.to_s
    # Remove comments
    source_code = source_code.gsub(/<!--(.*?)-->/su, '')
    # Remove scripts
    source_code = source_code.gsub(/<script(.*?)<\/script>/msu, '')
    if source_code.match(/out of stock/i)
      # Flag URL for further processing
    end
  end
end
This works for 99% of the URLs I tried, but certain URLs are problematic. When I run these regular expressions on the source code returned for "https://www.sunski.com", I get the following error message:
Encoding::CompatibilityError (incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string))
The page is definitely UTF-8 encoded, so I don't really understand the error message. A few people on Stack Overflow recommended adding the # encoding: UTF-8 magic comment at the top of the file, but this didn't work.
If anyone could help with this it would be hugely appreciated. Thank you!
Use only the m modifier with your regex: /<script(.*?)<\/script>/m
What does source_code.encoding return?
I saw <meta charset="utf-8"> in the source and in the Chrome console called document.characterSet. Both seemed to indicate UTF-8 encoding, but I see using the method you described that the encoding is not what I expected! Thanks.
source_code is guaranteed to be a binary string because Net::HTTP doesn't care about the encoding of the website.
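A minimal sketch of what the comments above point at, assuming the body arrives tagged as ASCII-8BIT even though its bytes are valid UTF-8. The sample HTML and the helper name strip_comments_and_scripts are made up for illustration, and no network request is made. It first reproduces the error with a u-flagged regexp, then avoids it by re-tagging the string as UTF-8 and using only the m modifier. In Ruby, m is the flag that lets . match newlines, while s and u are encoding flags.

```ruby
# Sample page (an assumption for illustration): the phrase appears only
# inside a comment and a script. The "\u00e9" makes the body non-ASCII,
# which is the condition under which the encoding error shows up.
sample_html = "<html><!-- out of stock -->\n" \
              "<script>\nvar msg = 'out of stock';\n</script>\n" \
              "<p>Caf\u00e9 mug: in stock</p></html>"

# Simulate an HTTP body that is tagged as binary (ASCII-8BIT).
binary_body = sample_html.dup.force_encoding(Encoding::BINARY)

# A regexp with the u flag is pinned to UTF-8 and refuses this string.
error = begin
  binary_body.gsub(/<!--(.*?)-->/mu, '')
  nil
rescue Encoding::CompatibilityError => e
  e
end

# Hypothetical helper: re-tag the bytes as UTF-8, drop any invalid
# sequences, then strip comments and scripts using only the m modifier.
def strip_comments_and_scripts(body)
  text = body.dup.force_encoding(Encoding::UTF_8).scrub('')
  text = text.gsub(/<!--.*?-->/m, '')
  text.gsub(/<script\b.*?<\/script>/mi, '')
end

cleaned = strip_comments_and_scripts(binary_body)
flagged = cleaned.match?(/out of stock/i)

puts "body encoding:    #{binary_body.encoding}"
puts "error with /u:    #{error.class}"
puts "cleaned encoding: #{cleaned.encoding}"
puts "flagged:          #{flagged}"
```

force_encoding only changes the label on the string and does not convert any bytes, so it is appropriate only when the page really is UTF-8. The scrub call is a precaution against stray invalid bytes. For pages in other encodings, the charset from the Content-Type header would need to be taken into account instead.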