
I'm trying to do something which seems like it should be very simple: I want to see if a specific string, e.g. 'out of stock', is found within a page's source code. However, I don't care if the string is contained within an HTML comment or JavaScript, so prior to doing my search I'd like to remove both of these elements using regular expressions. This is the code I'm using:

require 'http'

urls.each do |url|
  response = HTTP.get(url)
  if response.status.success?
    source_code = response.to_s
    # Remove comments
    source_code = source_code.gsub(/<!--(.*?)-->/su, '')
    # Remove scripts
    source_code = source_code.gsub(/<script(.*?)<\/script>/msu, '')

    if source_code.match(/out of stock/i)
      # Flag URL for further processing
    end
  end
end

This works for 99% of the URLs I've tried it with, but certain URLs have become problematic. When I try to use these regular expressions on the source code returned for the URL "https://www.sunski.com", I get the following error message:

Encoding::CompatibilityError (incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string))

The page is definitely UTF-8 encoded, so I don't really understand the error message. A few people on Stack Overflow recommended adding the # encoding: UTF-8 magic comment at the top of the file, but this didn't work.

If anyone could help with this it would be hugely appreciated. Thank you!

  • You should only use the m modifier with your regex: /<script(.*?)<\/script>/m Commented Jul 9, 2020 at 14:22
  • "The page is definitely UTF-8 encoded." Good rule for debugging: never assume, always check. Did you check what source_code.encoding returns? Commented Jul 9, 2020 at 14:52
  • @Casper I guess I have a fundamental misunderstanding of how encoding works. I visited the page's source and saw the line <meta charset="utf-8">, and in the Chrome console called document.characterSet. Both seemed to indicate UTF-8 encoding, but using the method you described I see that the encoding is not what I expected! Thanks. Commented Jul 9, 2020 at 14:59
  • @WiktorStribiżew This actually worked... I'm not quite sure how, now that I know the encoding was actually mismatched. Thanks anyway. Commented Jul 9, 2020 at 15:00
  • @Casper source_code is guaranteed to be a binary string because Net::HTTP doesn't care about the encoding of the website. Commented Jul 9, 2020 at 15:10

1 Answer


The Net::HTTP standard library only returns binary (ASCII-8BIT) strings. See the long-standing feature request: Feature #2567: Net::HTTP does not handle encoding correctly. So if you want UTF-8 strings you have to manually set their encoding to UTF-8 with String#force_encoding:

source_code.force_encoding(Encoding::UTF_8)
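
Applied to your loop, that would look roughly like this (just a sketch, reusing the http gem's HTTP.get and the urls array from your snippet; besides the force_encoding call it also drops the s/u specifiers discussed below in favor of plain /m):

require 'http'

urls.each do |url|
  response = HTTP.get(url)
  next unless response.status.success?

  # The body arrives as ASCII-8BIT, so reinterpret it as UTF-8 before matching.
  source_code = response.to_s.force_encoding(Encoding::UTF_8)

  source_code = source_code.gsub(/<!--(.*?)-->/m, '')            # remove HTML comments
  source_code = source_code.gsub(/<script(.*?)<\/script>/m, '')  # remove <script> blocks

  if source_code.match?(/out of stock/i)
    # Flag URL for further processing
  end
end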

If the website's character encoding isn't UTF-8, you have to implement a heuristic based on the Content-Type header or the <meta> tag's charset attribute, but even then it might not be the correct encoding. You can validate a string's encoding with String#valid_encoding? if you need to deal with such cases. Thankfully, most websites use UTF-8 nowadays.
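
One possible shape for such a heuristic (a rough sketch, not a complete solution; guess_encoding is a hypothetical helper, and the header lookup assumes the http gem's response.headers):

def guess_encoding(content_type)
  # Prefer an explicit charset from the header, e.g. "text/html; charset=ISO-8859-1".
  if content_type.to_s =~ /charset=([\w-]+)/i
    begin
      return Encoding.find($1)
    rescue ArgumentError
      # Unknown charset name; fall through to the default below.
    end
  end
  Encoding::UTF_8 # most websites use UTF-8 nowadays
end

source_code = response.to_s
source_code.force_encoding(guess_encoding(response.headers["Content-Type"]))

# If the declared charset was wrong, the bytes may still not be valid;
# String#valid_encoding? detects that, and String#scrub can replace the bad bytes.
source_code = source_code.scrub('?') unless source_code.valid_encoding?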

Also, as @WiktorStribiżew already wrote in the comments, the regexp encoding specifiers s (Windows-31J) and u (UTF-8) aren't necessary here, and they very rarely are. That goes especially for the latter, since modern Ruby defaults to UTF-8 (or, where sufficient, its subset US-ASCII) anyway. In other programming languages these letters may have a different meaning, e.g. in Perl s means single-line mode.
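
For illustration, the m flag is the one doing the real work here: in Ruby it makes . also match newlines, which is exactly what multi-line comments and <script> blocks need (a tiny standalone example):

html = "<script>\nvar stock = 0;\n</script>ok"

html.gsub(/<script(.*?)<\/script>/, '')   # => unchanged, because . does not match "\n"
html.gsub(/<script(.*?)<\/script>/m, '')  # => "ok"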
