
I'm trying to do something which seems like it should be very simple: I want to see if a specific string, e.g. 'out of stock', is found within a page's source code. However, I don't care if the string is contained within an HTML comment or JavaScript, so prior to doing my search I'd like to remove both of these elements using regular expressions. This is the code I'm using:

require 'http'

urls.each do |url|
  response = HTTP.get(url)
  if response.status.success?
    source_code = response.to_s
    # Remove comments
    source_code = source_code.gsub(/<!--(.*?)-->/su, '')
    # Remove scripts
    source_code = source_code.gsub(/<script(.*?)<\/script>/msu, '')

    if source_code.match(/out of stock/i)
      # Flag URL for further processing
    end
  end
end

This works for 99% of the URLs I've tried it with, but certain URLs have become problematic. When I try to use these regular expressions on the source code returned for the URL "https://www.sunski.com", I get the following error message:

Encoding::CompatibilityError (incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string))

The page is definitely UTF-8 encoded, so I don't really understand the error message. A few people on Stack Overflow recommended adding the # encoding: UTF-8 magic comment at the top of the file, but this didn't work.

If anyone could help with this it would be hugely appreciated. Thank you!

  • You should only use the m modifier with your regex: /<script(.*?)<\/script>/m Commented Jul 9, 2020 at 14:22
  • "The page is definitely UTF-8 encoded." Good rule for debugging: never assume, always check. Did you check what source_code.encoding returns? Commented Jul 9, 2020 at 14:52
  • @Casper I guess I have a fundamental misunderstanding of how encoding works. I visited the page's source and saw the line <meta charset="utf-8">, and in the Chrome console called document.characterSet. Both seemed to indicate UTF-8 encoding, but using the method you described I see that the encoding is not what I expected! Thanks. Commented Jul 9, 2020 at 14:59
  • @WiktorStribiżew This actually worked... I'm not quite sure how, now that I know the encoding was actually mismatched. Thanks anyway. Commented Jul 9, 2020 at 15:00
  • @Casper source_code is guaranteed to be a binary string because Net::HTTP doesn't care about the encoding of the website. Commented Jul 9, 2020 at 15:10

1 Answer


The Net::HTTP standard library only returns binary (ASCII-8BIT) strings. See the long-standing feature request: Feature #2567: Net::HTTP does not handle encoding correctly. So if you want UTF-8 strings you have to manually set their encoding to UTF-8 with String#force_encoding:

source_code.force_encoding(Encoding::UTF_8)
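
Applied to your loop, that would look roughly like this (just a sketch, reusing the http gem's HTTP.get and the urls array from your snippet; besides the force_encoding call it also drops the s/u specifiers discussed below in favor of plain /m):

require 'http'

urls.each do |url|
  response = HTTP.get(url)
  next unless response.status.success?

  # The body arrives as ASCII-8BIT, so reinterpret it as UTF-8 before matching.
  source_code = response.to_s.force_encoding(Encoding::UTF_8)

  source_code = source_code.gsub(/<!--(.*?)-->/m, '')            # remove HTML comments
  source_code = source_code.gsub(/<script(.*?)<\/script>/m, '')  # remove <script> blocks

  if source_code.match?(/out of stock/i)
    # Flag URL for further processing
  end
end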

If the website's character encoding isn't UTF-8, you have to implement a heuristic based on the Content-Type header or the <meta> tag's charset attribute, but even then it might not be the correct encoding. You can validate a string's encoding with String#valid_encoding? if you need to deal with such cases. Thankfully, most websites use UTF-8 nowadays.
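
One possible shape for such a heuristic (a rough sketch, not a complete solution; guess_encoding is a hypothetical helper, and the header lookup assumes the http gem's response.headers):

def guess_encoding(content_type)
  # Prefer an explicit charset from the header, e.g. "text/html; charset=ISO-8859-1".
  if content_type.to_s =~ /charset=([\w-]+)/i
    begin
      return Encoding.find($1)
    rescue ArgumentError
      # Unknown charset name; fall through to the default below.
    end
  end
  Encoding::UTF_8 # most websites use UTF-8 nowadays
end

source_code = response.to_s
source_code.force_encoding(guess_encoding(response.headers["Content-Type"]))

# If the declared charset was wrong, the bytes may still not be valid;
# String#valid_encoding? detects that, and String#scrub can replace the bad bytes.
source_code = source_code.scrub('?') unless source_code.valid_encoding?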

Also, as @WiktorStribiżew already wrote in the comments, the regexp encoding specifiers s (Windows-31J) and u (UTF-8) aren't necessary here, and they very rarely are. That goes especially for the latter, since modern Ruby defaults to UTF-8 (or, where sufficient, its subset US-ASCII) anyway. In other programming languages these letters may have a different meaning, e.g. in Perl s means single-line mode.
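
For illustration, the m flag is the one doing the real work here: in Ruby it makes . also match newlines, which is exactly what multi-line comments and <script> blocks need (a tiny standalone example):

html = "<script>\nvar stock = 0;\n</script>ok"

html.gsub(/<script(.*?)<\/script>/, '')   # => unchanged, because . does not match "\n"
html.gsub(/<script(.*?)<\/script>/m, '')  # => "ok"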
