Ruby 1.9: Regular Expressions with unknown input encoding

Question

Is there an accepted way to deal with regular expressions in Ruby 1.9 for which the encoding of the input is unknown? Let's say my input happens to be UTF-16 encoded:

x  = "foo<p>bar</p>baz"
y  = x.encode('UTF-16LE')
re = /<p>(.*)<\/p>/

x.match(re) 
=> #<MatchData "<p>bar</p>" 1:"bar">

y.match(re)
Encoding::CompatibilityError: incompatible encoding regexp match (US-ASCII regexp with UTF-16LE string)
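
To illustrate where the error comes from, here is a minimal sketch that inspects the two encodings involved. It assumes a regex literal containing only ASCII characters, which Ruby tags as US-ASCII, and uses UTF-16LE as an example of an encoding that is not ASCII-compatible; the variable `error` is only there for illustration.

```ruby
x  = "foo<p>bar</p>baz"
y  = x.encode('UTF-16LE')
re = /<p>(.*)<\/p>/

# A literal made only of ASCII characters is tagged US-ASCII.
regex_encoding = re.encoding

# UTF-16LE is not ASCII-compatible, so the US-ASCII regex cannot be applied to it.
ascii_compatible = y.encoding.ascii_compatible?

# Capture the exception instead of letting it abort the script.
error = nil
begin
  y.match(re)
rescue Encoding::CompatibilityError => e
  error = e
end
```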

My current approach is to use UTF-8 internally and re-encode (a copy of) the input if necessary:

if y.respond_to?(:encode)  # Ruby 1.8 compatibility: String#encode does not exist there
  y = y.encode('UTF-8') unless y.encoding == Encoding::UTF_8
end

y.match(/<p>(.*)<\/p>/u)
=> #<MatchData "<p>bar</p>" 1:"bar">

However, this feels a little awkward to me, and I wanted to ask if there's a better way to do it.

Myrddin Emrys · Accepted Answer · 2009-12-22 00:26:05Z

9

As far as I am aware, there is no better method to use. However, might I suggest a slight alteration?

Rather than changing the encoding of the input, why not change the encoding of the regex? Translating one regex string every time you meet a new encoding is a lot less work than translating hundreds or thousands of lines of input to match the encoding of your regex.

# Utility function to make transcoding the regex simpler.
def get_regex(pattern, encoding = 'ASCII', options = 0)
  Regexp.new(pattern.encode(encoding), options)
end



  # Inside code looping through lines of input.
  # The variables 'regex' and 'last_encoding' should be initialized previously, to
  # persist across loops.
  if line.respond_to?(:encoding)  # Ruby 1.8 compatibility
    if line.encoding != last_encoding
      # Regexp::FIXEDENCODING (value 16) is the option bit that //u sets.
      regex = get_regex('<p>(.*)<\/p>', line.encoding, Regexp::FIXEDENCODING)
      last_encoding = line.encoding
    end
  end
  line.match(regex)

In the pathological case, where the input encoding changes on every line, this is just as slow, because the regex is re-encoded on every pass through the loop. In the far more common case, where the encoding stays constant across a file of hundreds or thousands of lines, it greatly reduces the amount of re-encoding.
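
One way to soften even the pathological case is to keep one compiled regex per encoding instead of only the most recent one. The following is a rough sketch under that assumption; the class `EncodedRegexCache` and its method `regex_for` are hypothetical names, not part of Ruby, and the sketch assumes Ruby 1.9 or later and non-dummy input encodings.

```ruby
# Hypothetical helper: caches one compiled Regexp per input encoding.
class EncodedRegexCache
  def initialize(pattern)
    @pattern = pattern
    # The default block compiles the pattern the first time an encoding is seen
    # and stores the result, so later lookups reuse the same Regexp object.
    @cache = Hash.new do |cache, encoding|
      cache[encoding] = Regexp.new(@pattern.encode(encoding))
    end
  end

  # Returns the Regexp transcoded to the given Encoding.
  def regex_for(encoding)
    @cache[encoding]
  end
end

paragraph = EncodedRegexCache.new('<p>(.*)<\/p>')

lines = [
  "foo<p>bar</p>baz".encode('UTF-16LE'),
  "one<p>two</p>three".encode('UTF-16LE'),
  "a<p>b</p>c" # left in the source encoding
]

# Captures come back in the encoding of the line, so normalize them for display.
captures = lines.map do |line|
  match = paragraph.regex_for(line.encoding).match(line)
  match && match[1].encode('UTF-8')
end
```

With this layout, input that alternates between a handful of encodings costs one transcoding per encoding rather than one per switch.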


2 Comments

DataWraith Over a year ago

Thanks! It had not occurred to me to do it the other way round and encode the Regexp. That's indeed a lot faster! For anyone else trying to do this: Beware of dummy encodings (#dummy?) when you try to test your code. Took me a while to figure out why it wasn't working.
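
As a hedged illustration of the dummy-encoding caveat: `Encoding#dummy?` reports whether Ruby treats an encoding as a label only, which is the case for plain UTF-16 (without LE or BE). The guard below is a sketch, and `matchable` is a hypothetical helper name; it simply transcodes dummy-encoded input to UTF-8 before matching.

```ruby
# Plain UTF-16 is a dummy encoding; the explicit byte-order variants are not.
utf16_is_dummy   = Encoding::UTF_16.dummy?
utf16le_is_dummy = Encoding::UTF_16LE.dummy?

# Hypothetical guard: give dummy-encoded input a concrete encoding before matching.
def matchable(line)
  line.encoding.dummy? ? line.encode('UTF-8') : line
end

line  = matchable("foo<p>bar</p>baz".encode('UTF-16'))
match = line.match(/<p>(.*)<\/p>/u)
```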

mahemoff Over a year ago

Agree about performance - I found it's much faster to memoize the regex. Quick hack here to handle whitespace stripping: gist.github.com/mahemoff/c877eb1e955b1160dcdf6f4d4c0ba043

Sam · Answer · 2010-06-29 12:20:35Z

0

Follow the advice of this page: http://gnuu.org/2009/02/02/ruby-19-common-problems-pt-1-encoding/ and add

# encoding: utf-8

to the top of your rb file.
