Ruby 1.9 regex encoding

Question

I am parsing the feed at http://www.sixapart.com/labs/update/developers/ with Nokogiri and then running some regexes on the contents of some tags. The content is mostly UTF-8 but is occasionally corrupt. For my case I don't really care; I just need to pass the right parts of the content through, so I'm happy to treat the data as binary/ASCII-8BIT. The problem is that the regexes in my script are always treated as either UTF-8 or ASCII, no matter what I set the encoding comment to or how I create the regex.

Is there a solution to this? Can I force the regex to binary? Can I do a gsub without a regex easily? (I am just replacing &amp; with &)

You can easily pass a string to gsub: string.gsub('&amp', '&')
– ipsum, Commented Nov 1, 2010 at 16:20
Doing that just causes the string to become a regex. Same problem.
– singpolyma, Commented Nov 1, 2010 at 18:31

Carlos D · Accepted Answer · 2013-06-03 22:44:38Z

4

You need to force the encoding of the pattern string and pass the Regexp::FIXEDENCODING option.

1.9.3-head :018 > r = Regexp.new("chars".force_encoding("binary"), Regexp::FIXEDENCODING)
=> /chars/
1.9.3-head :019 > r.encoding
=> #<Encoding:ASCII-8BIT>
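
As a rough sketch of how this might be applied to the substitution in the question: the sample string and the variable names (raw, amp, cleaned) below are made up for illustration, and the stray \xFF byte stands in for the occasional corrupt content.

```ruby
# Hypothetical sample input: mostly text, plus one byte that is invalid UTF-8.
raw = "Tom &amp; Jerry \xFF".dup

# Relabel the bytes as binary (ASCII-8BIT); no transcoding takes place.
raw.force_encoding(Encoding::BINARY)

# Build the pattern as in the answer: binary pattern string plus FIXEDENCODING.
amp = Regexp.new("&amp;".dup.force_encoding(Encoding::BINARY), Regexp::FIXEDENCODING)

# The regex and the string now share one encoding, so gsub works on raw bytes
# and the invalid byte is passed through untouched.
cleaned = raw.gsub(amp, "&")

puts amp.encoding
puts cleaned.bytes.inspect
```

The result stays tagged as binary, so it would presumably need a force_encoding back to UTF-8 (or similar) before being handed to code that expects text.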

answered Jun 3, 2013 at 22:44

Community · Accepted Answer · 2017-05-23 11:56:55Z

0

Strings have an encoding property. Try calling String#force_encoding on the string before applying the regex.

UPD: To make your regexp ASCII, see the accepted answer here: Ruby 1.9: Regular Expressions with unknown input encoding

# Build a Regexp from a pattern string transcoded to the given encoding.
def get_regex(pattern, encoding = 'ASCII', options = 0)
  Regexp.new(pattern.encode(encoding), options)
end

edited May 23, 2017 at 11:56

answered Nov 1, 2010 at 16:19

Nakilon

3 Comments

singpolyma Over a year ago

Right, I did that. The problem is I can't get the regex to have the same encoding (binary) as the string.

Nakilon Over a year ago

@singpolyma, see the UPD. Is that what you need?

singpolyma Over a year ago

Right, I can get it to be ASCII or UTF-8, but I can't get it to be binary/ASCII-8BIT.
