3

I am parsing this feed http://www.sixapart.com/labs/update/developers/ with Nokogiri and then running some regexes on the contents of some tags. The content is mostly UTF-8, but is occasionally corrupt. In my case I don't really care: I just need to pass the right parts of the content through, so I'm happy to treat the data as binary/ASCII-8BIT. The problem is that, no matter what I set the encoding comment to or how I create the regex, the regexes in my script are treated as either UTF-8 or ASCII.

Is there a solution to this? Can I force the regex to binary? Can I do a gsub without a regex easily? (I am just replacing &amp; with &.)

2
  • You can easily pass a string to gsub: string.gsub('&amp', '&') Commented Nov 1, 2010 at 16:20
  • Doing that just causes the string to become a regex. Same problem. Commented Nov 1, 2010 at 18:31

2 Answers

4

You need to force the encoding of the initial pattern string to binary and pass the Regexp::FIXEDENCODING option.

1.9.3-head :018 > r = Regexp.new("chars".force_encoding("binary"), Regexp::FIXEDENCODING)
=> /chars/
1.9.3-head :019 > r.encoding
=> #<Encoding:ASCII-8BIT>
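
For the substitution in the question, a minimal sketch might look like the following. The sample string and the variable names (data, pattern, result) are made up for illustration and are not from the original post; the stray \xFF byte stands in for the corrupt parts of the feed.

```ruby
# Sample input (an assumption for illustration): text with one byte that is
# not valid UTF-8, tagged as binary so Ruby does not try to validate it.
data = "Tom &amp; Jerry \xFF".dup.force_encoding(Encoding::BINARY)

# Build the regex from a binary-tagged pattern and pin its encoding.
pattern = Regexp.new("&amp;".dup.force_encoding(Encoding::BINARY),
                     Regexp::FIXEDENCODING)

# Both sides are ASCII-8BIT, so the match works on raw bytes.
result = data.gsub(pattern, "&")

p pattern.encoding == Encoding::BINARY
p result.encoding == Encoding::BINARY
```

The result stays tagged as binary, so it may need a force_encoding back to UTF-8 before it is handed to code that expects text.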

0

Strings have an encoding property. Try calling String#force_encoding on the string before applying the regex.

UPD: To make your regexp ASCII, see the accepted answer to this question: Ruby 1.9: Regular Expressions with unknown input encoding

# Build a regex from the pattern transcoded to the given encoding.
def get_regex(pattern, encoding = 'ASCII', options = 0)
  Regexp.new(pattern.encode(encoding), options)
end
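
A possible reason this helper alone does not give a binary regex: as far as I can tell, Ruby tags a regex built from an ASCII-only pattern as US-ASCII unless Regexp::FIXEDENCODING is passed. Below is a rough sketch of a binary variant; binary_regex is a hypothetical name, not part of the linked answer.

```ruby
# Hypothetical variant of get_regex: re-tag a copy of the pattern as binary
# and add FIXEDENCODING so the regex keeps that encoding.
def binary_regex(pattern, options = 0)
  source = pattern.dup.force_encoding(Encoding::BINARY)
  Regexp.new(source, options | Regexp::FIXEDENCODING)
end

# Without the flag, an ASCII-only pattern ends up as a US-ASCII regex.
plain = Regexp.new("chars".dup.force_encoding(Encoding::BINARY))
fixed = binary_regex("chars")

p plain.encoding == Encoding::US_ASCII
p fixed.encoding == Encoding::BINARY
```

Other options can still be passed through, for example binary_regex("chars", Regexp::IGNORECASE).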

3 Comments

Right, I did that. The problem is I can't get the regex to have the same encoding (binary) as the string.
@singpolyma, see the update. Is that what you need?
Right, I can get it to be ASCII or UTF-8, but I can't get it to be binary/ASCII-8BIT.
