I have the following code, which gives me an invalid byte sequence error pointing to the scan method in initialize. Any ideas on how to fix this? For what it's worth, the error does not occur when the (.*) between the h1 tag and the closing > is not there.

#!/usr/bin/env ruby

class NewsParser

  def initialize
      Dir.glob("./**/index.htm") do |file|
        @file = IO.read file 
        parsed = @file.scan(/<h1(.*)>(.*?)<\/h1>(.*)<!-- InstanceEndEditable -->/im)
        self.write(parsed)
      end
  end

  def write output
    @contents = output
    open('output.txt', 'a') do |f| 
      f << @contents[0][0]+"\n\n"+@contents[0][1]+"\n\n\n\n" 
    end
  end

end

p = NewsParser.new

Edit: Here is the error message:

news_parser.rb:10:in 'scan': invalid byte sequence in UTF-8 (ArgumentError)

SOLVED: Combining @file = IO.read(file).force_encoding("ISO-8859-1").encode("utf-8", replace: nil) with the # encoding: UTF-8 magic comment solved the issue.

Thanks!

  • try @file = IO.read(file).encode("utf-8", replace: nil) Commented Mar 7, 2012 at 19:11
  • Nope, I get the same error message. Commented Mar 7, 2012 at 19:12
  • It looks like the html file is Western (ISO-8859-1) Commented Mar 7, 2012 at 19:16
  • possible duplicate of ruby 1.9: invalid byte sequence in UTF-8 Commented Mar 7, 2012 at 19:17
  • @file = IO.read(file).force_encoding("ISO-8859-1").encode("utf-8", replace: nil) ? Commented Mar 7, 2012 at 19:21

2 Answers

Combining @file = IO.read(file).force_encoding("ISO-8859-1").encode("utf-8", replace: nil) with the # encoding: UTF-8 magic comment at the top of the script solved the issue.
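
As for why this works, here is a rough sketch, assuming the file really is ISO-8859-1 and Ruby's default external encoding is UTF-8 (which depends on the locale). IO.read then labels the raw bytes as UTF-8 even though they are not valid UTF-8, and scan refuses to run a regex over them. force_encoding only relabels the bytes without changing them; encode then transcodes them into real UTF-8. The helper name latin1_to_utf8 and the sample bytes are made up for illustration. The replace: nil option is left out here because, as far as I can tell, every byte is valid in ISO-8859-1, so nothing needs replacing. The magic comment only affects literals in the script itself, and UTF-8 is already the default source encoding from Ruby 2.0 on.

```ruby
# Hypothetical helper: treat the given bytes as ISO-8859-1 and
# return a new string transcoded to UTF-8.
def latin1_to_utf8(bytes)
  # force_encoding changes only the encoding label (on a copy here),
  # encode then converts the bytes themselves.
  bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

# "caf" followed by 0xE9, which is "e with acute accent" in ISO-8859-1.
# Labelled as UTF-8, the lone 0xE9 byte is an invalid sequence.
raw = [0x63, 0x61, 0x66, 0xE9].pack("C*").force_encoding("UTF-8")
puts raw.valid_encoding?   # false

begin
  raw.scan(/a(.*)/)        # expected to raise on the invalid bytes
rescue ArgumentError => e
  puts "scan failed: #{e.message}"
end

fixed = latin1_to_utf8(raw)
puts fixed.valid_encoding? # true
puts fixed.scan(/a(.*)/).inspect
```

If the files come from mixed sources, the ISO-8859-1 assumption will silently produce wrong characters for anything in another encoding, so it is worth confirming the encoding first.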

1 Comment

Thanks for the answer. An explanation of what it means would be helpful nonetheless.

While this question already has an accepted answer, I found it while having the same problem with a different style of opening the file:

File.open(file_name).each_with_index do |line, index|
  line.gsub!(/[{}]/, "'")
  puts "#{index} #{line}"
end

I found that my input file was encoded in ISO-8859-1, so I changed it to the following to avoid the error:

File.open(file_name, 'r:ISO-8859-1:utf-8').each_with_index do |line, index|
  line.gsub!(/[{}]/, "'")
  puts "#{index} #{line}"
end

See the documentation for the optional mode argument of the File.open method for more details.
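
To illustrate the mode string: in 'r:ISO-8859-1:utf-8' the first encoding is the external one (what is on disk) and the second is the internal one (what the strings are transcoded to as they are read). Below is a minimal, self-contained sketch that writes a small ISO-8859-1 temp file and reads it back that way. The helper name read_latin1_lines and the sample bytes are my own, not part of the code above.

```ruby
require "tempfile"

# Hypothetical helper: read an ISO-8859-1 file, transcoding each line to
# UTF-8 on the way in, and replace curly braces with single quotes.
def read_latin1_lines(path)
  File.open(path, "r:ISO-8859-1:UTF-8") do |f|
    f.each_line.map { |line| line.gsub(/[{}]/, "'") }
  end
end

Tempfile.create("latin1") do |tmp|
  tmp.binmode
  # The bytes for "{caf", then 0xE9 (e with acute accent in ISO-8859-1), "}", newline.
  tmp.write([0x7B, 0x63, 0x61, 0x66, 0xE9, 0x7D, 0x0A].pack("C*"))
  tmp.flush

  read_latin1_lines(tmp.path).each_with_index do |line, index|
    puts "#{index} #{line}"
    puts line.encoding      # UTF-8
  end
end
```

Using the block form of File.open also closes the file when the block returns, which the chained File.open(...).each_with_index form does not do explicitly.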
