ruby, `match': invalid byte sequence in UTF-8

Question

I have a problem with UTF-8 encoding. I have read some posts here, but it still does not work properly.

This is my code:

#!/bin/env ruby
#encoding: utf-8

def determine
  file=File.open("/home/lala.txt")          
  file.each do |line|           
    puts(line)
    type = line.match(/DOG/)
    puts('aaaaa')

    if type != nil 
      puts(type[0])
      break
    end        

  end
end

These are the first three lines of my file:

;?lalalalal60000065535-1362490443-0000006334-0000018467-0000000041en-lalalalallalalalalalalalaln Cell Generation
text/lalalalala1.0.0.1515
text/lalalala�DOG

When I run this code, it raises an error exactly when it reads the third line of the file (the one containing the word DOG):

;?lalalalal60000065535-1362490443-0000006334-0000018467-0000000041en-lalalalallalalalalalalalaln Cell Generation
aaaaa

text/lalalalala1.0.0.1515
aaaaa

text/lalalala�DOG
/home/kik/Desktop/determine2.rb:16:in `match': invalid byte sequence in UTF-8 (ArgumentError)

BUT: if I run just a determine function with the following content:

#!/bin/env ruby
#encoding: utf-8

def determine
  type = "text/lalalala�DOG".match(/DOG/)
  puts(type)
end

it works perfectly.

What is going wrong there? Thanks in advance!

EDIT: The third line in the file is:

text/lalalal»DOG

BUT when I print the third line of the file in Ruby, it shows up like:

text/lalalala�DOG

EDIT2:

This format was also developed to support localization. Strings stored within the file are stored as 2-byte Unicode characters. The file is a binary file with data stored in network byte order (big-endian format).

Are you sure that character is UTF-8? It shows up as unknown for me. What's the code?
– Linuxios, Commented Mar 14, 2013 at 2:22
@Linuxios If I run the code without #encoding: utf-8, I still receive an error message with "invalid byte sequence in UTF-8", and if I run the code type="text/lalalala�DOG".match(/DOG/) it works.
– Alina, Commented Mar 14, 2013 at 2:28
In your comment that character shows up as invalid for me. If you have invalid UTF-8 sequences, the string is damaged and some methods will generate exceptions like this.
– tadman, Commented Mar 14, 2013 at 2:33
If you can determine the encoding of the file, you can open it the correct way. It might be Windows-1252 or ISO-8859-1. If I put » in a file, in UTF-8 it encodes to bytes [194, 187], not what you got. What you have is probably invalid.
– tadman, Commented Mar 14, 2013 at 4:02
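
For reference, the byte values discussed in the comments can be inspected directly in Ruby. This is only a small sketch; the helper name bytes_in is made up for illustration and is not part of any library.

```ruby
# Hypothetical helper: return the bytes of `text` after transcoding it
# to the encoding named by `encoding_name`.
def bytes_in(text, encoding_name)
  text.encode(encoding_name).bytes
end

guillemet = "\u00BB"                   # the "»" character (U+00BB)
p bytes_in(guillemet, "UTF-8")         # => [194, 187]
p bytes_in(guillemet, "ISO-8859-1")    # => [187]

# A lone 0xBB byte that is merely tagged as UTF-8 is an invalid sequence,
# which is the situation that makes String#match raise.
lone_byte = [0xBB].pack("C").force_encoding("UTF-8")
p lone_byte.valid_encoding?            # => false
```

So a file that stores » as the single byte 187 cannot be read as UTF-8 without either declaring its real encoding or treating it as binary.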

Darshan Rivka Whittle · Accepted Answer · 2013-03-14 18:21:46Z

3

I believe @Amadan is close, but has it backwards. I'd do this:

File.open("/home/lala.txt", "r:ASCII-8BIT")

The character is not valid UTF-8, but for your purposes it looks like ASCII-8BIT (Ruby's binary encoding, in which every byte is valid) will work fine. My understanding is that Ruby is using that encoding by default when you just use the string, which is why that works.
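
A minimal sketch of that idea, assuming the word you are looking for is stored as plain ASCII bytes. The method name first_match and the sample data are made up for illustration; only the File.open mode string comes from the line above.

```ruby
require "tempfile"

# Hypothetical helper: return the first text matched by `pattern` in the
# file at `path`, or nil. The file is read as raw bytes (ASCII-8BIT), so
# no byte sequence can be rejected as invalid.
def first_match(path, pattern)
  File.open(path, "r:ASCII-8BIT") do |file|
    file.each_line do |line|
      found = line.match(pattern)
      return found[0] if found
    end
  end
  nil
end

# Sample data (an assumption, not the real file): 0xBB followed by "DOG".
Tempfile.create("sample") do |tmp|
  tmp.binmode
  tmp.write("text/lalalala\xBBDOG\n".b)
  tmp.flush
  puts first_match(tmp.path, /DOG/)
end
```

An ASCII-only pattern such as /DOG/ can be matched against a binary string, so this avoids the ArgumentError. It will not find the word if the file stores each character as two bytes.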

Update: Based on your most recent comment, it sounds like this is what you need:

File.open("/home/lala.txt", "rb:UTF-16BE")
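
If the file really is big-endian UTF-16, one way to search it is to decode the raw bytes and transcode them to UTF-8 first, since as far as I know a plain /DOG/ pattern cannot be matched against a string that is still tagged as UTF-16BE. The following is a sketch under that assumption; read_utf16be_as_utf8 is a made-up name.

```ruby
# Hypothetical helper: read a file that is ASSUMED to be UTF-16BE and
# return its contents as a UTF-8 string. Malformed or unmappable
# sequences are replaced with U+FFFD instead of raising.
def read_utf16be_as_utf8(path)
  raw = File.binread(path)
  raw.force_encoding(Encoding::UTF_16BE)
     .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end
```

Usage would look like text = read_utf16be_as_utf8("/home/lala.txt") followed by text[/DOG/]. If the result is mostly replacement characters, the file is probably not UTF-16BE after all.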

edited Mar 14, 2013 at 18:21

answered Mar 14, 2013 at 3:14


5 Comments

Alina Over a year ago

It works, I mean I do not receive an error message BUT type = line.match(/DOG/) does not work. It does not find a word DOG in a file.

Alina Over a year ago

did not work: :5:in `initialize': ASCII incompatible encoding needs binmode (ArgumentError) from /home/kik/Desktop/determine2.rb:5:in `open'

Darshan Rivka Whittle Over a year ago

@Katja: I think that means you need a b in there; I just updated my answer to reflect this.

Alina Over a year ago

@DarshanComputing error again: `match': invalid byte sequence in UTF-16BE (ArgumentError)

Darshan Rivka Whittle Over a year ago

Well, then your information is wrong; it's not big-endian UTF-16 after all. Maybe it's a bug in the code generating the file.

Amadan · Accepted Answer · 2013-03-14 02:52:44Z

1

Try using this:

File.open("/home/lala.txt", "r:UTF-8")

The wrong encoding seems to be used at some stage. The #encoding: utf-8 comment specifies only the encoding of the source file, which affects how string literals are interpreted; it has no effect on the encoding that File.open uses.
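
The difference between a literal and a line read from a file can be checked with String#encoding and String#valid_encoding?. This is a rough sketch with made-up sample data; describe is a hypothetical helper, not a Ruby built-in.

```ruby
require "tempfile"

# Hypothetical helper: summarize how Ruby sees a string.
def describe(str)
  { encoding: str.encoding.name, valid: str.valid_encoding? }
end

Tempfile.create("sample") do |tmp|
  tmp.binmode
  tmp.write("text/lalalala\xBBDOG\n".b)   # a lone 0xBB is not valid UTF-8
  tmp.flush

  # The line is tagged as UTF-8, but its bytes are not valid UTF-8.
  from_file = File.open(tmp.path, "r:UTF-8") { |f| f.gets }
  p describe(from_file)

  # A literal written in a UTF-8 source file is valid UTF-8.
  p describe("text/lalalala\u00BBDOG")
end
```

The first call should report valid as false and the second as true. Note that tagging the file as UTF-8 does not repair the bytes, so match would still raise on the line read from the file.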

answered Mar 14, 2013 at 2:52


Taimoor Changaiz · Accepted Answer · 2013-05-20 13:26:49Z

-1

Simple solution when there are only a few files:

@Katja open the file in a text editor, choose Save As, change the encoding to UTF-8, and click OK. A prompt will ask whether to replace the existing file or create a new one. Replace the existing file and you are done.
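
The same conversion can be scripted when there are many files. This is only a sketch: it assumes the source encoding is known, and ISO-8859-1 is used as the default purely as an example. convert_to_utf8 is a made-up name.

```ruby
# Hypothetical helper: rewrite a file as UTF-8.
# `from` is the ASSUMED encoding of the source file; a wrong guess
# produces wrong characters rather than an error for ISO-8859-1.
def convert_to_utf8(src_path, dst_path, from: "ISO-8859-1")
  text = File.binread(src_path).force_encoding(from).encode("UTF-8")
  File.binwrite(dst_path, text)   # write the UTF-8 bytes unchanged
end
```

For example, convert_to_utf8("/home/lala.txt", "/home/lala.utf8.txt") would write a UTF-8 copy that File.open can read with its default settings. Writing to a separate destination keeps the original intact in case the encoding guess was wrong.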

answered May 20, 2013 at 13:26
