6

I have a problem with UTF-8 encoding. I have read some posts here, but it still does not work properly.

This is my code:

#!/bin/env ruby
#encoding: utf-8

def determine
  file=File.open("/home/lala.txt")          
  file.each do |line|           
    puts(line)
    type = line.match(/DOG/)
    puts('aaaaa')

    if type != nil 
      puts(type[0])
      break
    end        

  end
end

These are the first 3 lines of my file:

;?lalalalal60000065535-1362490443-0000006334-0000018467-0000000041en-lalalalallalalalalalalalaln Cell Generation
text/lalalalala1.0.0.1515
text/lalalala�DOG

When I run this code, it raises an error exactly when it reaches the third line of the file (the one containing the word DOG):

;?lalalalal60000065535-1362490443-0000006334-0000018467-0000000041en-lalalalallalalalalalalalaln Cell Generation
aaaaa

text/lalalalala1.0.0.1515
aaaaa

text/lalalala�DOG
/home/kik/Desktop/determine2.rb:16:in `match': invalid byte sequence in UTF-8 (ArgumentError)

BUT: if I run just a determine function with the following content:

#!/bin/env ruby
#encoding: utf-8

def determine
  type="text/lalalala�DOG".match(/DOG/)
  puts(type)
end

it works perfectly.

What is going wrong there? Thanks in advance!

EDIT: The third line in the file is:

text/lalalal»DOG

BUT when I print the third line of the file in Ruby, it shows up like this:

text/lalalala�DOG

EDIT2:

This format was also developed to support localization. Strings stored within the file are stored as 2 byte UNICODE characters. The format of the file is a binary file with data stored in network byte order (big-endian format).

  • Are you sure that character is UTF-8? It shows up as unknown for me. What's the code? Commented Mar 14, 2013 at 2:22
  • @Linuxios If I run the code without #encoding: utf-8, I still receive an error message with "invalid byte sequence in UTF-8", and if I run the code type="text/lalalala�DOG".match(/DOG/), it works. Commented Mar 14, 2013 at 2:28
  • What's the character code? Commented Mar 14, 2013 at 2:33
  • In your comment that character shows up as invalid for me. If you have invalid UTF-8 sequences, the string is damaged and some methods will raise exceptions like this. Commented Mar 14, 2013 at 2:33
  • If you can determine the encoding of the file, you can open it the correct way. It might be Windows-1252 or ISO-8859-1. If I put » in a file, in UTF-8 it encodes to bytes [194, 187], not what you got. What you have is probably invalid. Commented Mar 14, 2013 at 4:02
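
The byte-level point in the last comment can be checked directly. This is only a sketch: it assumes the byte in the file is a lone 0xBB (which is » in ISO-8859-1), something the question does not confirm, and `latin1_to_utf8` is a hypothetical helper name.

```ruby
# encoding: utf-8

# Hypothetical helper: treat the raw bytes as ISO-8859-1 and transcode to UTF-8.
def latin1_to_utf8(raw)
  raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

guillemet = "\u00BB"                       # the » character
p guillemet.bytes                          # → [194, 187] (UTF-8 form, two bytes)
p guillemet.encode("ISO-8859-1").bytes     # → [187]      (ISO-8859-1 form, one byte)

# ASSUMPTION: the file line contains the single byte 0xBB before "DOG".
raw = "text/lalalala\xBBDOG".b

# Tagged as UTF-8, the lone 0xBB byte makes the string invalid.
p raw.dup.force_encoding("UTF-8").valid_encoding?   # → false

# Reinterpreted as ISO-8859-1 and transcoded, the regex match works.
fixed = latin1_to_utf8(raw)
p fixed.match(/DOG/)[0]                    # → "DOG"
```

If the file really were ISO-8859-1, opening it with "r:ISO-8859-1:UTF-8" would do the same transcoding at read time.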

3 Answers

3

I believe @Amadan is close, but has it backwards. I'd do this:

File.open("/home/lala.txt", "r:ASCII-8BIT")

The character is not valid UTF-8, but for your purposes, it looks like 8-bit ASCII will work fine. My understanding is that Ruby is using that encoding by default when you just use the string, which is why that works.

Update: Based on your most recent comment, it sounds like this is what you need:

File.open("/home/lala.txt", "rb:UTF-16BE")
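
A minimal sketch of how that mode string behaves, assuming the whole file is well-formed UTF-16BE (which may not hold for the file in the question). The temp file and the helper name `read_utf16be_as_utf8` are made up for illustration.

```ruby
require "tempfile"

# Hypothetical helper: read a whole file as UTF-16BE and return it as UTF-8.
def read_utf16be_as_utf8(path)
  # "b" is required because UTF-16BE is not ASCII-compatible.
  text = File.open(path, "rb:UTF-16BE") { |f| f.read }
  # A UTF-8 regex such as /DOG/ cannot be applied to a UTF-16BE string,
  # so transcode before matching.
  text.encode("UTF-8")
end

Tempfile.create("utf16be") do |tmp|
  tmp.binmode
  # Sample content only; written as raw UTF-16BE bytes.
  tmp.write("text/lalalala\u00BBDOG".encode("UTF-16BE").b)
  tmp.close

  line = read_utf16be_as_utf8(tmp.path)
  puts line.match(/DOG/)[0]
end
```

If the data is not valid UTF-16BE, the encode call raises, which is consistent with the error reported in the comments.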

5 Comments

It works, I mean I do not receive an error message, BUT type = line.match(/DOG/) does not work. It does not find the word DOG in the file.
That did not work: :5:in `initialize': ASCII incompatible encoding needs binmode (ArgumentError) from /home/kik/Desktop/determine2.rb:5:in `open'
@Katja: I think that means you need a b in there; I just updated my answer to reflect this.
@DarshanComputing error again: `match': invalid byte sequence in UTF-16BE (ArgumentError)
Well, then your information is wrong; it's not big-endian UTF-16 after all. Maybe it's a bug in the code generating the file.
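
If the file is a binary format that only embeds 2-byte strings (as EDIT2 in the question suggests), one option is to skip decoding entirely and search the raw bytes for the UTF-16BE form of the word. This is a sketch under that assumption; `find_utf16be` is a hypothetical helper and the sample data is invented.

```ruby
# Hypothetical helper: byte offset of `word` stored as UTF-16BE inside
# binary data, or nil if the byte pattern is absent.
def find_utf16be(data, word)
  needle = word.encode("UTF-16BE").b   # "DOG" becomes \x00D\x00O\x00G
  data.b.index(needle)
end

# Invented sample: three arbitrary bytes, then "DOG" as UTF-16BE, then 0xFF.
data = "\x01\x02\xBB".b + "DOG".encode("UTF-16BE").b + "\xFF".b

p find_utf16be(data, "DOG")   # → 3
p data.index("DOG".b)         # → nil, the plain ASCII bytes are not contiguous
```

This also explains why /DOG/ found nothing with ASCII-8BIT: every letter is preceded by a zero byte. A byte search does not check 2-byte alignment, so a hit at an odd offset inside string data could be a false positive.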
1

Try using this:

File.open("/home/lala.txt", "r:UTF-8")

There seems to be an issue with the wrong encoding being used at some stage. The #encoding: utf-8 magic comment specifies only the encoding of the source file, which affects how string literals are interpreted; it has no effect on the encoding that File.open uses.
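
To illustrate the difference: a line read with "r:UTF-8" is only tagged as UTF-8, not validated, so invalid bytes surface later in match. A sketch, assuming it is acceptable to replace the bad bytes; the helper name `first_line_scrubbed` and the sample bytes are hypothetical.

```ruby
# encoding: utf-8
require "tempfile"

# Hypothetical helper: read the first line as UTF-8 and replace any
# invalid byte sequences with "?" so that regex matching cannot raise.
def first_line_scrubbed(path)
  line = File.open(path, "r:UTF-8") { |f| f.gets }
  line.valid_encoding? ? line : line.scrub("?")
end

Tempfile.create("lala") do |tmp|
  tmp.binmode
  # ASSUMPTION: a lone 0xBB byte, which is not valid UTF-8.
  tmp.write("text/lalalala\xBBDOG\n".b)
  tmp.close

  line = first_line_scrubbed(tmp.path)
  puts line.match(/DOG/)[0]
end
```

String#scrub is available from Ruby 2.1 on. It loses the original byte, so it only suits cases where the damaged character itself does not matter.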

-1

A simple solution if there are only a few files:

@Katja Open the file in a text editor, choose Save As, change the encoding to UTF-8, and click OK. A dialog will ask whether to replace the existing file or create a new one. Replace the existing file and you are done.
