What I want to solve

The following error occurs when I download the files compressed into a single zip file:

invalid byte sequence in UTF-8.

To get rid of this error, I need to remove the characters that are invalid in UTF-8 from the string, so I used the encode method to convert from UTF-8 to UTF-8. However, the file name I want to display is then not displayed correctly. It looks like the image.

file_name.encode!("UTF-8", "UTF-8", invalid: :replace)

enter image description here

Is there any solution to this problem?

I would be glad to know.

Source code

        Zip::File.open_buffer(obj) do |zip|

          zip.each do |entry|
            ext = File.extname(entry.name)
            file_name = File.basename(entry.name)

            # file_name.encode!("UTF-8", "UTF-8", invalid: :replace)

            next if ext.blank? || file_name.count(".") > 1

            dir = File.join(dir_name, File.dirname(entry.name))

            FileUtils.mkpath(dir.to_s)      

            zip.extract(entry, dir + ".txt" || ".jpg" || ".csv") {true}

            file_name.force_encoding("UTF-8")
            new_file_name = "#{dir_name}/#{file_name}"

            new_file_name.force_encoding("UTF-8")
            File.rename(dir + ".txt" || ".jpg" || ".csv", new_file_name)

            @input_dir << new_file_name
          end
        end
        
        Zip::OutputStream.open(zip_file.path) do |zip_data|
          @input_dir.each do |file|
            zip_data.put_next_entry(file)
            zip_data.write(File.read(file.to_s))
          end
        end

Environment

macOS Catalina 10.15.7, Ruby 2.6.3

  • Can you show file_name.codepoints along with the expected result? Commented Dec 9, 2020 at 9:25
  • @Stefan Here is file_name.codepoints: [87, 78, 83, 95, 85, 80, 65533, 112, 65533, 102, 65533, 91, 65533, 94, 46, 116, 120, 116] However, to display these code points I had to re-enable the code in (1). Is this correct? (1) file_name.encode!("UTF-8", "UTF-8", invalid: :replace) Commented Dec 9, 2020 at 13:09
  • 65533 is the replacement character, i.e. �. It seems like you ran the code after the conversion? Sorry for not being clear. Please run entry.name.codepoints and also entry.name.encoding and post their output. Commented Dec 9, 2020 at 14:47
  • @Stefan entry.name.codepoints could not be displayed; it fails with the following error: invalid byte sequence in UTF-8. Instead, I tried outputting file_name.codepoints for a file with the expected name: [87, 78, 83, 95, 85, 80, 29992, 12486, 12441, 12540, 12479, 46, 116, 120, 116] entry.name.encoding is UTF-8. Commented Dec 9, 2020 at 21:23
  • Try entry.name.bytes then. Commented Dec 9, 2020 at 21:31

1 Answer

You get these errors because the Zip gem assumes the filenames to be encoded in UTF-8 but they are actually in a different encoding.
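As a rough illustration of this mismatch, here is a sketch that uses only the Ruby standard library, not the Zip gem itself. It takes the file-name bytes from the string re-created in this answer, labels them as UTF-8 without converting them, and reproduces both the error message and the replacement characters reported in the comments. The variable names are made up for the example.

```ruby
# Bytes of the file-name part only, as they appear in the archive entry.
name_bytes = [87, 78, 83, 95, 85, 80, 151, 112, 131, 102, 129, 91,
              131, 94, 46, 116, 120, 116]

# Label the raw bytes as UTF-8 without converting them.
mislabeled = name_bytes.pack('C*').force_encoding('UTF-8')

# false: 0x97, 0x83 and 0x81 are continuation bytes and cannot start a
# UTF-8 character.
mislabeled.valid_encoding?

# Methods that have to decode the string raise ArgumentError.
error_message =
  begin
    mislabeled.codepoints
    nil
  rescue ArgumentError => e
    e.message # "invalid byte sequence in UTF-8"
  end

# Scrubbing only swaps each invalid byte for U+FFFD (65533); the original
# Japanese characters are not recovered this way.
scrubbed = mislabeled.scrub
scrubbed.codepoints
```

This is why replacing invalid bytes makes the error go away but still does not show the intended name: the bytes have to be interpreted in their real encoding first.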

To fix the error, you first have to find the correct encoding. Let's re-create the string from its bytes:

bytes = [111, 117, 116, 112, 117, 116, 50, 48, 50, 48, 49,
         50, 48, 55, 95, 49, 52, 49, 54, 48, 50, 47, 87,
         78, 83, 95, 85, 80, 151, 112, 131, 102, 129, 91,
         131, 94, 46, 116, 120, 116]

string = bytes.pack('C*')
#=> "output20201207_141602/WNS_UP\x97p\x83f\x81[\x83^.txt"

We can now traverse the Encoding.list and select those that return the expected result:

Encoding.list.select do |enc|
  s = string.encode('UTF-8', enc) rescue next
  s.end_with?('WNS_UP用データ.txt')
end
#=> [
#     #<Encoding:Windows-31J>,
#     #<Encoding:Shift_JIS>,
#     #<Encoding:SJIS-DoCoMo>,
#     #<Encoding:SJIS-KDDI>,
#     #<Encoding:SJIS-SoftBank>
#   ]

All of the above encodings result in the correct output.
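One caveat when comparing against an expected name: the code points posted in the comments for the expected name contain 12486 followed by 12441, which is テ plus a combining voiced sound mark, i.e. the decomposed form of デ. The name converted from the archive contains the precomposed デ (12487). A plain string comparison between the two may therefore fail even though they look identical. The sketch below assumes the expected name comes from such a decomposed source and normalizes both sides before comparing; the variable names are illustrative only.

```ruby
# Expected name as posted in the comments (decomposed: 12486 + 12441).
expected = [87, 78, 83, 95, 85, 80, 29992, 12486, 12441, 12540, 12479,
            46, 116, 120, 116].pack('U*')

# Name obtained by converting the raw archive bytes from Windows-31J.
raw = [87, 78, 83, 95, 85, 80, 151, 112, 131, 102, 129, 91,
       131, 94, 46, 116, 120, 116].pack('C*')
converted = raw.encode('UTF-8', 'Windows-31J')

# The byte sequences differ, so a direct comparison does not match.
plain_match = (converted == expected)

# After normalizing both sides to NFC the two names compare equal.
normalized_match =
  converted.unicode_normalize(:nfc) == expected.unicode_normalize(:nfc)
```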

Back to your code, you could use:

path = entry.name.encode('UTF-8', 'Windows-31J')
#=> "output20201207_141602/WNS_UP用データ.txt"

ext = File.extname(path)
#=> ".txt"

file_name = File.basename(path)
#=> "WNS_UP用データ.txt"
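If the archives you receive are a mix of UTF-8 and Windows-31J names, one possible approach is to convert only when the name is not already valid UTF-8. The helper below is a hypothetical sketch, not part of the Zip gem; its name and the Windows-31J default are assumptions based on this particular archive, and the validity check is only a heuristic.

```ruby
# Hypothetical helper (not part of the Zip gem): returns a UTF-8 copy of an
# entry name. The Windows-31J fallback is an assumption based on this archive.
def utf8_entry_name(raw_name, fallback: 'Windows-31J')
  candidate = raw_name.dup.force_encoding('UTF-8')
  return candidate if candidate.valid_encoding?

  # Interpret the same bytes in the fallback encoding and transcode them;
  # bytes that cannot be converted become U+FFFD instead of raising.
  raw_name.encode('UTF-8', fallback, invalid: :replace, undef: :replace)
end

sjis_name = [87, 78, 83, 95, 85, 80, 151, 112, 131, 102, 129, 91,
             131, 94, 46, 116, 120, 116].pack('C*')

converted_name = utf8_entry_name(sjis_name)    # "WNS_UP用データ.txt"
ascii_name     = utf8_entry_name('readme.txt') # returned unchanged
```

In the loop from the question this would be used as path = utf8_entry_name(entry.name) before calling File.extname and File.basename.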

The Zip gem also has an option to set an explicit encoding for non-ASCII file names. You might want to try setting Zip.force_entry_names_encoding = 'Windows-31J' (I haven't tried it).

2 Comments

I'm impressed. This solution solved the problem. Setting Zip.force_entry_names_encoding = 'Windows-31J' also worked; I had never imagined doing it this way. I'm a newbie, so I'll be posting more questions. I really appreciate your help in resolving the problem.
@taizo you’re welcome. If this was helpful and solved your problem, you might want to upvote it and tick the green checkmark.
