How can I fix UTF-8 string usage in bash?

Question

I have a bash script that contains several variables holding UTF-8 strings. These variables are passed as parameters to a bash function in the script, which calls cp and a Python script with these parameters.

This script runs properly on my machine but does not work on another one. I tried to debug it with set -x and other things, but I cannot find the root cause, only this difference.

Here is a minimal example - like Plunker for JS ;)

I have the following test.sh:

#!/bin/bash
set -x

function aaa() {
    echo "$1"
}
echo 'öüóőúéáűíÖÜÓŐÚÉÁŰÍ'
aaa 'öüóőúéáűíÖÜÓŐÚÉÁŰÍ'

I copied it to my two hosts.

The good host shows the following:

+ echo öüóőúéáűíÖÜÓŐÚÉÁŰÍ
öüóőúéáűíÖÜÓŐÚÉÁŰÍ
+ aaa öüóőúéáűíÖÜÓŐÚÉÁŰÍ
+ echo öüóőúéáűíÖÜÓŐÚÉÁŰÍ
öüóőúéáűíÖÜÓŐÚÉÁŰÍ

However, the bad host shows this:

+ echo $'\303\266\303\274\303\263\305\221\303\272\303\251\303\241\305\261\303\255\303\226\303\234\303\223\305\220\303\232\303\211\303\201\305\260\303\215'
öüóőúéáűíÖÜÓŐÚÉÁŰÍ
+ aaa $'\303\266\303\274\303\263\305\221\303\272\303\251\303\241\305\261\303\255\303\226\303\234\303\223\305\220\303\232\303\211\303\201\305\260\303\215'
+ echo $'\303\266\303\274\303\263\305\221\303\272\303\251\303\241\305\261\303\255\303\226\303\234\303\223\305\220\303\232\303\211\303\201\305\260\303\215'
öüóőúéáűíÖÜÓŐÚÉÁŰÍ

Here are some details for debugging:

The working machine is Ubuntu Trusty with bash=4.2-2ubuntu2.6, and the failing machine is Ubuntu Precise with bash=4.3-7ubuntu1.5.

The locales are identical on both machines:

$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE=en_US.UTF-8
LC_NUMERIC=en_US.UTF-8
LC_TIME=en_US.UTF-8
LC_COLLATE=en_US.UTF-8
LC_MONETARY=en_US.UTF-8
LC_MESSAGES=POSIX
LC_PAPER=en_US.UTF-8
LC_NAME=en_US.UTF-8
LC_ADDRESS=en_US.UTF-8
LC_TELEPHONE=en_US.UTF-8
LC_MEASUREMENT=en_US.UTF-8
LC_IDENTIFICATION=en_US.UTF-8
LC_ALL=

Updates

I was wrong about cp, sorry.

I thought the Python exception was unrelated in this case, because the problem seemed to be in bash. Could this backtrace help?

+ /tmp/callrecord-renamer.py --skip --contacts $'/var/datastore/T\303\274nci/Rendszer/DropboxClone/contacts.ini' $'/var/datastore/T\303\274nci/DropboxClone/H\303\215V\303\201SFELV\303\211TELEK'
Traceback (most recent call last):
  File "/tmp/callrecord-renamer.py", line 316, in <module>
    main()
  File "/tmp/callrecord-renamer.py", line 312, in main
    FileManager(args.recording_path, contacts_path, args.no_change, args.skip_errors).update_files_in_directory()
  File "/tmp/callrecord-renamer.py", line 87, in update_files_in_directory
    self.contacts.load()
  File "/tmp/callrecord-renamer.py", line 56, in load
    self.database.read(self.file_path)
  File "/usr/lib/python3.2/configparser.py", line 689, in read
    self._read(fp, filename)
  File "/usr/lib/python3.2/configparser.py", line 994, in _read
    for lineno, line in enumerate(fp, start=1):
  File "/usr/lib/python3.2/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 3176: invalid start byte

For more details, see the script at: https://github.com/andras-tim/callrecord-renamer/blob/master/callrecord-renamer.py

Update2

I have checked: this error was caused independently of the bash code. The .ini file's encoding was bad... Sorry to everyone who helped debug!
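For anyone hitting the same traceback, here is a rough sketch of how the file encoding could be checked from the shell before running the Python script. The helper name is_valid_utf8 and the two throwaway files are made up for illustration (they stand in for contacts.ini); the only real tool used is iconv, which fails when its input is not valid in the stated source encoding.

```shell
#!/bin/bash
# is_valid_utf8 FILE: hypothetical helper, succeeds only if FILE decodes as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Demo on two throwaway files (stand-ins for contacts.ini).
tmpdir=$(mktemp -d)
printf 'name = T\303\274nci\n' > "$tmpdir/utf8.ini"    # ü as UTF-8 (0xC3 0xBC)
printf 'name = T\374nci\n'     > "$tmpdir/latin.ini"   # ü as a single 0xFC byte

for f in "$tmpdir/utf8.ini" "$tmpdir/latin.ini"; do
    if is_valid_utf8 "$f"; then
        echo "${f##*/}: valid UTF-8"
    else
        echo "${f##*/}: NOT valid UTF-8"
    fi
done
rm -rf "$tmpdir"
```

The byte 0xfc from the traceback is how ü is stored in ISO-8859-1 and ISO-8859-2, so the file was probably saved in a single-byte encoding, but that is a guess. If the original encoding is known, something like iconv -f ISO-8859-2 -t UTF-8 old.ini > new.ini should convert it.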

I'm not sure that you actually have a problem. The output is correct in both cases; you are just getting a different (but valid) representation in the debugging output on the "bad" host.
– chepner, Commented Oct 22, 2015 at 17:22
I found this article, stackoverflow.com/questions/11838597/…, but it did not solve my problem... :(
– andras.tim, Commented Oct 22, 2015 at 17:23
@chepner cp cannot find the source path, even though it exists.
– andras.tim, Commented Oct 22, 2015 at 17:25
If you are having a problem with cp then show us the problem with cp and not some other problem entirely.
– Etan Reisner, Commented Oct 22, 2015 at 17:29
This doesn't appear to be a shell issue, but a problem with cp on the bad host in dealing with a UTF-8 encoded string. The bad host is just showing the raw UTF-8 stream, rather than displaying the encoded Unicode characters. The data is the same on both machines (\303\266, for example, is in octal; the two bytes are 0xC3 and 0xB6, which is the UTF-8 encoding for U+00F6, ö).
– chepner, Commented Oct 22, 2015 at 17:32
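
The point in the last comment can be checked directly. The following is only a small sketch: bytes_octal is a made-up helper name, and it assumes the script file itself is saved as UTF-8, so that the literal ö is stored as the two bytes 0xC3 0xB6.

```shell
#!/bin/bash
# bytes_octal STRING: hypothetical helper, prints the bytes of STRING in octal.
bytes_octal() {
    printf '%s' "$1" | od -An -to1 | tr -d ' \n'
}

escaped=$'\303\266'   # form shown by xtrace on the "bad" host
literal='ö'           # form shown on the "good" host (file must be saved as UTF-8)

# A plain string comparison is enough: both spellings yield the same bytes.
if [ "$escaped" = "$literal" ]; then
    echo "same bytes: $(bytes_octal "$literal")"
else
    echo "different bytes"
fi
```

If the comparison succeeds, the two xtrace spellings denote the same argument and the difference is purely in how the trace is displayed.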

that other guy · Accepted Answer · 2015-10-22 17:33:08Z

You are comparing the xtrace debugging output of set -x. You cannot and should not expect bash's xtrace output to be in a certain format. If you want a specific format, you need to produce it yourself.

If you look at the non-debug output of your script, it's identical on both machines.
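
As one possible way of "producing it yourself", here is a minimal sketch of a wrapper that prints each command in a fixed format and then runs it. The name trace and the bracketed layout are arbitrary choices for illustration, not anything built into bash; aaa is the function from the question.

```shell
#!/bin/bash
# trace CMD [ARGS...]: hypothetical helper that prints the command line to
# stderr in a format we control, then runs it. printf '%s' writes the raw
# bytes of each argument, so nothing is re-quoted or escaped.
trace() {
    local arg
    {
        printf '+'
        for arg in "$@"; do
            printf ' [%s]' "$arg"
        done
        printf '\n'
    } >&2
    "$@"
}

aaa() {
    echo "$1"
}

trace aaa 'öüóőúéáűíÖÜÓŐÚÉÁŰÍ'
```

Because the trace line is written by the script itself, it looks the same on every host regardless of how that bash version chooses to quote its own xtrace output. The brackets also make argument boundaries visible, which helps with paths containing spaces.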

1 Comment

andras.tim Over a year ago

I have been debugging bash code for several years, but I had never seen this escaping from set -x until now. I blamed the encoding because I have now checked on a third machine (also Precise), where this script worked.
