Linux shell: conditional conversion of character encoding, multiple text files

Question

The situation: I have around 20,000 text files (.csv, to be precise) that differ in character encoding: file -i *.csv reports charset=us-ascii for most of them, but charset=utf-16le for some.

The goal: I want them all to be encoded the same way, us-ascii here. I am thinking of a one-liner that checks the encoding of each file in the directory and, if it is utf-16le, converts the file to us-ascii.

I only started to learn bash programming a few days ago, so this one still escapes me. Is it possible to do something like running file -i on each file (did that), capturing its output, checking which encoding is reported and, if it is not us-ascii, converting the file?

Thanks for helping me understand how to do that!

flaschenpost · Accepted Answer · 2013-05-13 06:18:22Z

2

The other solutions ignore the fact that the files are a mixture of encodings, which calls for something along these lines:

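for F in *.csv; do
    # -b omits the file name, so the charset is always the second field
    if [ "$(file -bi "$F" | awk '{print $2}')" = "charset=utf-16le" ]; then
        # UTF-16 (rather than UTF-16LE) lets iconv consume the byte-order mark
        iconv -f UTF-16 -t US-ASCII "$F" > "u.$F"
    fi
done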

What makes this easier is that us-ascii and utf-16 agree on the first 128 code points, so if a utf-16 file really contains nothing but us-ascii characters, the conversion does no harm.
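To make that point concrete, here is a minimal sketch of a per-file helper. The name utf16_to_ascii is hypothetical, and the sketch assumes the file starts with a byte-order mark and that iconv exits with a non-zero status when a character has no us-ascii equivalent (as GNU iconv does by default).

```shell
#!/bin/bash
# Hypothetical helper: convert one UTF-16 file to US-ASCII in place.
# Assumes the file starts with a byte-order mark, which -f UTF-16 uses
# to pick the byte order.
utf16_to_ascii() {
    local src=$1
    if iconv -f UTF-16 -t US-ASCII "$src" > "$src.tmp"; then
        mv "$src.tmp" "$src"    # every character fit into US-ASCII
    else
        rm -f "$src.tmp"        # conversion failed: keep the original untouched
        return 1
    fi
}
```

It could be called as utf16_to_ascii data.csv || echo "data.csv was not converted", so that files containing characters outside us-ascii are reported instead of being silently damaged.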

edited May 13, 2013 at 6:18

answered May 12, 2013 at 21:15

flaschenpost

Community · Accepted Answer · 2017-05-23 12:15:13Z

1

Please try the following command:

iconv -f FROM-ENCODING -t TO-ENCODING *.csv

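and replace FROM-ENCODING and TO-ENCODING with appropriate values. Note that with several input files, iconv writes all of the converted text, concatenated, to standard output.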

You can use the following script, or something similar for your needs.

for file in  *.csv
do
    iconv -f FROM-ENCODING -t TO-ENCODING "$file" > "$file.new"
done

You can also use the recode command, which converts the file in place:

recode FROM-ENCODING..TO-ENCODING file.csv

Finally, see the question "Best way to convert text files between character sets?" if you are interested in learning more about iconv and/or recode.

edited May 23, 2017 at 12:15

CommunityBot

answered May 12, 2013 at 21:03

Bill

6 Comments

Adrian Frühwirth Over a year ago

Parsing the output of ls is harmful, use globbing.

Bill Over a year ago

@AdrianFrühwirth Yes, when filenames have spaces, this can be a problem....thanks.

Adrian Frühwirth Over a year ago

You also need to quote your variables, otherwise it doesn't fix anything ;-)

Adrian Frühwirth Over a year ago

I only see $file, but if there are others feel free to quote them as well. Always quote, especially when dealing with filenames.

Bill Over a year ago

@AdrianFrühwirth Thanks a lot for helping to improve the answer :)

rzymek · Accepted Answer · 2013-05-13 07:28:27Z

1

This will convert any non-us-ascii encoded *.csv files to us-ascii:

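#!/bin/bash
for f in *.csv; do
    # -b omits the file name; keep only the value after "charset="
    charset=$(file -bi "$f" | grep -o 'charset=.*' | cut -d= -f2)
    if [ "$charset" != "us-ascii" ]; then
        echo "$f $charset -> us-ascii"
        # replace the original only if the conversion succeeded
        iconv -f "$charset" -t us-ascii < "$f" > "$f.tmp" \
            && mv "$f.tmp" "$f"
    fi
done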
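One possible refinement, shown only as a sketch: keep the decision (which charset needs converting, and from what) in a small function so it can be checked on its own. The names iconv_source_for and convert_csvs are hypothetical. The sketch assumes that files reported as utf-16le or utf-16be start with a byte-order mark, so UTF-16 is passed to iconv and the mark is consumed rather than converted, and that files reported as binary or unknown-8bit should be left alone.

```shell
#!/bin/bash
# Map the charset reported by file(1) to an iconv source encoding.
# Prints nothing and returns 1 when the file should be left alone.
iconv_source_for() {
    case $1 in
        us-ascii)               return 1 ;;    # already in the target encoding
        binary|unknown-8bit|'') return 1 ;;    # assumption: nothing iconv can decode
        utf-16le|utf-16be)      echo UTF-16 ;; # assumption: BOM present, let iconv read it
        *)                      echo "$1" ;;
    esac
}

# Convert every *.csv in the given directory (default: current) to US-ASCII.
convert_csvs() {
    local dir=${1:-.} f charset src
    for f in "$dir"/*.csv; do
        [ -f "$f" ] || continue                # the glob matched nothing
        charset=$(file -bi "$f" | sed -n 's/.*charset=//p')
        src=$(iconv_source_for "$charset") || continue
        echo "$f: $charset -> us-ascii"
        if iconv -f "$src" -t US-ASCII "$f" > "$f.tmp"; then
            mv "$f.tmp" "$f"
        else
            rm -f "$f.tmp"                     # keep the original on failure
        fi
    done
}
```

Calling convert_csvs with no argument processes the current directory; convert_csvs /path/to/files processes another one. Files that are already us-ascii are never touched, which matters with 20,000 files of which only a few need converting.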

edited May 13, 2013 at 7:28

answered May 12, 2013 at 21:17

rzymek

1 Comment

Adrian Frühwirth Over a year ago

Please quote your variables to account for spaces in filenames.
