The situation: I have a bunch of text files (.csv, to be precise), around 20000 of them, that differ in character encoding: file -i *.csv gives me charset=us-ascii for most, but charset=utf-16le for some.

The goal: I want them all to be encoded the same way, us-ascii here. I am thinking of a one-liner that checks the encoding of each file in the directory and, if it is utf-16le, converts the file to us-ascii.

I only started to learn bash programming a few days ago, so this one still escapes me. Is something like this possible: run file -i on each file (did that), capture the output, check which encoding is reported and, if it is not us-ascii, convert the file?

Thanks for helping me understand how to do that!

3 Answers

2

The other solutions don't account for the mixture of encodings. Since only some of the files need converting, check each file first and convert only the UTF-16 ones, along these lines:

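for F in *.csv; do
    # file reports e.g. "utf-16le"; -b omits the file name, so spaces in names are safe
    if [[ $(file -b --mime-encoding "$F") == utf-16* ]]; then
        # the converted copy is written next to the original as u.<name>
        iconv -f UTF-16 -t US-ASCII "$F" > "u.$F"
    fi
done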

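What makes it easier is that the first 128 code points of UTF-16 are exactly the us-ascii characters, so a UTF-16 file that contains only those characters converts without loss. The bytes on disk differ, though (UTF-16 stores each of these characters in two bytes), so the conversion should only run on files that file reports as UTF-16, which is what the check above ensures.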
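
To make the relationship between the two encodings concrete, here is a small sketch that prints the bytes of a short ASCII string after encoding it as UTF-16LE. The helper name utf16le_bytes is made up for illustration, and it assumes iconv, od and tr are available.

```shell
# Hypothetical helper: show the bytes of an ASCII string encoded as UTF-16LE.
utf16le_bytes() {
    # od prints each byte as two hex digits; tr strips the spaces and newlines.
    printf '%s' "$1" | iconv -f US-ASCII -t UTF-16LE | od -An -tx1 | tr -d ' \n'
    echo
}

utf16le_bytes 'a,b'   # prints 61002c006200
```

Each character keeps its ASCII value but is followed by a zero byte, so the characters are the same while the bytes are not.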

1

Please try the following command:

iconv -f FROM-ENCODING -t TO-ENCODING *.csv

and replace FROM-ENCODING and TO-ENCODING with appropriate values. Note that this writes the converted contents of all files, concatenated, to standard output.

To get one output file per input file, you can use the following script, or something similar:

for file in *.csv
do
    iconv -f FROM-ENCODING -t TO-ENCODING "$file" > "$file.new"
done
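
For the case in the question, the placeholders could be filled in as below. This is only a sketch: convert_to_new is a hypothetical function name, the encodings UTF-16LE and US-ASCII are taken from the question, and it should only be given files that really are UTF-16LE, because iconv does not check the input encoding for you.

```shell
# Sketch: FROM-ENCODING=UTF-16LE, TO-ENCODING=US-ASCII (assumed from the question).
# convert_to_new is a hypothetical name; it takes the files to convert as arguments.
convert_to_new() {
    for file in "$@"; do
        # Keep the .new file only when iconv reports success.
        if ! iconv -f UTF-16LE -t US-ASCII "$file" > "$file.new"; then
            echo "could not convert $file" >&2
            rm -f "$file.new"
        fi
    done
}
```

It would be called as convert_to_new followed by the file names; the originals are left untouched.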

You can also use the recode command, which converts the file in place:

recode FROM-ENCODING..TO-ENCODING file.csv

Finally, see the question "Best way to convert text files between character sets?" if you are interested in learning more about iconv and/or recode.

6 Comments

Parsing the output of ls is harmful, use globbing.
@AdrianFrühwirth Yes, when filenames have spaces, this can be a problem....thanks.
You also need to quote your variables, otherwise it doesn't fix anything ;-)
I only see $file, but if there are others feel free to quote them as well. Always quote, especially when dealing with filenames.
@AdrianFrühwirth Thanks a lot for helping to improve the answer :)
1

This will convert any non-us-ascii encoded *.csv files to us-ascii:

#!/bin/bash
for f in *.csv; do
    # ask file for the charset of the current file, e.g. "utf-16le"
    charset=$(file -i "$f" | grep -o 'charset=.*' | cut -d= -f2)
    if [ "$charset" != "us-ascii" ]; then
        echo "$f $charset -> us-ascii"
        # replace the original only if iconv succeeded
        iconv -f "$charset" -t us-ascii < "$f" > "$f.tmp" \
            && mv "$f.tmp" "$f"
    fi
done
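
A slightly more defensive variant of this loop, as a sketch only. The function name csv_to_ascii is hypothetical, and the code assumes that the UTF-16 files start with a byte order mark. It passes -f UTF-16 so that iconv reads the mark itself; as far as I know, some iconv implementations given -f UTF-16LE treat a leading mark as an ordinary character, which us-ascii cannot represent.

```shell
# Sketch: convert every UTF-16 *.csv in a directory to us-ascii, in place.
# csv_to_ascii is a hypothetical name; the directory defaults to ".".
csv_to_ascii() {
    dir=${1:-.}
    for f in "$dir"/*.csv; do
        [ -f "$f" ] || continue              # the glob matched nothing
        charset=$(file -b --mime-encoding "$f")
        case $charset in
            utf-16le|utf-16be)
                # Assumption: the file starts with a byte order mark,
                # which -f UTF-16 uses to pick the byte order.
                tmp=$(mktemp "$f.XXXXXX") || return 1
                if iconv -f UTF-16 -t US-ASCII "$f" > "$tmp"; then
                    mv "$tmp" "$f"
                else
                    echo "skipped $f: conversion to us-ascii failed" >&2
                    rm -f "$tmp"
                fi
                ;;
        esac
    done
}
```

Files that are already us-ascii are never touched, and a file containing characters outside us-ascii is left as it was, with a message on standard error.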

1 Comment

Please quote your variables to account for spaces in filenames.
