I am writing a function that should detect the charset in use and then convert the data to UTF-8. I am using juniversalchardet, which is a Java port of Mozilla's universalchardet.
This is my code:

private List<List<String>> setProperEncoding(List<List<String>> input) {
    try {

        // Detect used charset
        UniversalDetector detector = new UniversalDetector(null);

        int position = 0;
        while ((position < input.size()) & (!detector.isDone())) {
            String row = null;
            for (String cell : input.get(position)) {
                row += cell;
            }
            byte[] bytes = row.getBytes();
            detector.handleData(bytes, 0, bytes.length);
            position++;
        }
        detector.dataEnd();

        Charset charset = Charset.forName(detector.getDetectedCharset());
        Charset utf8 = Charset.forName("UTF-8");
        System.out.println("Detected charset: " + charset);

        // rewrite input using proper charset
        List<List<String>> newLines = new ArrayList<List<String>>();
        for (List<String> row : input) {
            List<String> newRow = new ArrayList<String>();
            for (String cell : row) {
                //newRow.add(new String(cell.getBytes(charset)));
                ByteBuffer bb = ByteBuffer.wrap(cell.getBytes(charset));
                CharBuffer cb = charset.decode(bb);
                bb = utf8.encode(cb);
                newRow.add(new String(bb.array()));
            }
            newLines.add(newRow);
        }

        return newLines;

    } catch (Exception e) {
        e.printStackTrace();
        return input;
    }
}

My problem is that when I read a file containing characters from, for example, the Polish alphabet, letters like ł, ą, ć and similar are replaced by ? and other strange characters. What am I doing wrong?

EDIT: For compilation I am using Eclipse.

The method parameter is the result of reading a MultipartFile. I just use a FileInputStream to get every line and then split every line by some separator (it is prepared for xls, xlsx and csv files). Nothing special there.
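For context, that reading step amounts to something like the following sketch (the real code is not shown; the separator, the file and the method name here are placeholders):

// Hypothetical sketch of the reading step described above; the separator,
// the file and the method name are placeholders, not the actual code.
private List<List<String>> readLines(File file) throws IOException {
    List<List<String>> input = new ArrayList<List<String>>();
    BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(file))); // no charset given, so the platform default is used
    String line;
    while ((line = reader.readLine()) != null) {
        input.add(Arrays.asList(line.split(";")));
    }
    reader.close();
    return input;
}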

  • how are you compiling your code ? Eclipse ? command prompt ? Ant ? Maven ? Commented Jul 16, 2013 at 13:28
  • Once you have the input in Strings, they are already characters, not bytes. Commented Jul 16, 2013 at 13:30
  • What is the source of your input? Show your code for that, please. Commented Jul 16, 2013 at 13:31
  • @GaborSch Do you mean that it is too late for such an operation? Commented Jul 16, 2013 at 14:10

1 Answer


First of all, you have your data somewhere in a binary format. For the sake of simplicity, I suppose it comes from an InputStream.

You want to write the output as a UTF-8 String; I suppose it goes to an OutputStream.

I would recommend creating an AutoDetectInputStream:

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

import org.mozilla.universalchardet.UniversalDetector;

public class AutoDetectInputStream extends InputStream {
    private InputStream is;
    private byte[] sampleData = new byte[4096];
    private int sampleLen;
    private int sampleIndex = 0;

    public AutoDetectInputStream(InputStream is) throws IOException {
        this.is = is;
        // pre-read a sample of the data (treat an empty stream as a zero-length sample)
        sampleLen = Math.max(0, is.read(sampleData));
    }

    public Charset getCharset() {
        // detect the charset from the pre-read sample
        UniversalDetector detector = new UniversalDetector(null);
        detector.handleData(sampleData, 0, sampleLen);
        detector.dataEnd();
        String name = detector.getDetectedCharset();
        // fall back to the platform default if nothing could be detected
        return name != null ? Charset.forName(name) : Charset.defaultCharset();
    }

    @Override
    public int read() throws IOException {
        // replay the pre-read sample first, then continue with the wrapped stream
        if (sampleIndex < sampleLen) {
            return sampleData[sampleIndex++] & 0xFF; // avoid sign-extending negative bytes
        }
        return is.read();
    }
}

The second task is quite simple: Java strings are already sequences of characters, so you just need an OutputStreamWriter configured for UTF-8. So, here's your code:

// open input with Detector stream
// we use BufferedReader so we could read lines
InputStream is = new FileInputStream("in.txt");
AutoDetectInputStream detector = new AutoDetectInputStream(is);
Charset charset = detector.getCharset();
// here we can use the charset to decode the bytes into characters
BufferedReader rdr = new BufferedReader(new InputStreamReader(detector, charset));

// open output to write to
OutputStream os = new FileOutputStream("out.txt");
Writer utf8Writer = new OutputStreamWriter(os, Charset.forName("UTF-8"));

// copy the whole file
String line;
while((line = rdr.readLine()) != null) {
    utf8Writer.append(line).append('\n'); // readLine() strips the line break, so add it back
}

// close streams        
rdr.close();
utf8Writer.flush();
utf8Writer.close();

So, finally, you have your whole text file transcoded to UTF-8.

Note that the sample buffer should be big enough to give the UniversalDetector enough data for reliable detection.
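If that single 4 KB sample is not enough (the detector may return null or guess wrong), you can feed it more data before deciding. Below is a minimal sketch of that idea built on the same UniversalDetector API; the class name CharsetSniffer, the sniff() helper and its parameters are made up for illustration, and it relies on BufferedInputStream's mark()/reset() so the caller can re-read the stream from the start afterwards:

import java.io.BufferedInputStream;
import java.io.IOException;
import java.nio.charset.Charset;

import org.mozilla.universalchardet.UniversalDetector;

public final class CharsetSniffer {

    /**
     * Feeds up to maxBytes from the stream into juniversalchardet, then rewinds
     * the stream so it can be read again from the beginning.
     */
    public static Charset sniff(BufferedInputStream in, int maxBytes, Charset fallback)
            throws IOException {
        UniversalDetector detector = new UniversalDetector(null);
        in.mark(maxBytes);                       // remember the start of the stream
        byte[] buf = new byte[4096];
        int total = 0;
        int n;
        while (!detector.isDone()
                && total < maxBytes
                && (n = in.read(buf, 0, Math.min(buf.length, maxBytes - total))) != -1) {
            detector.handleData(buf, 0, n);
            total += n;
        }
        detector.dataEnd();
        in.reset();                              // rewind for the real read

        String name = detector.getDetectedCharset();
        return name != null ? Charset.forName(name) : fallback;
    }
}

You would then wrap the same BufferedInputStream in an InputStreamReader with the sniffed charset and copy the lines exactly as in the code above.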


4 Comments

Works perfectly! Thanks! You are the best! Even more- You are the bestest!
@Pierwola :D :D Thank you, I'm always happy to see if I can help others and they appreciate it :)
it works, but my text was converted to "ћонгол ”лсын ≈р?нхийл?гч “улгар т?рийн 2223 жил". Most letters are correct, a few are wrong. The language is Mongolian. Believe your response :D
@Enxtur It depends on the UniversalDetector. Please check that your input file is properly encoded, what character set is detected, and how many bytes were used to detect the charset. If all are OK, you can try feeding more data, for example, use this.is = new BufferedInputStream(is); -- buffering the input data makes sure you read as many bytes as you can.
