I am writing a function that should detect the charset in use and then convert the data to UTF-8. I am using juniversalchardet, which is a Java port of Mozilla's universalchardet.
This is my code:

private List<List<String>> setProperEncoding(List<List<String>> input) {
    try {

        // Detect used charset
        UniversalDetector detector = new UniversalDetector(null);

        int position = 0;
        while ((position < input.size()) & (!detector.isDone())) {
            String row = null;
            for (String cell : input.get(position)) {
                row += cell;
            }
            byte[] bytes = row.getBytes();
            detector.handleData(bytes, 0, bytes.length);
            position++;
        }
        detector.dataEnd();

        Charset charset = Charset.forName(detector.getDetectedCharset());
        Charset utf8 = Charset.forName("UTF-8");
        System.out.println("Detected charset: " + charset);

        // rewrite input using proper charset
        List<List<String>> newLines = new ArrayList<List<String>>();
        for (List<String> row : input) {
            List<String> newRow = new ArrayList<String>();
            for (String cell : row) {
                //newRow.add(new String(cell.getBytes(charset)));
                ByteBuffer bb = ByteBuffer.wrap(cell.getBytes(charset));
                CharBuffer cb = charset.decode(bb);
                bb = utf8.encode(cb);
                newRow.add(new String(bb.array()));
            }
            newLines.add(newRow);
        }

        return newLines;

    } catch (Exception e) {
        e.printStackTrace();
        return input;
    }
}

My problem is that when I read a file containing characters from, for example, the Polish alphabet, letters like ł, ą, ć and similar are replaced by ? and other strange characters. What am I doing wrong?

EDIT: For compilation I am using Eclipse.

The method parameter is the result of reading a MultipartFile. I just use a FileInputStream to get every line and then split every line by some separator (it is prepared for xls, xlsx and csv files). Nothing special there.
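For context, that reading step amounts to something like the following sketch (the real code is not shown; the separator, the file and the method name here are placeholders):

// Hypothetical sketch of the reading step described above; the separator,
// the file and the method name are placeholders, not the actual code.
private List<List<String>> readLines(File file) throws IOException {
    List<List<String>> input = new ArrayList<List<String>>();
    BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(file))); // no charset given, so the platform default is used
    String line;
    while ((line = reader.readLine()) != null) {
        input.add(Arrays.asList(line.split(";")));
    }
    reader.close();
    return input;
}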

  • how are you compiling your code ? Eclipse ? command prompt ? Ant ? Maven ? Commented Jul 16, 2013 at 13:28
  • Once you have the input in Strings, they are already characters, not bytes. Commented Jul 16, 2013 at 13:30
  • What is the source of your input? Show your code for that, please. Commented Jul 16, 2013 at 13:31
  • @GaborSch Do you mean that it is too late for such an operation? Commented Jul 16, 2013 at 14:10

1 Answer


First of all, you have your data somewhere in a binary format. For the sake of simplicity, I suppose it comes from an InputStream.

You want to write the output as a UTF-8 String; I suppose it goes to an OutputStream.

I would recommend creating an AutoDetectInputStream:

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

import org.mozilla.universalchardet.UniversalDetector;

public class AutoDetectInputStream extends InputStream {
    private InputStream is;
    private byte[] sampleData = new byte[4096];
    private int sampleLen;
    private int sampleIndex = 0;

    public AutoDetectInputStream(InputStream is) throws IOException {
        this.is = is;
        // pre-read a sample of the data (treat an empty stream as a zero-length sample)
        sampleLen = Math.max(0, is.read(sampleData));
    }

    public Charset getCharset() {
        // detect the charset from the pre-read sample
        UniversalDetector detector = new UniversalDetector(null);
        detector.handleData(sampleData, 0, sampleLen);
        detector.dataEnd();
        String name = detector.getDetectedCharset();
        // fall back to the platform default if nothing could be detected
        return name != null ? Charset.forName(name) : Charset.defaultCharset();
    }

    @Override
    public int read() throws IOException {
        // replay the pre-read sample first, then continue with the wrapped stream
        if (sampleIndex < sampleLen) {
            return sampleData[sampleIndex++] & 0xFF; // avoid sign-extending negative bytes
        }
        return is.read();
    }
}

The second task is quite simple: Java strings are already sequences of characters, so you just need an OutputStreamWriter configured for UTF-8. So, here's your code:

// open input with Detector stream
// we use BufferedReader so we could read lines
InputStream is = new FileInputStream("in.txt");
AutoDetectInputStream detector = new AutoDetectInputStream(is);
Charset charset = detector.getCharset();
// here we can use the charset to decode the bytes into characters
BufferedReader rdr = new BufferedReader(new InputStreamReader(detector, charset));

// open output to write to
OutputStream os = new FileOutputStream("out.txt");
Writer utf8Writer = new OutputStreamWriter(os, Charset.forName("UTF-8"));

// copy the whole file
String line;
while((line = rdr.readLine()) != null) {
    utf8Writer.append(line).append('\n'); // readLine() strips the line break, so add it back
}

// close streams        
rdr.close();
utf8Writer.flush();
utf8Writer.close();

So, finally, you have your whole text file transcoded to UTF-8.

Note that the sample buffer should be big enough to give the UniversalDetector enough data for reliable detection.
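If that single 4 KB sample is not enough (the detector may return null or guess wrong), you can feed it more data before deciding. Below is a minimal sketch of that idea built on the same UniversalDetector API; the class name CharsetSniffer, the sniff() helper and its parameters are made up for illustration, and it relies on BufferedInputStream's mark()/reset() so the caller can re-read the stream from the start afterwards:

import java.io.BufferedInputStream;
import java.io.IOException;
import java.nio.charset.Charset;

import org.mozilla.universalchardet.UniversalDetector;

public final class CharsetSniffer {

    /**
     * Feeds up to maxBytes from the stream into juniversalchardet, then rewinds
     * the stream so it can be read again from the beginning.
     */
    public static Charset sniff(BufferedInputStream in, int maxBytes, Charset fallback)
            throws IOException {
        UniversalDetector detector = new UniversalDetector(null);
        in.mark(maxBytes);                       // remember the start of the stream
        byte[] buf = new byte[4096];
        int total = 0;
        int n;
        while (!detector.isDone()
                && total < maxBytes
                && (n = in.read(buf, 0, Math.min(buf.length, maxBytes - total))) != -1) {
            detector.handleData(buf, 0, n);
            total += n;
        }
        detector.dataEnd();
        in.reset();                              // rewind for the real read

        String name = detector.getDetectedCharset();
        return name != null ? Charset.forName(name) : fallback;
    }
}

You would then wrap the same BufferedInputStream in an InputStreamReader with the sniffed charset and copy the lines exactly as in the code above.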


4 Comments

Works perfectly! Thanks! You are the best! Even more- You are the bestest!
@Pierwola :D :D Thank you, I'm always happy to see if I can help others and they appreciate it :)
it works, but my text was converted to "ћонгол ”лсын ≈р?нхийл?гч “улгар т?рийн 2223 жил". Most letters are correct, a few are wrong. The language is Mongolian. Believe your response :D
@Enxtur It depends on the UniversalDetector. Please check that your input file is properly encoded, what character set is detected, and how many bytes were used to detect the charset. If all are OK, you can try feeding more data, for example, use this.is = new BufferedInputStream(is); -- buffering the input data makes sure you read as many bytes as you can.
