Java BASE64 utf8 string decoding

Question

I'm using org.apache.commons.codec.binary.Base64 to decode a string that is UTF-8. Sometimes I get a Base64-encoded string which, after decoding, looks like, for example, ^@kďż˝ďż˝@@. How can I check whether the Base64 is correct, or whether the decoded bytes are a valid UTF-8 string?

To clarify, I'm using:

public static String base64Decode(String str) {
    try {
        return new String(base64Decode(str.getBytes(Constants.UTF_8)), Constants.UTF_8);
    } catch (UnsupportedEncodingException e) {
         ...
    }
}

public static byte[] base64Decode(byte[] byteArray) {
    return Base64.decodeBase64(byteArray);
}

What do you mean by a String is "UTF-8"? A String object doesn't know about encodings and charsets.
– Michael Konietzka, Commented Jan 17, 2011 at 17:46
@Michael Konietzka: I think that is unnecessary nitpicking. Base64 encodes a sequence of bytes. I think the OP is clearly saying that the byte sequence is assumed to be the UTF-8 encoding of a Unicode string, not that a java.lang.String is directly encoded as Base64 (which, as you say, would not make sense).
– finnw, Commented Jan 17, 2011 at 18:33
@finnw Sorry, I don't know how to explain it clearly. I receive a string encoded with Base64 and I want to check whether it is correct. I want to catch the situation where the Base64-encoded string looks like garbage after decoding, because everything I receive should be something like a name.
– terry207, Commented Jan 18, 2011 at 7:56
Maybe I just have to check that the Base64 doesn't contain any spaces or other disallowed characters?
– terry207, Commented Jan 18, 2011 at 10:22

BalusC · Accepted Answer · 2011-01-17 15:04:32Z

32

You should specify the charset when converting a String to byte[] and vice versa.

byte[] bytes = string.getBytes("UTF-8");
// feed bytes to Base64

and

// get bytes from Base64
String string = new String(bytes, "UTF-8");

Otherwise the platform default encoding will be used, which is not necessarily UTF-8.
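
To address the validation part of the question, here is a minimal sketch, assuming Java 8+ and the JDK's own java.util.Base64 rather than Commons Codec. The class and method names (StrictBase64Utf8, decodeStrict, isValid) are hypothetical and not from the original posts. The idea is that the JDK's basic decoder rejects characters outside the Base64 alphabet, and a CharsetDecoder set to CodingErrorAction.REPORT rejects malformed UTF-8 instead of silently substituting U+FFFD the way new String(bytes, charset) does.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper class, for illustration only.
class StrictBase64Utf8 {

    /**
     * Decodes Base64 text and interprets the resulting bytes strictly as UTF-8.
     * Throws IllegalArgumentException if the input is not valid Base64
     * or if the decoded bytes are not well-formed UTF-8.
     */
    static String decodeStrict(String base64) {
        // The basic decoder throws IllegalArgumentException on illegal characters.
        byte[] bytes = Base64.getDecoder().decode(base64);
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Decoded bytes are not valid UTF-8", e);
        }
    }

    /** Returns true only if the input is valid Base64 that decodes to valid UTF-8. */
    static boolean isValid(String base64) {
        try {
            decodeStrict(base64);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "Qu\u00e9 tal?" contains an e with an acute accent (two bytes in UTF-8).
        String encoded = Base64.getEncoder()
                .encodeToString("Qu\u00e9 tal?".getBytes(StandardCharsets.UTF_8));
        System.out.println(isValid(encoded));
        // "//4=" decodes to the bytes 0xFF 0xFE, which cannot occur in UTF-8.
        System.out.println(isValid("//4="));
        // Spaces and '!' are outside the Base64 alphabet.
        System.out.println(isValid("not base64!"));
    }
}
```

Note that passing both checks only means the bytes are well-formed; text such as control characters can still be valid UTF-8, so a check on the expected content (for example, printable characters only) may be needed on top of this.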



6 Comments

finnw Over a year ago

That string does not look like UTF8 misinterpreted as a single-byte encoding. Could it be GB18030 misinterpreted as UTF8?

BalusC Over a year ago

@finnw: The answer indeed assumes that the original string is UTF-8, as explicitly mentioned by the OP. If this is actually not the case, then the problem is to be solved somewhere else.

Michael Konietzka Over a year ago

@BalusC: What do you mean by a String is UTF8? UTF-8 is an encoding.

BalusC Over a year ago

@Michael: the string must have been constructed somehow. For example, if you create the string based on data returned by a Reader, you need to ensure as well that the Reader is reading the source using UTF-8. I do understand your nitpick, though; I should probably have worded my previous comment better, e.g. "source" instead of "string".

Michael Konietzka Over a year ago

For example, "国家标准" is neither UTF-8 nor GB18030; it is just a String object. But it can be encoded with UTF-8 or GB18030, because both encodings can encode all Unicode code points. Of course, the decoding system must use the same character encoding on the bytes as the encoding system. Yes, I am nitpicking on this issue, because the question mentions "a string is utf-8", which needs clarification: there is no such thing as a "UTF-8 String". You can encode a String into a byte array using UTF-8, but then there is just a byte[].
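
The distinction drawn in these comments can be illustrated with a short sketch. The class and method names (StringVersusBytes, reinterpret) are hypothetical. UTF-16BE is used as the second encoding only because every JVM is required to support it; GB18030 availability depends on the runtime. The same String yields different byte arrays depending on the charset, and decoding with a different charset than the one used for encoding produces different text.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class StringVersusBytes {

    // The four characters from the comment, written as escapes so the
    // source file compiles regardless of its own encoding.
    static final String TEXT = "\u56fd\u5bb6\u6807\u51c6";

    /** Encodes text with one charset and decodes the bytes with another. */
    static String reinterpret(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        // Same String, different byte lengths depending on the encoding.
        System.out.println(TEXT.getBytes(StandardCharsets.UTF_8).length);
        System.out.println(TEXT.getBytes(StandardCharsets.UTF_16BE).length);

        // Matching charsets round-trip; mismatched charsets do not.
        System.out.println(TEXT.equals(
                reinterpret(TEXT, StandardCharsets.UTF_8, StandardCharsets.UTF_8)));
        System.out.println(TEXT.equals(
                reinterpret(TEXT, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1)));
    }
}
```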


atiruz · Accepted Answer · 2013-08-29 20:00:26Z

Try this:

var B64 = {
    alphabet: 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=',
    lookup: null,
    ie: /MSIE /.test(navigator.userAgent),
    ieo: /MSIE [67]/.test(navigator.userAgent),
    encode: function (s) {
        var buffer = B64.toUtf8(s),
            position = -1,
            len = buffer.length,
            nan1, nan2, enc = [, , , ];
        if (B64.ie) {
            var result = [];
            while (++position < len) {
                nan1 = buffer[position + 1], nan2 = buffer[position + 2];
                enc[0] = buffer[position] >> 2;
                enc[1] = ((buffer[position] & 3) << 4) | (buffer[++position] >> 4);
                if (isNaN(nan1)) enc[2] = enc[3] = 64;
                else {
                    enc[2] = ((buffer[position] & 15) << 2) | (buffer[++position] >> 6);
                    enc[3] = (isNaN(nan2)) ? 64 : buffer[position] & 63;
                }
                result.push(B64.alphabet[enc[0]], B64.alphabet[enc[1]], B64.alphabet[enc[2]], B64.alphabet[enc[3]]);
            }
            return result.join('');
        } else {
            result = '';
            while (++position < len) {
                nan1 = buffer[position + 1], nan2 = buffer[position + 2];
                enc[0] = buffer[position] >> 2;
                enc[1] = ((buffer[position] & 3) << 4) | (buffer[++position] >> 4);
                if (isNaN(nan1)) enc[2] = enc[3] = 64;
                else {
                    enc[2] = ((buffer[position] & 15) << 2) | (buffer[++position] >> 6);
                    enc[3] = (isNaN(nan2)) ? 64 : buffer[position] & 63;
                }
                result += B64.alphabet[enc[0]] + B64.alphabet[enc[1]] + B64.alphabet[enc[2]] + B64.alphabet[enc[3]];
            }
            return result;
        }
    },
    decode: function (s) {
        var buffer = B64.fromUtf8(s),
            position = 0,
            len = buffer.length,
            result; // declared locally so it does not leak into the global scope
        if (B64.ieo) {
            result = [];
            while (position < len) {
                if (buffer[position] < 128) result.push(String.fromCharCode(buffer[position++]));
                else if (buffer[position] > 191 && buffer[position] < 224) result.push(String.fromCharCode(((buffer[position++] & 31) << 6) | (buffer[position++] & 63)));
                else result.push(String.fromCharCode(((buffer[position++] & 15) << 12) | ((buffer[position++] & 63) << 6) | (buffer[position++] & 63)));
            }
            return result.join('');
        } else {
            result = '';
            while (position < len) {
                if (buffer[position] < 128) result += String.fromCharCode(buffer[position++]);
                else if (buffer[position] > 191 && buffer[position] < 224) result += String.fromCharCode(((buffer[position++] & 31) << 6) | (buffer[position++] & 63));
                else result += String.fromCharCode(((buffer[position++] & 15) << 12) | ((buffer[position++] & 63) << 6) | (buffer[position++] & 63));
            }
            return result;
        }
    },
    toUtf8: function (s) {
        var position = -1,
            len = s.length,
            chr, buffer = [];
        if (/^[\x00-\x7f]*$/.test(s)) while (++position < len)
        buffer.push(s.charCodeAt(position));
        else while (++position < len) {
            chr = s.charCodeAt(position);
            if (chr < 128) buffer.push(chr);
            else if (chr < 2048) buffer.push((chr >> 6) | 192, (chr & 63) | 128);
            else buffer.push((chr >> 12) | 224, ((chr >> 6) & 63) | 128, (chr & 63) | 128);
        }
        return buffer;
    },
    fromUtf8: function (s) {
        var position = -1,
            len, buffer = [],
            enc = [, , , ];
        if (!B64.lookup) {
            len = B64.alphabet.length;
            B64.lookup = {};
            while (++position < len)
            B64.lookup[B64.alphabet[position]] = position;
            position = -1;
        }
        len = s.length;
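        // Stop once the last character has been consumed; otherwise input
        // without '=' padding gains spurious zero bytes at the end.
        while (position < len - 1) {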
            enc[0] = B64.lookup[s.charAt(++position)];
            enc[1] = B64.lookup[s.charAt(++position)];
            buffer.push((enc[0] << 2) | (enc[1] >> 4));
            enc[2] = B64.lookup[s.charAt(++position)];
            if (enc[2] == 64) break;
            buffer.push(((enc[1] & 15) << 4) | (enc[2] >> 2));
            enc[3] = B64.lookup[s.charAt(++position)];
            if (enc[3] == 64) break;
            buffer.push(((enc[2] & 3) << 6) | enc[3]);
        }
        return buffer;
    }
};


This one worked perfectly for me. I understand it got a negative vote because it's a JavaScript answer to a Java question.

bheatcoker · Accepted Answer · 2016-04-14 11:29:50Z

0

I created this method:

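// Note: Base64.decode(byte[], int) and Base64.DEFAULT appear to come from android.util.Base64.
public static String descodificarDeBase64(String stringCondificado){
    try {
        // Decode the Base64 text, then interpret the bytes explicitly as UTF-8
        // instead of relying on the platform default charset.
        return new String(Base64.decode(stringCondificado.getBytes("UTF-8"), Base64.DEFAULT), "UTF-8");
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
        return "";
    }
}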

This way I can decode Base64 that contains Spanish characters such as á, ñ, í, ü.

Example:

descodificarDeBase64("wr9xdcOpIHRhbD8=");

will return: ¿qué tal?


1 Comment

Philip Rego Over a year ago

Base64.DEFAULT is undefined
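
The Base64.DEFAULT constant matches android.util.Base64, which would explain why it is undefined outside Android. A rough equivalent for a standard JVM, as a sketch assuming Java 8+ and java.util.Base64, could look like this. The class name Base64Decoding and the parameter name base64Text are hypothetical; the method name is kept from the answer.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical wrapper class, for illustration only.
class Base64Decoding {

    /** Decodes Base64 text and interprets the decoded bytes as UTF-8. */
    static String descodificarDeBase64(String base64Text) {
        // Throws IllegalArgumentException if base64Text is not valid Base64.
        byte[] decoded = Base64.getDecoder().decode(base64Text);
        // StandardCharsets.UTF_8 avoids the checked UnsupportedEncodingException.
        return new String(decoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(descodificarDeBase64("wr9xdcOpIHRhbD8="));
    }
}
```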
