
I have used this answer to "manually" convert from Unicode code points to UTF-8 code units. The problem is that I need the resulting UTF-8 to be contained in a byte array. How can I do that, using shifting operations wherever possible, to go from a hexadecimal code point to UTF-8?

The code I already have is the following:

 public static void main(String[] args)
   throws UnsupportedEncodingException, CharacterCodingException {

   String st = "ñ";

   for (int i = 0; i < st.length(); i++) {
      int unicode = st.charAt(i);
      codepointToUTF8(unicode);
   }
 }

 public static byte[] codepointToUTF8(int codepoint) {
    byte[] hb = codepointToHexa(codepoint);
    byte[] binaryUtf8 = null;

    if (codepoint <= 0x7F) {
      binaryUtf8 = parseRange(hb, 8);
    } else if (codepoint <= 0x7FF) {
      binaryUtf8 = parseRange(hb, 16);
    } else if (codepoint <= 0xFFFF) {
      binaryUtf8 = parseRange(hb, 24);
    } else if (codepoint <= 0x1FFFFF) {
      binaryUtf8 = parseRange(hb, 32);
    }

    byte[] utf8Codeunits = new byte[hexStr.length()];
    for (int i = 0; i < hexStr.length(); i++) {
      utf8Codeunits[i] = (byte) hexStr.charAt(i);
      System.out.println(utf8Codeunits[i]); // prints 99 51 98 49,
      // which is the same as c3b1, the UTF-8 for ñ
    }

    return binaryUtf8;
  }


  public static byte[] codepointToHexa(int codepoint) {
    int n = codepoint;
    int m;

    List<Byte> list = new ArrayList<>();
    while (n >= 16) {
      m = n % 16;
      n = n / 16;
      list.add((byte) m);
    }
    list.add((byte) n);
    byte[] bytes = new byte[list.size()];
    for (int i = list.size() - 1; i >= 0; i--) {
      bytes[list.size() - i - 1] = list.get(i);
    }

    return bytes;
  }

  private static byte[] parseRange(byte[] hb, int length) {

    byte[] binarybyte = new byte[length];
    boolean[] filled = new boolean[length];

    int index = 0;
    if (length == 8) {
      binarybyte[0] = 0;
      filled[0] = true;
    } else {
      int cont = 0;
      while (cont < length / 8) {
        filled[index] = true;
        binarybyte[index++] = 1;
        cont++;
      }
      binarybyte[index] = 0;
      filled[index] = true;
      index = 8;
      while (index < length) {
        filled[index] = true;
        binarybyte[index++] = 1;
        binarybyte[index] = 0;
        filled[index] = true;
        index += 7;
      }
    }

    byte[] hbbinary = convertHexaArrayToBinaryArray(hb);
    int hbindex = hbbinary.length - 1;

    for (int i = length - 1; i >= 0; i--) {
      if (!filled[i] && hbindex >= 0) {
        // we fill it and advance the iterator
        binarybyte[i] = hbbinary[hbindex];
        hbindex--;
        filled[i] = true;
      } else if (!filled[i]) {
        binarybyte[i] = 0;
        filled[i] = true;
      }
    }
    return binarybyte;
  }

 private static byte[] convertHexaArrayToBinaryArray(byte[] hb) {

    byte[] binaryArray = new byte[hb.length * 4];
    String aux = "";
    for (int i = 0; i < hb.length; i++) {

      aux = Integer.toBinaryString(hb[i]);
      int length = aux.length();
      // toBinaryString doesn't return a 4 bit string, so we fill it with 0s
      // if length is not a multiple of 4
      while (length % 4 != 0) {
        length++;
        aux = "0" + aux;
      }

      for (int j = 0; j < aux.length(); j++) {
        binaryArray[i * 4 + j] = (byte) (aux.charAt(j) - '0');
      }
    }
  
    return binaryArray;
  }

I don't know how to handle bytes properly, so I'm aware that the things I did are probably wrong.

  • Is this homework? You can verify the results using String.getBytes("UTF-8"). And Wikipedia will show the bit patterns 10xxxxxx and such. Masking and shifting are no magic. Commented Jul 13, 2016 at 10:49
  • No, it's not homework. I need the converter for a personal project and I want it to be efficient. I know the bit patterns, since they are in the link I quoted. But I have no idea what to shift (or when to do it) with what I have to get the desired result. Commented Jul 13, 2016 at 11:00
  • ... yeah, it's homework. Wanting to be more "efficient" than proven, tested and readily available JRE methods is kinda ... redundant and smells extremely like an exercise. Probably a college exam - IT students get told that reinventing the wheel is the cool thing to do, nowadays ... which is horrible for their career but it does wonders for useless in-depth knowledge of implementation details. Commented Jul 13, 2016 at 11:47
  • I finished college 2 years ago. You're both wrong, I'm sorry. I just want to learn how things work by coding them myself. Otherwise, I would not learn anything about encodings, just use the existing libraries. But it seems like trying to learn is now being regarded as cheating. Commented Jul 13, 2016 at 12:02
  • No, the word "cheating" doesn't even make sense contextually - reinventing the wheel serves literally no purpose; there is no knowledge to be gained, only experience in useless topics. If you want to acquire knowledge you will need to read papers, test out reference implementations, assemble large groups of libraries and use them properly, or start a career in IT research. The wheel won't teach you anything, it just does its job -- and you won't be able to understand the physics behind wheels just because you disassembled one or two, but you could start inventing a rectangular wheel. Commented Jul 13, 2016 at 12:05

2 Answers


UTF-8 packs the bits of a Unicode code point into the x positions of the following byte patterns:

0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
... (max 6 bytes)

Here the rightmost bit is the least significant bit of the code point.
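As a rough illustration of those patterns, here is a minimal sketch that fills each pattern directly with shifts and masks, one branch per length. The class and method names (`Utf8BranchSketch`, `encode`) are made up for this example. It also assumes the RFC 3629 limit of 4 bytes (U+10FFFF) and rejects surrogates, which is a choice of this sketch rather than something the patterns above require.

```java
final class Utf8BranchSketch {

    /**
     * Encodes a single code point as UTF-8 using one branch per pattern.
     * Assumption of this sketch: only U+0000..U+10FFFF without surrogates is accepted.
     */
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("Not an encodable code point: " + cp);
        }
        if (cp <= 0x7F) {
            // 0xxxxxxx
            return new byte[] { (byte) cp };
        } else if (cp <= 0x7FF) {
            // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))
            };
        } else if (cp <= 0xFFFF) {
            // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        } else {
            // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
    }
}
```

For example, `Utf8BranchSketch.encode(0xF1)` returns the two bytes 0xC3 0xB1, which is the c3b1 the question mentions for ñ.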

import java.io.ByteArrayOutputStream;
import java.util.stream.IntStream;

static byte[] utf8(IntStream codePoints) {
    final ByteArrayOutputStream baos = new ByteArrayOutputStream();
    final byte[] cpBytes = new byte[6]; // 6 bytes are enough for any non-negative int (31 bits)
    codePoints.forEach((cp) -> {
        if (cp < 0) {
            throw new IllegalStateException("No negative code point allowed");
        } else if (cp < 0x80) {
            baos.write(cp);
        } else {
            int bi = 0;
            int lastPrefix = 0xC0;
            int lastMask = 0x1F;
            for (;;) {
                int b = 0x80 | (cp & 0x3F);
                cpBytes[bi] = (byte)b;
                ++bi;
                cp >>= 6;
                if ((cp & ~lastMask) == 0) {
                    cpBytes[bi] = (byte) (lastPrefix | cp);
                    ++bi;
                    break;
                }
                lastPrefix = 0x80 | (lastPrefix >> 1);
                lastMask >>= 1;
            }
            while (bi > 0) {
                --bi;
                baos.write(cpBytes[bi]);
            }
        }
    });
    return baos.toByteArray();
}

Except for 7-bit ASCII, the encoding can be done in a loop.
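One thing to watch when calling a method that takes an IntStream of code points: the question's main loop uses charAt, which returns UTF-16 code units, so characters outside the BMP would arrive as two surrogate halves. A minimal sketch of two ways to get real code points from a String follows. The class and method names are hypothetical; `String.codePoints()`, `codePointAt`, `codePointCount` and `Character.charCount` are standard JDK methods.

```java
final class CodePointIterationSketch {

    /** Returns the code points of s; surrogate pairs are combined into one value. */
    static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    /** The same result with an explicit loop, stepping by 1 or 2 chars per code point. */
    static int[] codePointsByLoop(String s) {
        int[] out = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out[n++] = cp;
            i += Character.charCount(cp); // 2 for a surrogate pair, otherwise 1
        }
        return out;
    }
}
```

With that, a call such as `utf8("ñ".codePoints())` passes code points rather than chars to the encoder.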


4 Comments

So basically, until the last iteration, we make sure to use only the last 6 bits of the codepoint by making the & with 0x3F, then change the first bit to 1 to make the prefix 10 and remove those 6 bits by shifting to the right. In the last iteration, we do the same with the last prefix, which changes from 11000000 to 11100000 to 11110000... in every iteration to make sure we're using the appropriate prefix. Very useful, thank you!
Yes, in a multibyte sequence all continuation bytes are 10xxxxxx.
Note that standard UTF-8 can technically use up to 6 bytes to encode codepoints up to U+7FFFFFFF, but legally can only use up to 4 bytes (Java's Modified UTF-8 can go up to 6 bytes). RFC 3629 restricts the highest codepoint that UTF-8 can legally handle to U+10FFFF, which is the highest codepoint that UTF-16 can physically encode, and is the highest codepoint that Unicode currently defines.
@RemyLebeau Very fine comment. Note that I exclude negative ints ("U+80000000" upwards) and so reserve only 6 bytes. Another rule is that the shortest byte sequence should be used, which the loop condition above ensures. Another modification in Java's Modified UTF-8 concerns '\u0000' in strings: since C/C++ has problems with such byte arrays as C strings, it also encodes this char as two bytes, 0xC0 0x80. See DataOutputStream.
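The Modified UTF-8 behaviour mentioned in these comments can be observed with `DataOutputStream.writeUTF`, which writes a 2-byte length followed by the Modified UTF-8 bytes. The following is only a small sketch for experimenting, and the class and method names are made up for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class ModifiedUtf8Sketch {

    /** Returns what writeUTF emits: a 2-byte length prefix, then Modified UTF-8 bytes. */
    static byte[] writeUtf(String s) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(baos)) {
            out.writeUTF(s);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        return baos.toByteArray();
    }
}
```

`writeUtf("\u0000")` gives 00 02 C0 80, whereas `"\u0000".getBytes(StandardCharsets.UTF_8)` is the single byte 00. A supplementary character is written as two 3-byte sequences, one per surrogate, which is where the 6 bytes come from.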

Here's my take on this:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.stream.IntStream;

public final class StackOverflow_38349372 {

    public static byte[] encodeTo_UTF_8_bytes(final IntStream codePoints) {

        final ByteArrayOutputStream baos = new ByteArrayOutputStream();

        codePoints.forEach(codePoint -> {
            try {
                baos.write(encodeTo_UTF_8_bytes(codePoint));
            }
            catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        return  baos.toByteArray();
    }
    private static byte[] encodeTo_UTF_8_bytes(int codePoint) {
        /*
         * See sun.nio.cs.UTF_8 for Legal UTF-8 Byte Sequences.
         */
        if (codePoint < 0) {
            throw new IllegalStateException("Negative Codepoints are not allowed");
        }
        if (codePoint < 0x80) {
            return new byte[] {(byte) codePoint}; // 1-Byte Codepoints are simple & MUST be excluded here anyway.
        }
        final int    bitCount            = Integer.SIZE - Integer.numberOfLeadingZeros(codePoint);
        final int    utf8byteCount       = (bitCount + 3) / 5;        // Yields incorrect result for 1-Byte Codepoints (which we excluded, above)
        final int    utf8firstBytePrefix = 0x3F_00 >>> utf8byteCount; // 2 to 6 1-bits right-shifted into Low-Order Byte, depending on Byte-Count.

        final byte[] utf8bytes           = new byte[utf8byteCount];

        for (int i=utf8byteCount - 1; i >= 0; i--) { // (fill the Byte Array from right to left)

            if (i == 0) {
                utf8bytes[i] = (byte) (utf8firstBytePrefix | (0x3F  &  codePoint)); // First-Byte Prefix + trailing 6 bits
            } else {
                utf8bytes[i] = (byte) (0x80                | (0x3F  &  codePoint)); // Other-Byte Prefix + trailing 6 bits
            }
            codePoint >>>= 6;  // Shift right to ready the next 6 bits (or, for 1st byte, as many as remain)
        }
        return  utf8bytes;
    }
}
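The byte-count expression `(bitCount + 3) / 5` in the code above is compact but not obvious. As a sanity check, this sketch compares it with the range boundaries implied by the 1-to-6-byte patterns (11, 16, 21, 26 and 31 payload bits). The class and method names are made up for illustration.

```java
final class Utf8LengthSketch {

    /** Byte count via the bit-count formula; 1-byte code points are special-cased. */
    static int lengthByFormula(int cp) {
        if (cp < 0) {
            throw new IllegalArgumentException("Negative code point: " + cp);
        }
        if (cp < 0x80) {
            return 1;
        }
        int bitCount = Integer.SIZE - Integer.numberOfLeadingZeros(cp);
        return (bitCount + 3) / 5;
    }

    /** Byte count via explicit ranges of the original 6-byte UTF-8 scheme. */
    static int lengthByRange(int cp) {
        if (cp < 0) {
            throw new IllegalArgumentException("Negative code point: " + cp);
        }
        if (cp < 0x80) return 1;        //  7 bits
        if (cp < 0x800) return 2;       // 11 bits
        if (cp < 0x10000) return 3;     // 16 bits
        if (cp < 0x200000) return 4;    // 21 bits
        if (cp < 0x4000000) return 5;   // 26 bits
        return 6;                       // 31 bits
    }
}
```

Both methods agree on every boundary value (0x7F, 0x80, 0x7FF, 0x800, and so on up to Integer.MAX_VALUE), which is why the formula works once the 1-byte case is handled separately.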
