Implementing a Character Encoding in Java

Question

I was asked this question during an interview with a famous IT company. They asked me to suggest how a character encoding will be implemented if we have lots of characters & 16 bits of Unicode are not enough. I answered we can implement 64 bit encoding for characters. They said, even it's not enough, to which I suggested to implement a encoding via java BigInteger.

Then they asked the encoding should be such that it only takes the bits that are needed. Like ASCII representation of A is 01000001 , we should not be using the leading 0 because we don't need it and we are wasting memory. I could not give an answer to it. If you could please tell me about how to approach this problem and how it is handled.

You could study how it's handled in the numerous ways of encoding Unicode. — Kayaman
– Kayaman, Commented Aug 3, 2017 at 10:19
Maybe interesting to you: stackoverflow.com/questions/5290182/… — user180100
– user180100, Commented Aug 3, 2017 at 10:31
How is a 64-bit encoding "not enough" when the highest codepoint defined in Unicode fits in 21 bits? Even 16 bits is enough, when used in pairs. — Remy Lebeau
– Remy Lebeau, Commented Aug 5, 2017 at 22:03

Jonathan Rosenne · Accepted Answer · 2017-08-03 10:32:51Z

1

See the Unicode Standard, Chapter 3: "The Unicode Standard supports three character encoding forms: UTF-32, UTF-16, and UTF-8. Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences. The size of the code unit is specified for each encoding form. This section presents the formal definition of each of these encoding forms."

As regards the question on saving bits, this is meaningful only when the text is very large, in which case I would suggest using compression, such as zip. There are solutions in various languages that let you read from and write to a compressed file directly.

answered Aug 3, 2017 at 10:32

Jonathan Rosenne

2,23718 silver badges28 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

Kayaman Over a year ago

It was an interview question which he failed to answer.

Jonathan Rosenne Over a year ago

I have met several experienced programmers who would find themselves in a similar position, so I thought it would be worthwhile to provide my views.

Kayaman Over a year ago

Are you sure they were experienced programmers and not just saying that. Certain people have a tendency to embellish their experience and knowledge.

Jonathan Rosenne Over a year ago

Yes. Character encoding is not a simple matter, especially when the language you wish to encode is not English. It has become simpler with the wide spread use of Unicode, but there are still quite a lot of legacy encodings in common use.

Kayaman Over a year ago

Encoding is indeed not a simple thing (I keep answering questions about it every few weeks). However suggesting BigInteger when 64 bits isn't enough means he failed to understand what the question was all about. I would expect experienced developers to have some understanding of character encoding, even if we nowadays are pretty much on UTF-8 everywhere.

Collectives™ on Stack Overflow

Implementing a Character Encoding in Java

1 Answer 1

5 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

5 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related