3

I was asked this question during an interview with a famous IT company. They asked me to suggest how a character encoding will be implemented if we have lots of characters & 16 bits of Unicode are not enough. I answered we can implement 64 bit encoding for characters. They said, even it's not enough, to which I suggested to implement a encoding via java BigInteger.

Then they asked the encoding should be such that it only takes the bits that are needed. Like ASCII representation of A is 01000001 , we should not be using the leading 0 because we don't need it and we are wasting memory. I could not give an answer to it. If you could please tell me about how to approach this problem and how it is handled.

3
  • 1
    You could study how it's handled in the numerous ways of encoding Unicode. Commented Aug 3, 2017 at 10:19
  • 1
    Maybe interesting to you: stackoverflow.com/questions/5290182/… Commented Aug 3, 2017 at 10:31
  • How is a 64-bit encoding "not enough" when the highest codepoint defined in Unicode fits in 21 bits? Even 16 bits is enough, when used in pairs. Commented Aug 5, 2017 at 22:03

1 Answer 1

1

See the Unicode Standard, Chapter 3: "The Unicode Standard supports three character encoding forms: UTF-32, UTF-16, and UTF-8. Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences. The size of the code unit is specified for each encoding form. This section presents the formal definition of each of these encoding forms."

As regards the question on saving bits, this is meaningful only when the text is very large, in which case I would suggest using compression, such as zip. There are solutions in various languages that let you read from and write to a compressed file directly.

Sign up to request clarification or add additional context in comments.

5 Comments

It was an interview question which he failed to answer.
I have met several experienced programmers who would find themselves in a similar position, so I thought it would be worthwhile to provide my views.
Are you sure they were experienced programmers and not just saying that. Certain people have a tendency to embellish their experience and knowledge.
Yes. Character encoding is not a simple matter, especially when the language you wish to encode is not English. It has become simpler with the wide spread use of Unicode, but there are still quite a lot of legacy encodings in common use.
Encoding is indeed not a simple thing (I keep answering questions about it every few weeks). However suggesting BigInteger when 64 bits isn't enough means he failed to understand what the question was all about. I would expect experienced developers to have some understanding of character encoding, even if we nowadays are pretty much on UTF-8 everywhere.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.