0

I'm looking for a way to convert an alphanumeric string, e.g. "aBcd3f", into a purely numeric representation, and get the shortest possible input string. The valid characters in the input string are a-z, A-Z, 0-9, and the resultant string would be comprised only of digits 0-9.

Since there are 62 valid values for each character in the input string, I can assign values 00-61 to each input character, and covert the 6 input characters into a 12 character numeric string.

But I would like to get something more compact, if possible - e.g. 8-10 digits. Is it possible, and if so, are there any algorithms or functions for doing this in PHP?

Note that this has to be a 2-way function. I also need to be able to go back from the numeric string to the alphanumeric.

I haven't found this question asked on this site. My question is the opposite of this question, as I'm trying to go in the opposite direction.

6
  • this may help you. Commented Aug 30, 2014 at 22:01
  • <shameless plug>I use github.com/rudiedirkx/Xnary for that.</shameless plug> Commented Aug 30, 2014 at 22:35
  • Are all symbols equally probable to occur in a string? Commented Aug 30, 2014 at 22:38
  • @Rudie: The problem with your lib is that it relies on standard PHP math, which means once you get out of integer range (~2.1 billion for 32-bit PHP) it is no longer reliable. Commented Aug 30, 2014 at 22:41
  • @georg: yes, all symbols are equally probable, and there is no interdependence between the symbols Commented Aug 30, 2014 at 22:58

1 Answer 1

2

A decimal digit encodes log2(10) = 3.32 bits of information on average. Alphanumeric data has 62 possible "digits", so each one encodes log2(62) = 5.95 bits of information on average.

This means that converting from alphanumeric to decimal digits only will require approximately 5.95 / 3.32 = 1.79 times more characters in the output than there are in the input. If your output is constrained to 10 characters maximum you can expect it to encode at most 5.58 characters of alphanumeric input, which for practical purposes means just 5. There is no room for maneuvering here; this is cold math.

The manner of converting from one representation to the other is fairly straightforward, because in essence you are simply converting a number from base 62 to base 10 and back. You can tweak the code from this answer of mine only slightly to achieve the aim.

See it in action.

Note that with the (arbitrary) order of digits I picked the "largest" possible input with 5 characters is "ZZZZZ", which encodes to 9 decimal digits. If you expand the input to 6 characters the largest input would be "ZZZZZZ" which would need 11 decimal digits to encode -- more than the limit we imposed, as predicted.

Also note that this analysis assumes every possible input string is as likely to occur as any other, i.e. the input is perfectly random. If this is not the case then the actual information content of the input would be lower than the theoretical maximum and consequently you could take advantage of this with some kind of compression scheme.

Sign up to request clarification or add additional context in comments.

1 Comment

My bad, sorry. Thought any input should become 8-10 digits.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.