Binary Encoding for Arbitrarily Large Integers
(You can watch the video version of this post on YouTube.)
Question: How do you efficiently encode an arbitrarily large integer using bits? And to make things interesting, let’s say I can’t see how long your encoding is: I just start reading and that must tell me when I’m done reading; otherwise you could just write it in binary and we’d be done. Being able to see when to stop reading is actually a pretty good property to have, because then the representation composes well.
We could say a number is really just a string of bits, but if I reserve zero to terminate the string, I only have the symbol 1 left for the content, so that doesn’t really work out so well: we end up writing the number in unary. But it does work.
5 = 111110
14 = 111111111111110
1024 = 11111111111111111111111111111(...)
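Here is a minimal sketch of that unary scheme in Python (the function names `encode_unary` and `decode_unary` are my own); the decoder returns the unread rest of the stream to show how the representation composes.

```python
def encode_unary(n: int) -> str:
    """Write n as n ones followed by a terminating zero."""
    return "1" * n + "0"

def decode_unary(bits: str) -> tuple[int, str]:
    """Count ones up to the terminating zero; return the value and the unread rest."""
    n = 0
    while bits[n] == "1":
        n += 1
    return n, bits[n + 1:]

assert encode_unary(5) == "111110"
assert decode_unary("111110" + "101") == (5, "101")  # the rest of the stream is untouched
```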
If we want some more symbols to work with, we could use 2 bits per digit to encode the number in ternary, with the fourth option being the terminator. With ${\tt 00} = 0_3$, ${\tt 01} = 1_3$, ${\tt 10} = 2_3$, and ${\tt 11}$ as the terminator, we would have the following.
5 = 01 10 11
14 = 01 01 10 11
1024 = 01 01 00 01 10 10 01 11
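A sketch of the base-3-with-terminator scheme along the same lines (again, the names are my own):

```python
DIGIT = {0: "00", 1: "01", 2: "10"}
TERMINATOR = "11"

def encode_base3(n: int) -> str:
    """Write n in ternary, two bits per digit (most significant first), then the terminator."""
    digits = []
    while True:
        digits.append(n % 3)
        n //= 3
        if n == 0:
            break
    return "".join(DIGIT[d] for d in reversed(digits)) + TERMINATOR

def decode_base3(bits: str) -> tuple[int, str]:
    """Read two bits at a time until the terminator; return the value and the unread rest."""
    n, i = 0, 0
    while bits[i:i + 2] != TERMINATOR:
        n = 3 * n + int(bits[i:i + 2], 2)
        i += 2
    return n, bits[i + 2:]

assert encode_base3(5) == "011011"
assert decode_base3(encode_base3(1024) + "0") == (1024, "0")
```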
But let’s do it a little bit differently. We’ll alternate a “keep-going” bit and a content bit, so writing the number $n$ takes 2 bits per actual bit of $n$; up to a constant factor, that’s optimal just from a pigeonhole argument.
5 = 11 10 01
14 = 11 11 11 00
1024 = 11 10 10 10 10 10 10 10 10 10 00
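As a sketch, the keep-going scheme could be written like this, with the keep-going bit first in each pair, matching the examples above:

```python
def encode_pairs(n: int) -> str:
    """For each bit of n (most significant first), write a keep-going bit and the content bit.
    The keep-going bit is 1 on every pair except the last."""
    content = bin(n)[2:]
    return "".join("1" + b for b in content[:-1]) + "0" + content[-1]

def decode_pairs(bits: str) -> tuple[int, str]:
    """Read pairs until a keep-going bit of 0; the collected content bits are n in binary."""
    content, i = "", 0
    while True:
        keep = bits[i]
        content += bits[i + 1]
        i += 2
        if keep == "0":
            return int(content, 2), bits[i:]

assert encode_pairs(5) == "111001"
assert decode_pairs(encode_pairs(14) + "01") == (14, "01")
```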
But look at this: if we shuffle the bits around a … bit … we can put the second bit of each pair together, which gives the binary encoding of $n$. In front of that we can put the first bit of each pair, which – if you think about it – encodes the length of the binary encoding in unary! (Actually the length minus one, with a terminating zero, but that works out quite well, because even the number zero requires one bit.)
$ \text{\tt[6 bits] }\quad\quad\quad 5\quad= \overbrace{{\tt 110}}^\text{length (unary)}\quad \overbrace{\tt 101}^{n \text{ (binary)}} $
$ \text{\tt[8 bits] }\quad\quad\quad 14\quad=\quad {\tt 111}_1{\tt 0} \quad {\tt 1110}_2 $
$ \text{\tt[22 bits]}\quad\quad\quad 1024\quad=\quad {\tt 1111111111}_1{\tt 0} \quad {\tt 10000000000}_2 $
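This layout (unary length, then the value in binary) is a close relative of the Elias gamma code. A sketch under the same assumptions as before:

```python
def encode_gamma(n: int) -> str:
    """Write (length - 1) ones, a terminating zero, then n in binary."""
    content = bin(n)[2:]
    return "1" * (len(content) - 1) + "0" + content

def decode_gamma(bits: str) -> tuple[int, str]:
    """Count leading ones to learn the length, skip the zero, then read that many content bits."""
    k = 0
    while bits[k] == "1":
        k += 1
    length = k + 1                       # we wrote length - 1 in unary
    content = bits[k + 1 : k + 1 + length]
    return int(content, 2), bits[k + 1 + length:]

assert encode_gamma(5) == "110101"
assert decode_gamma(encode_gamma(1024)) == (1024, "")
```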
If we’re going to try to save some constant number of bits, we could for example observe that the first digit of the binary part is always 1 (otherwise we should have written a smaller length), so we don’t actually need to write it.¹ But wait a minute, why are we still writing the length in unary? Let’s repeat our idea!
In unary, write how many bits it takes to describe the length of the actual number, then write that length in binary, then write the actual number in binary. This adds up to $\log(n) + 2\log\log(n)$ bits, which is smaller than $2\log(n)$ if $n$ is large enough.
$ \text{\tt[7 bits] }\quad\quad\quad 5\quad= \overbrace{{\tt 10}}^\text{length of length (unary)}\quad \overbrace{{\tt 11}}^\text{length (binary)}\quad \overbrace{\tt 101}^{n \text{ (binary)}} $
$ \text{\tt[10 bits]}\quad\quad\quad 14\quad=\quad {\tt 11}_1{\tt0}\quad {\tt 100}_2\quad {\tt 1110}_2 $
$ \text{\tt[19 bits]}\quad\quad\quad 1024\quad=\quad {\tt 111}_1{\tt 0}\quad {\tt 1011}_2\quad {\tt 10000000000}_2 $
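And a sketch of the repeated idea (a close relative of the Elias delta code): unary length-of-length, binary length, binary value.

```python
def encode_delta(n: int) -> str:
    """Write (length-of-length - 1) ones and a zero, then the length in binary, then n in binary."""
    content = bin(n)[2:]
    length = bin(len(content))[2:]
    return "1" * (len(length) - 1) + "0" + length + content

def decode_delta(bits: str) -> tuple[int, str]:
    """Unary gives the size of the length field; the length field gives the size of the value."""
    k = 0
    while bits[k] == "1":
        k += 1
    len_of_len = k + 1
    i = k + 1
    length = int(bits[i : i + len_of_len], 2)
    i += len_of_len
    return int(bits[i : i + length], 2), bits[i + length:]

assert encode_delta(5) == "1011101"
assert decode_delta(encode_delta(14)) == (14, "")
```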
You see where this is going, of course: we’re still writing the length of the length in unary, so we could repeat the trick once more. How that works out, you can look up if you ever need to.
---

1. Since we actually write “length minus one” instead of the length, you need a special case to handle zero. ↩︎