let's talk about floating-point numbers. here's a view into the mind of your computer, specifically the part that handles 32-bit floating-point numbers:
inspired by float toy
so what's a "32-bit floating-point number"? well, there are 32 numbered checkboxes: these are the 32 bits of your float32 number. a bit is a "binary digit" that can be one or zero; a checkmark means one, and the lack thereof means zero. these are the "ones and zeroes inside your computer" you might have heard about.
the magenta row has 8 bits and the cyan row has 23 bits. these contain unsigned integers. the way you "read" these integers is similar to how you read normal (decimal) numbers, except we're in binary/base-2 ("we have 2 digits") instead of decimal/base-10 ("we have 10 digits"). and something you already know, maybe without thinking about it, is how to count: to find the next number (x+1) you add one to the right-most digit; if it's already the highest possible digit (9 in decimal, 1 in binary), it "overflows", and you have to also add 1 to its "neighbor to the left", which itself can overflow, and so on. examples in decimal are 9+1=10 and 99+1=100. in binary we have 1+1=10 and 11+1=100 (binary 10 is decimal 2, and binary 100 is decimal 4). try the increment/decrement arrows in the exponent/significand number inputs above and observe how the bits "react".
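if you want to poke at this yourself, here's a tiny JS snippet (0b... is binary-literal syntax, and toString(2) prints a number in binary):

console.log(0b11 + 1);               // 4
console.log((0b11 + 1).toString(2)); // "100", ie. binary 11 + 1 = binary 100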
informally speaking, a floating-point number is just "a bunch of digits" plus "a position for the point" plus "whether it's positive or negative". in decimal you could consider a bunch of digits like:
1234

and then decide to place the point somewhere:

12.34

and maybe even "flip the sign bit":

-12.34

floating-point numbers even let you place the point "outside" of the digits:

.    1234    or: 0.0001234

and

1234    .    or: 1234000.0

so we generally have something like:

real_value = normalized_value * (base**point_position)

or for base-10 and base-2 respectively:

real_value = normalized_value * (10**point_position)
real_value = normalized_value * (2**point_position)

which is also somewhat related to "scientific notation" where you have a value and a multiplier (the base to the power of an exponent):

1234      = 1.234 * (10**3)
1234000   = 1.234 * (10**6)
0.0001234 = 1.234 * (10**-4)
and let's briefly interrupt the flow and cover the different notations for exponentiation:

- 2ˣ, 2^x, 2**x and exp2(x) are the same
- xʸ, x^y, x**y and pow(x,y) are the same

sorry for the mess! xʸ is "math"; x^y is used sometimes but we avoid it here since ^ is the XOR operator in C/JS; x**y is a syntax found in JS and Python. exp2(x) is the typical math library function name for 2ˣ, eg. C has exp2()/exp2f(), Python has math.exp2(). JS doesn't have Math.exp2(), but you can do 2**x, Math.pow(2,x) and Math.exp(x*Math.log(2)) instead.
"normalized" means "designed to fit within a range". for instance, in scientific notation, you typically use values between 1.0 and 9.999… as the "normalized" value, and let the exponent do the rest.
but enough of this "general theory". let's stick to binary and explore how float32s really work (we're gonna get low-level):
sign * (1 + significand*(2**-23)) * (2 ** (exp-127))

notice how the significand represents a normalized number between 1 and 2. another way i've heard this explained: imagine there's a 24th bit in the significand that's always 1.
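as a quick worked example (values picked by me, not from the toy): sign=+1, exp=128 and significand=0x600000 (the top two significand bits set, ie. 0.75 after scaling by 2**-23) gives:

1 * (1 + 0x600000*(2**-23)) * (2 ** (128-127)) = 1 * 1.75 * 2 = 3.5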
when the exponent bits are all zero, the number is instead "denormal" (aka "subnormal") and the formula becomes:

sign * (significand*(2**-23)) * (2 ** -126)

notice how the significand now represents a number between 0 and 1.
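for example, the smallest positive float32 is the denormal with only the lowest significand bit set:

1 * (1*(2**-23)) * (2**-126) = 2**-149 ≈ 1.4e-45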
now we know (nearly) everything about how float32 numbers work! but there's a practical problem: how do we even access the 32 bits? these are usually "hidden" from you because the "floating-point semantics" are the entire reason for using them in the first place; to be able to write 1.1*42.42 and let your hardware deal with the "underlying magic". so how do we "crack this safe open"?
it's done differently in JS and C. JS first:
function u32_to_f32(u) {
  const buf = new ArrayBuffer(4);    // get 4 bytes (32 bits) of memory
  const f32 = new Float32Array(buf); // float32 view into buf
  const u32 = new Uint32Array(buf);  // uint32 view into buf
  u32[0] = u;    // set the memory as u32...
  return f32[0]; // ...and read back as f32
}

function f32_to_u32(f) {
  const buf = new ArrayBuffer(4);    // get 4 bytes (32 bits) of memory
  const f32 = new Float32Array(buf); // float32 view into buf
  const u32 = new Uint32Array(buf);  // uint32 view into buf
  f32[0] = f;    // set the memory as f32...
  return u32[0]; // ...and read back as u32
}
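to sanity-check these, here's a tiny usage snippet; the expected bit patterns follow directly from the formulas above (1.0 has sign=0, exp=127, significand=0):

console.log(f32_to_u32(1.0).toString(16));  // "3f800000" (sign=0, exp=127, significand=0)
console.log(f32_to_u32(2.0).toString(16));  // "40000000" (exp=128)
console.log(f32_to_u32(-1.0).toString(16)); // "bf800000" (sign bit set)
console.log(u32_to_f32(0x3f800000));        // 1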
in C we do this:
static inline float u32_to_f32(unsigned u) {
    union { float f; unsigned u; } xx;
    // the "union" defines xx as a value that can be
    // "float" (f32) and "unsigned" (u32) at the same time
    xx.u = u;    // set memory as u32...
    return xx.f; // ...and read back as f32
}

static inline unsigned f32_to_u32(float f) {
    union { float f; unsigned u; } xx;
    xx.f = f;    // set memory as f32...
    return xx.u; // ...and read back as u32
}
but let's get back to plotting. something that hadn't occurred to me before i started diving into all this is that floating-point numbers are actually "neatly ordered", at least if they're positive: if a value x is larger than a value y, then f32_to_u32(x) is also larger than f32_to_u32(y). let us explore what that means by simply interpolating between two large integers and converting the result back with u32_to_f32().
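the plot above does this interactively; here's a minimal non-interactive sketch (the name exp2_lerp is mine). for integer k, f32_to_u32(2**k) is (k+127)<<23, so lerping linearly between those bit patterns amounts to:

function exp2_lerp(x) {
  // (x+127)*2**23 passes through the bit pattern of 2**k at every integer k
  return u32_to_f32(Math.floor((x + 127) * (1 << 23)));
}
console.log(exp2_lerp(0));   // 1
console.log(exp2_lerp(3));   // 8
console.log(exp2_lerp(0.5)); // 1.5, vs the true exp2(0.5) ~= 1.4142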
i cheated a bit since we already have a framework for comparing functions, and chose a lerp (linear interpolation) that matches 2**x (aka exp2(x)). in an earlier post we wanted to replace exp2(x) and now it looks like we're nearly there, almost without doing any work?! max error seems to be about .09 around x=0. not bad for doing "no work".
let's see what the exponent looks like alone, without any significand bits. we reuse the previous code, but add a line that "masks out" the 23 least significant bits, which is another way of saying, "setting them to zero". bit-hackery was promised:
we won't go too deep into bit-hackery, but:
x = x & ~((1<<23)-1);first we have a power-of-two:
1<<23that minus one gives us a value with all of the lowest 23 bits set:
(1<<23)-1the ~/NOT operator inverts these bits, making every bit except the lower 23 bits set:
~((1<<23)-1)the &/AND operator does a bit-by-bit comparison; the per-bit output is only 1 if both inputs are 1. ie. it clears the lower 23 bits:
x & ~((1<<23)-1)
now why is all this useful? sure, we almost solved the exp2(x) problem, but how do we improve it? well, look at the "almost accidental" exp2(x) approximations above; all we need is to somehow make up for the difference between good(x) and bad(x).
let's look at the lerp-hack version (without masking) of exp2(x). trying to convert the bit-hackery back into math, we get:
(2**Math.floor(x)) * (1 + (x-Math.floor(x)))

go ahead and try replacing good(x) with this expression to see that they match pretty well (max error of about 1e-7 around x=0). this also lets us define what the error is:
err = (2**x) - (2**Math.floor(x)) * (1 + (x-Math.floor(x)))
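as a quick sanity check at x=0.5 (plain arithmetic, just to tie this back to the plot): err = 2**0.5 - (2**0)*(1+0.5) = 1.41421... - 1.5 ≈ -0.086, which matches the "about .09" max error we saw above.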
why are we back in math land? well, recall we need to be able to differentiate a function in order to construct a taylor series for it. bit hackery can't be differentiated, but the above definition can. well, almost. it uses floor(x) which has discontinuities, and taylor series don't like them.
but we can fix that. have a look at how the exp2 function works.
exp2(x+1) = exp2(x)*2

or generally:

exp2(x+k) = exp2(x)*exp2(k)

in english it says: "going k steps to the right on the curve gives a result exp2(k) times larger than before, regardless of where you are on the curve". the "regardless" part is important because it means that exp2(x) is "fractal"; it "repeats itself everywhere".
what it really means is that you only need to approximate exp2(x) between the discontinuities to reconstruct the rest. try this in the stairstep version above; change the return statement to:
return u32_to_f32(xu) * (2**(x-Math.floor(x)))

now the error completely disappears! of course we can't have self-referential solutions like this, but it tells us what the solution looks like: we "just" have to approximate exp2(x) between 0 and 1 (recall that taylor series are most accurate around the "anchor point" of your choice).
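for reference, a self-contained sketch of what the stairstep version plus this fix could look like (the real code lives in the toy above; the lerp expression is the assumption from my earlier sketch):

function stairstep_exp2(x) {
  let xu = Math.floor((x + 127) * (1 << 23));       // the lerp-hack bits
  xu = xu & ~((1 << 23) - 1);                       // mask: keeps sign+exponent, ie. 2**Math.floor(x)
  return u32_to_f32(xu) * (2**(x - Math.floor(x))); // the self-referential fix
}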
we can also fix the smoother version by replacing the return statement with two lines (recall the unmasked version computes (2**Math.floor(x))*(1+xf), so multiplying by (2**xf)/(1+xf) leaves exactly 2**x):

const xf = x-Math.floor(x);
return u32_to_f32(xu) * ((2**xf)/(1+xf));
now, depending on which version we want to fix, we only need to be able to differentiate one of these functions for x between 0 and 1:
f(x) = 2**x

or

f(x) = (2**x) / (1+x)
the stairstep one certainly looks easier. being a little rusty in calculus i asked maxima, which is a "computer algebra system" that knows how to differentiate. the simple 2**x one gives us:
(%i14) diff(2^x,x,0);
(%o14) 2^x
(%i15) diff(2^x,x,1);
(%o15) 2^x*log(2)
(%i16) diff(2^x,x,2);
(%o16) 2^x*log(2)^2
(%i17) diff(2^x,x,3);
(%o17) 2^x*log(2)^3
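in other words, the n-th derivative of 2**x is just 2**x * log(2)**n, so (plugging into the standard taylor formula, for later) the series around an anchor point a becomes:

2**x ≈ sum for n=0..order of (2**a * log(2)**n / n!) * (x-a)**n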
the maxima syntax is diff(expr,var,order). expr is the expression to differentiate (note that maxima uses 2^x syntax, not 2**x). var is the input variable (x for f(x)). order is the order of the derivative, 0 being 0th order (itself), 1 being 1st derivative, etc. the more complicated (2**x) / (1+x) gives us:
(%i18) diff((2^x)/(1+x),x,0);
(%o18) 2^x/(x+1)
(%i19) diff((2^x)/(1+x),x,1);
(%o19) 2^x*log(2)/(x+1) - 2^x/(x+1)^2
(%i20) diff((2^x)/(1+x),x,2);
(%o20) 2^x*log(2)^2/(x+1) - 2^(x+1)*log(2)/(x+1)^2 + 2^(x+1)/(x+1)^3
(%i21) diff((2^x)/(1+x),x,3);
(%o21) 2^x*log(2)^3/(x+1) - 2^(x+1)*log(2)^2/(x+1)^2 - 2^x*log(2)^2/(x+1)^2
       + 2^(x+2)*log(2)/(x+1)^3 + 2^(x+1)*log(2)/(x+1)^3 - 3*2^(x+1)/(x+1)^4
uh-oh, something's exploding. it's not necessarily a "real problem" because this math would still only be used to pre-calculate polynomial coefficients (constants at run-time). but you need to somehow translate the above "differentiation pattern" into JS code in order to make a tool that supports arbitrary a and order values (like the taylor series toy from last time). so the simple one it is! here's everything put together:
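(the interactive version is above; here's my own minimal sketch of the same idea, assuming the u32_to_f32() helper and the derivative pattern we just derived. names and defaults are mine, not the toy's.)

function exp2_approx(x, order = 3, a = 0.5) {
  const xi = Math.floor(x);
  const xf = x - xi; // fractional part, in [0,1)
  // taylor series for 2**t around t=a: the n-th derivative of 2**t is
  // 2**t * log(2)**n, so the n-th coefficient is 2**a * log(2)**n / n!
  let sum = 0;
  let term = 2 ** a;
  for (let n = 0; n <= order; n++) {
    sum += term * (xf - a) ** n;
    term *= Math.LN2 / (n + 1);
  }
  // bit-hack for the integer part: build the exponent bits of 2**xi directly
  return u32_to_f32((xi + 127) << 23) * sum;
}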
we get nasty discontinuities at x=0,1,2... but just like last time we can try "auto-calibrating" it. although, since a=0.5 seems to work better this time, the method is slightly different: insert this at the bottom:
print(1/taylor(0))
print(2/taylor(1))

and grab the results. for order=3, a=0.5, i get:

1.000795077867169
1.0004565267924628

and then fix the output by lerping:

const c0 = 1.000795077867169;
const c1 = 1.0004565267924628;
return sum * (c0 + (c1-c0)*x);

max error isn't better, but the discontinuities are gone.
combining everything, order=4, auto-calibrated:
the C version of this code is about 18% faster than the math.h exp2f(x). not bad!
also, don't worry that "max error" seems to go up as x goes up. this is more a problem with our tool: with the previous sin(x) stuff it made sense to calculate error as abs(good(x)-bad(x)), because sin(x) always stays around y±1, but exp2(x) just keeps rising, so good(x)/bad(x) would probably be a better starting point. you can also look at x=20 and see that max error is around 40, which sounds like a lot, but the y-values there are around 1 million, so an error of 40 isn't much.
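(a sketch of what that relative metric could look like, with hypothetical good()/bad() helpers standing in for the tool's functions:)

function rel_err(x) {
  // ~0 when bad(x) is accurate, regardless of how big exp2(x) itself gets
  return Math.abs(good(x) / bad(x) - 1);
}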