Floating-point numbers
- To store reals of very different sizes, computers use floating-point — binary scientific notation.
- It has two parts, and understanding them explains some famous bugs.
- Let's see the format, normalisation, and why 0.1 + 0.2 ≠ 0.3.
The format
- A floating-point number has a mantissa (the significant digits) and an exponent (the power of 2). Both are stored as two's complement:
- The mantissa is a fixed-point fraction (binary point after the sign bit).
- Worked example: mantissa 0.1010000 $= \tfrac12 + \tfrac18 = 0.625$; with exponent 2, the value is $0.625 \times 2^2 = 2.5$.
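The worked example above can be checked with a short Python sketch. The function names `twos_complement` and `decode` are made up for illustration, and the 4-bit exponent width is an assumption (the text gives the exponent only as 2).

```python
def twos_complement(bits: str) -> int:
    """Interpret a bit string as a two's complement integer."""
    value = int(bits, 2)
    if bits[0] == "1":  # sign bit set: subtract 2^n to get the negative value
        value -= 1 << len(bits)
    return value


def decode(mantissa_bits: str, exponent_bits: str) -> float:
    """Return mantissa x 2^exponent, with the binary point after the mantissa's sign bit."""
    # Dividing by 2^(n-1) places the binary point immediately after the sign bit.
    mantissa = twos_complement(mantissa_bits) / (1 << (len(mantissa_bits) - 1))
    exponent = twos_complement(exponent_bits)
    return mantissa * 2 ** exponent


# Mantissa 0.1010000 and exponent 2 (written here as the 4-bit pattern 0010)
print(decode("01010000", "0010"))  # 2.5
```

A negative exponent scales the value down instead: `decode("01000000", "1111")` is 0.5 × 2^-1 = 0.25.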
A floating-point number is stored as:
value = mantissa × 2^exponent — binary scientific notation, with both parts stored in two's complement.
The mantissa of a floating-point number holds:
The mantissa is the significant digits; the exponent is the power of 2 to scale by.
Normalisation
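- A number is normalised when the first significant bit sits immediately after the binary point (no wasted leading bits). In two's complement, a positive normalised mantissa therefore starts 0.1 and a negative one starts 1.0.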
- This maximises precision — every mantissa bit then carries information.
- To normalise, shift the mantissa and adjust the exponent until the leading bit is in place; the value is unchanged.
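The shift-and-adjust procedure can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name `normalise` and the 8-bit mantissa width are assumptions, the mantissa is held as a signed integer m representing m / 2^(bits-1), and the negative branch assumes the usual two's complement convention that a normalised negative mantissa starts 1.0. Exponent overflow is not checked.

```python
def normalise(mantissa: int, exponent: int, bits: int = 8) -> tuple[int, int]:
    """Shift the mantissa left until it is normalised, lowering the exponent to match."""
    if mantissa == 0:
        return 0, 0  # zero has no significant bit to move
    half = 1 << (bits - 2)  # integer form of 0.5 for this mantissa width
    # Unnormalised means -0.5 <= mantissa < 0.5, so one more left shift still fits.
    while -half <= mantissa < half:
        mantissa *= 2   # shift the mantissa one place left
        exponent -= 1   # compensate so the overall value is unchanged
    return mantissa, exponent


# 0.0010100 x 2^4 (integer mantissa 20) becomes 0.1010000 x 2^2 (integer mantissa 80)
print(normalise(20, 4))  # (80, 2)
```

Both forms represent 2.5: 20/128 × 2^4 and 80/128 × 2^2 are the same value, but the normalised one leaves no leading zeros in the mantissa.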
Normalising a floating-point number:
Normalisation shifts the mantissa so the first significant bit follows the point — every bit then carries information, and the value is unchanged.
Rounding errors
- Many denary reals can't be stored exactly in binary: 0.1 is a repeating binary fraction, so it's truncated.
- Consequences: rounding errors build up (0.1 + 0.2 ≠ exactly 0.3); never test x = 0.3, use ABS(x - 0.3) < 1e-9 instead; subtracting nearly-equal values loses precision.
- For exact needs (currency), use fixed-point or BCD instead.
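These effects are easy to reproduce. The sketch below uses Python, whose floats follow the IEEE 754 double format and not the two's complement layout described earlier, but the rounding behaviour is the same in kind. The tolerance 1e-9 is taken from the text; the variable names are illustrative.

```python
import math

total = 0.1 + 0.2

print(total == 0.3)              # False: exact equality fails
print(abs(total - 0.3) < 1e-9)   # True: compare with a tolerance instead
print(math.isclose(total, 0.3))  # True: standard-library tolerance check

# Errors build up when the same inexact value is added repeatedly
running = 0.0
for _ in range(10):
    running += 0.1
print(running == 1.0)            # False
```

A fixed tolerance such as 1e-9 suits values near 1; for very large or very small numbers a relative tolerance, which is what `math.isclose` uses by default, is usually the safer choice.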
Why does 0.1 + 0.2 not give exactly 0.3 on a computer?
0.1 is a repeating binary fraction, so it is approximated — the small errors add up.
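As a worked illustration of that answer, 0.1 expands to a binary fraction in which the block 0011 repeats forever:

$$0.1_{10} = 0.0\overline{0011}_2 = 0.000110011001100\ldots_2 = \tfrac{3}{32}\left(1 + \tfrac{1}{16} + \tfrac{1}{16^2} + \cdots\right)$$

Assuming, purely for illustration, an 8-bit normalised mantissa (a width the text does not fix), the stored pattern would be cut off after seven fraction bits:

$$0.1100110_2 \times 2^{-3} = 0.796875 \times \tfrac18 = 0.099609375$$

The stored value is close to 0.1 but not equal to it, and that gap is the rounding error.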
For exact money calculations you should use:
Floating-point rounding errors are unacceptable for currency; fixed-point or BCD store decimal values exactly.
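A minimal sketch of the exact alternatives, with illustrative variable names. The first part is fixed-point in its simplest form: money held as a whole number of pence. The second uses Python's `decimal` module, which is not literally BCD but, like BCD, works with base-10 digits and so stores values such as 0.10 exactly.

```python
from decimal import Decimal

# Fixed-point: keep money as an integer number of pence
total_pence = 10 + 20
print(total_pence == 30)  # True

# Decimal arithmetic works in base 10, so these values are stored exactly
total = Decimal("0.10") + Decimal("0.20")
print(total == Decimal("0.30"))  # True
```

The `Decimal` values are built from strings on purpose: `Decimal(0.1)` would start from the already-rounded binary float and carry its error along.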
You've got it
- floating-point: $\text{value} = \text{mantissa} \times 2^{\text{exponent}}$ (both two's complement)
- mantissa = significant digits; exponent = power of 2
- normalisation puts the first significant bit just after the point → maximum precision
- reals can't always be stored exactly → rounding errors; compare with a tolerance; use fixed-point/BCD for currency