Expand description
Floating point division routines.
This module documentation gives an overview of the method used. More documentation is inline.
§Relevant notation
m_a
: the mantissa ofa
, in base 2p_a
: the exponent ofa
, in base 2. I.e.a = m_a * 2^p_a
uqN
(e.g.uq1
): this refers to Q notation for fixed-point numbers. UQ1.31 is an unsigned fixed-point number with 1 integral bit, and 31 decimal bits. AuqN
variable of typeuM
will have N bits of integer and M-N bits of fraction.hw
: half width, i.e. forf64
this will be au32
.x
is the best estimate of1/m_b
§Method Overview
Division routines must solve for a / b
, which is res = m_a*2^p_a / m_b*2^p_b
. The basic
process is as follows:
- Rearange the exponent and significand to simplify the operations:
res = (m_a / m_b) * 2^{p_a - p_b}
. - Check for early exits (infinity, zero, etc).
- If
a
orb
are subnormal, normalize by shifting the mantissa and adjusting the exponent. - Set the implicit bit so math is correct.
- Shift mantissa significant digits (with implicit bit) fully left such that fixed-point UQ1 or UQ0 numbers can be used for mantissa math. These will have greater precision than the actual mantissa, which is important for correct rounding.
- Calculate the reciprocal of
m_b
,x
. - Use the reciprocal to multiply rather than divide:
res = m_a * x_b * 2^{p_a - p_b}
. - Reapply rounding.
§Reciprocal calculation
Calculating the reciprocal is the most complicated part of this process. It uses the Newton-Raphson method, which picks an initial estimation (of the reciprocal) and performs a number of iterations to increase its precision.
In general, Newton’s method takes the following form:
`x_n` is a guess or the result of a previous iteration. Increasing `n` converges to the
desired result.
The result approaches a zero of `f(x)` by applying a correction to the previous gues.
x_{n+1} = x_n - f(x_n) / f'(x_n)
Applying this to find the reciprocal:
1 / x = b
Rearrange so we can solve by finding a zero
0 = (1 / x) - b = f(x)
f'(x) = -x^{-2}
x_{n+1} = 2*x_n - b*x_n^2
This is a process that can be repeated to calculate the reciprocal with enough precision to achieve a correctly rounded result for the overall division operation. The maximum required number of iterations is known since precision doubles with each iteration.
§Half-width operations
Calculating the reciprocal requires widening multiplication and performing arithmetic on the
results, meaning that emulated integer arithmetic on u128
(for f64
) and u256
(for f128
)
gets used instead of native math.
To make this more efficient, all but the final operation can be computed using half-width
integers. For example, rather than computing four iterations using 128-bit integers for f64
,
we can instead perform three iterations using native 64-bit integers and only one final
iteration using the full 128 bits.
This works because of precision doubling. Some leeway is allowed here because the fixed-point number has more bits than the final mantissa will.
Modules§
- __
divdf3 🔒 - __
divdf3vfp 🔒 - __
divsf3 🔒 - __
divsf3vfp 🔒 - __
divtf3 🔒
Functions§
- __
divdf3 - __
divsf3 - __
divtf3 - c_hw 🔒
- The value of
C
adjusted to half width. - div 🔒
- get_
iterations 🔒 - Calculate the number of iterations required for a float type’s precision.
- next_
guess 🔒 - Perform one iteration at any width to approach
1/b
, given previous guessx
. Returns the nextx
as a UQ0 number. - reciprocal_
precision 🔒 u_n
for different precisions (with N-1 half-width iterations).