Module int_to_float

Conversions from integers to floats.

The algorithm is explained here: https://blog.m-ou.se/floats/. It roughly does the following:

  • Calculate a base mantissa by shifting the integer into mantissa position. This gives us a mantissa with the implicit bit set!
  • Figure out if rounding needs to occur by classifying the bits that are to be truncated. Some patterns are used to simplify this. Adjust the mantissa with the result if needed.
  • Calculate the exponent from the base-2 logarithm of i, which is derived from the number of leading zeros. Subtract one.
  • Shift the exponent and add the mantissa to create the final representation. Subtracting one from the exponent (above) accounts for the implicit bit being explicitly set in the mantissa: adding the mantissa carries that bit into the exponent field.
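
The four steps can be sketched for a single concrete pair of types. The following is a minimal, non-generic illustration for u32 to f32, not the module's actual code: the function name `sketch_u32_to_f32_bits` and the hard-coded binary32 constants are assumptions made for this example, and the branchless rounding expression is one possible way to implement round-to-nearest, ties-to-even.

```rust
/// Hypothetical sketch of a `u32` -> `f32` conversion that returns raw bits.
/// The constants are those of IEEE 754 binary32.
fn sketch_u32_to_f32_bits(i: u32) -> u32 {
    const SIG_BITS: u32 = 23; // explicitly stored mantissa bits
    const EXP_BITS: u32 = 8;
    const EXP_BIAS: u32 = 127;

    if i == 0 {
        return 0; // zero has no leading one bit to normalize
    }

    let n = i.leading_zeros();
    let i_m = i << n; // left-aligned: the leading one sits in bit 31

    // Step 1: base mantissa, with the implicit bit landing in bit 23.
    let m_base = i_m >> EXP_BITS;

    // Step 2: the 8 bits that were shifted out, moved to the top of the word.
    let dropped = i_m << (SIG_BITS + 1);
    // `round_bit` is the most significant dropped bit (the "half" bit).
    let round_bit = dropped >> 31;
    // 1 if we must round up, 0 otherwise. On an exact tie with an even
    // `m_base`, the subtraction clears the top bit so no rounding happens.
    let adj = (dropped - (round_bit & !m_base)) >> 31;
    let m = m_base + adj;

    // Step 3: biased exponent of floor(log2(i)), minus one.
    let e = EXP_BIAS + (31 - n) - 1;

    // Step 4: `+` rather than `|`, so the implicit bit in `m` (and any carry
    // produced by rounding) increments the exponent field.
    (e << SIG_BITS) + m
}
```

For example, `f32::from_bits(sketch_u32_to_f32_bits(1))` yields `1.0`, and the result for any `i` should match `(i as f32).to_bits()`, since Rust's `as` cast also rounds to nearest with ties to even.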

Terminology

  • i: the original integer
  • i_m: the integer, shifted fully left (no leading zeros)
  • n: number of leading zeros in i
  • e: the resulting exponent. Usually 1 is subtracted to offset the mantissa implicit bit.
  • m_base: the mantissa before adjusting for truncated bits. Implicit bit is usually set.
  • adj: the bits that will be truncated, possibly compressed in some way.
  • m: the resulting mantissa. Implicit bit is usually set.
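
To make these names concrete, the sketch below computes each quantity for a u32 to f32 conversion and returns them together. The `Trace` struct and `trace_u32_to_f32` function are hypothetical helpers invented for this illustration, and the constants assume IEEE 754 binary32 (23 stored mantissa bits, bias 127).

```rust
/// Intermediate values of a `u32` -> `f32` conversion, named as in the
/// terminology list. Hypothetical type, for illustration only.
#[derive(Debug, PartialEq)]
struct Trace {
    n: u32,      // leading zeros of `i`
    i_m: u32,    // `i` shifted fully left
    m_base: u32, // mantissa before rounding, implicit bit set (bit 23)
    adj: u32,    // truncated bits, left-aligned in the word
    m: u32,      // mantissa after rounding
    e: u32,      // biased exponent, already reduced by one
}

/// Hypothetical helper. Panics for `i == 0`, which has no leading one bit.
fn trace_u32_to_f32(i: u32) -> Trace {
    assert!(i != 0, "zero is handled separately");
    let n = i.leading_zeros();
    let i_m = i << n;
    let m_base = i_m >> 8; // keep 24 bits: implicit bit + 23 mantissa bits
    let adj = i_m << 24; // the 8 truncated bits
    // 1 if rounding up is required (round to nearest, ties to even).
    let round_up = (adj - ((adj >> 31) & !m_base)) >> 31;
    let m = m_base + round_up;
    let e = 127 + (31 - n) - 1; // bias + floor(log2(i)), minus one
    Trace { n, i_m, m_base, adj, m, e }
}
```

For `i = 0x0100_0001` (2^24 + 1) this gives `n = 7`, `i_m = 0x8000_0080`, `m_base = 0x0080_0000`, `adj = 0x8000_0000` (an exact tie), `m = 0x0080_0000` (the tie resolves to the even mantissa) and `e = 150`. Combining them as `(e << 23) + m` produces `0x4B80_0000`, the bit pattern of 16777216.0.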

Functions

exp 🔒
Calculate the exponent from the number of leading zeros.
m_adj 🔒
Adjust a mantissa with dropped bits to perform correct rounding.
repr 🔒
Shift the exponent to its position and add the mantissa.
shift_f_gt_i 🔒
Shift distance from an integer with n leading zeros to a larger float.
shift_f_lt_i 🔒
Shift distance from a left-aligned integer to a smaller float.
signed
Perform a signed operation as unsigned, then add the sign back.
u32_to_f32_bits
u32_to_f64_bits
u32_to_f128_bits
u64_to_f32_bits
u64_to_f64_bits
u64_to_f128_bits
u128_to_f32_bits
u128_to_f64_bits
u128_to_f128_bits
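
The listing does not show signatures, so the following is only a sketch of how two of these pieces might fit together, under stated assumptions. It covers a case where the float's significand is wider than the integer (u32 to f64, so nothing is truncated and no rounding is needed) and a sign-handling wrapper in the spirit of the `signed` description. Both function names are hypothetical, and the constants assume IEEE 754 binary64.

```rust
/// Hypothetical sketch: `u32` -> `f64` bits. Every `u32` fits in the 53-bit
/// significand, so no bits are dropped and no rounding step is needed.
fn sketch_u32_to_f64_bits(i: u32) -> u64 {
    const SIG_BITS: u32 = 52; // explicitly stored mantissa bits
    const EXP_BIAS: u64 = 1023;

    if i == 0 {
        return 0;
    }
    let n = i.leading_zeros();
    // Move the leading one bit into bit 52, the implicit-bit position.
    let m = (i as u64) << (SIG_BITS - 32 + 1 + n);
    // Biased exponent of floor(log2(i)), minus one for the implicit bit.
    let e = EXP_BIAS + (31 - n) as u64 - 1;
    // Adding `m` carries the implicit bit into the exponent field.
    (e << SIG_BITS) + m
}

/// Hypothetical sketch of the sign handling: convert the magnitude as an
/// unsigned value, then OR the sign bit back in.
fn sketch_i32_to_f64(i: i32) -> f64 {
    let sign_bit = ((i < 0) as u64) << 63;
    // `unsigned_abs` is well defined even for `i32::MIN`.
    f64::from_bits(sketch_u32_to_f64_bits(i.unsigned_abs()) | sign_bit)
}
```

With these definitions, `sketch_i32_to_f64(-5)` evaluates to `-5.0`. Working on the magnitude keeps the unsigned routine as the only place that deals with mantissa and exponent layout.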