In: Computer Science
Urgent Please
Explain and show the difference between IEEE 16, 32, 64, 128-bit floating-point numbers.
4. Floating-Point Number Representation
A floating-point number (or real number) can represent a very
large (1.23×10^88
) or a very small
(1.23×10^-88
) value. It could also represent very
large negative number (-1.23×10^88
) and very small
negative number (-1.23×10^88
), as well as zero, as
illustrated:
A floating-point number is typically expressed in the scientific
notation, with a fraction (F
), and an
exponent (E
) of a certain radix
(r
), in the form of F×r^E
. Decimal
numbers use radix of 10 (F×10^E
); while binary numbers
use radix of 2 (F×2^E
).
Representation of floating point number is not unique. For
example, the number 55.66
can be represented as
5.566×10^1
, 0.5566×10^2
,
0.05566×10^3
, and so on. The fractional part can be
normalized. In the normalized form, there is only a single
non-zero digit before the radix point. For example, decimal number
123.4567
can be normalized as
1.234567×10^2
; binary number 1010.1011B
can be normalized as 1.0101011B×2^3
.
It is important to note that floating-point numbers suffer from
loss of precision when represented with a fixed number of
bits (e.g., 32-bit or 64-bit). This is because there are
infinite number of real numbers (even within a small range
of says 0.0 to 0.1). On the other hand, a n-bit binary
pattern can represent a finite 2^n
distinct numbers. Hence, not all the real numbers can be
represented. The nearest approximation will be used instead,
resulted in loss of accuracy.
It is also important to note that floating number arithmetic is very much less efficient than integer arithmetic. It could be speed up with a so-called dedicated floating-point co-processor. Hence, use integers if your application does not require floating-point numbers.
In computers, floating-point numbers are represented in
scientific notation of fraction (F
) and
exponent (E
) with a radix of 2, in
the form of F×2^E
. Both E
and
F
can be positive as well as negative. Modern
computers adopt IEEE 754 standard for representing floating-point
numbers. There are two representation schemes: 32-bit
single-precision and 64-bit double-precision.
The IEEE 754 standard specifies a binary16 as having the following format:
The format is laid out as follows:
The format is assumed to have an implicit lead bit with value 1 unless the exponent field is stored with all zeros. Thus only 10 bits of the significand appear in the memory format but the total precision is 11 bits. In IEEE 754 parlance, there are 10 bits of significand, but there are 11 bits of significand precision (log10(211) ≈ 3.311 decimal digits, or 4 digits ± slightly less than 5 units in the last place).
Exponent encoding[edit]
The half-precision binary floating-point exponent is encoded using an offset-binary representation, with the zero offset being 15; also known as exponent bias in the IEEE 754 standard.
Thus, as defined by the offset binary representation, in order to get the true exponent the offset of 15 has to be subtracted from the stored exponent.
The stored exponents 000002 and 111112 are interpreted specially.
Exponent | Significand = zero | Significand ≠ zero | Equation |
---|---|---|---|
000002 | zero, −0 | subnormal numbers | (−1)signbit × 2−14 × 0.significantbits2 |
000012, ..., 111102 | normalized value | (−1)signbit × 2exponent−15 × 1.significantbits2 | |
111112 | ±infinity | NaN (quiet, signalling) |
The minimum strictly positive (subnormal) value is 2−24 ≈ 5.96 × 10−8. The minimum positive normal value is 2−14 ≈ 6.10 × 10−5. The maximum representable value is (2−2−10) × 215 = 65504.
4.1 IEEE-754 32-bit Single-Precision Floating-Point Numbers
In 32-bit single-precision floating-point representation:
S
), with 0 for positive numbers and 1 for negative
numbers.E
).F
).Normalized Form
Let's illustrate with an example, suppose that the 32-bit
pattern is 1 1000 0001 011 0000 0000 0000 0000 0000
,
with:
S = 1
E = 1000 0001
F = 011 0000 0000 0000 0000 0000
In the normalized form, the actual fraction is
normalized with an implicit leading 1 in the form of
1.F
. In this example, the actual fraction is
1.011 0000 0000 0000 0000 0000 = 1 + 1×2^-2 + 1×2^-3 =
1.375D
.
The sign bit represents the sign of the number, with
S=0
for positive and S=1
for negative
number. In this example with S=1
, this is a negative
number, i.e., -1.375D
.
In normalized form, the actual exponent is E-127
(so-called excess-127 or bias-127). This is because we need to
represent both positive and negative exponent. With an 8-bit E,
ranging from 0 to 255, the excess-127 scheme could provide actual
exponent of -127 to 128. In this example,
E-127=129-127=2D
.
Hence, the number represented is
-1.375×2^2=-5.5D
.
De-Normalized Form
Normalized form has a serious problem, with an implicit leading 1 for the fraction, it cannot represent the number zero! Convince yourself on this!
De-normalized form was devised to represent zero and other numbers.
For E=0
, the numbers are in the de-normalized form.
An implicit leading 0 (instead of 1) is used for the fraction; and
the actual exponent is always -126
. Hence, the number
zero can be represented with E=0
and F=0
(because 0.0×2^-126=0
).
We can also represent very small positive and negative numbers
in de-normalized form with E=0
. For example, if
S=1
, E=0
, and F=011 0000 0000 0000
0000 0000
. The actual fraction is
0.011=1×2^-2+1×2^-3=0.375D
. Since S=1
, it
is a negative number. With E=0
, the actual exponent is
-126
. Hence the number is -0.375×2^-126 =
-4.4×10^-39
, which is an extremely small negative number
(close to zero).
Summary
In summary, the value (N
) is calculated as
follows:
1 ≤ E ≤ 254, N = (-1)^S × 1.F × 2^(E-127)
.
These numbers are in the so-called normalized form. The
sign-bit represents the sign of the number. Fractional part
(1.F
) are normalized with an implicit leading 1. The
exponent is bias (or in excess) of 127
, so as to
represent both positive and negative exponent. The range of
exponent is -126
to +127
.E = 0, N = (-1)^S × 0.F × 2^(-126)
. These
numbers are in the so-called denormalized form. The
exponent of 2^-126
evaluates to a very small number.
Denormalized form is needed to represent zero (with
F=0
and E=0
). It can also represents very
small positive and negative number close to zero.E = 255
, it represents special values, such as
±INF
(positive and negative infinity) and
NaN
(not a number). This is beyond the scope of this
article.Example 1: Suppose that IEEE-754 32-bit floating-point
representation pattern is 0 10000000 110 0000 0000 0000 0000
0000
.
Sign bit S = 0 ⇒ positive number E = 1000 0000B = 128D (in normalized form) Fraction is 1.11B (with an implicit leading 1) = 1 + 1×2^-1 + 1×2^-2 = 1.75D The number is +1.75 × 2^(128-127) = +3.5D
Example 2: Suppose that IEEE-754 32-bit floating-point
representation pattern is 1 01111110 100 0000 0000 0000 0000
0000
.
Sign bit S = 1 ⇒ negative number E = 0111 1110B = 126D (in normalized form) Fraction is 1.1B (with an implicit leading 1) = 1 + 2^-1 = 1.5D The number is -1.5 × 2^(126-127) = -0.75D
Example 3: Suppose that IEEE-754 32-bit floating-point
representation pattern is 1 01111110 000 0000 0000 0000 0000
0001
.
Sign bit S = 1 ⇒ negative number E = 0111 1110B = 126D (in normalized form) Fraction is 1.000 0000 0000 0000 0000 0001B (with an implicit leading 1) = 1 + 2^-23 The number is -(1 + 2^-23) × 2^(126-127) = -0.500000059604644775390625 (may not be exact in decimal!)
Example 4 (De-Normalized Form): Suppose that IEEE-754 32-bit
floating-point representation pattern is 1 00000000 000 0000
0000 0000 0000 0001
.
Sign bit S = 1 ⇒ negative number E = 0 (in de-normalized form) Fraction is 0.000 0000 0000 0000 0000 0001B (with an implicit leading 0) = 1×2^-23 The number is -2^-23 × 2^(-126) = -2×(-149) ≈ -1.4×10^-45
4.2 Exercises (Floating-point Numbers)
Hints:
S=0
, E=1111 1110
(254)
, F=111 1111 1111 1111 1111 1111
.S=0
, E=0000 00001
(1)
, F=000 0000 0000 0000 0000 0000
.S=1
.S=0
, E=0
,
F=111 1111 1111 1111 1111 1111
.S=0
, E=0
,
F=000 0000 0000 0000 0000 0001
.S=1
.Notes For Java Users
You can use JDK methods Float.intBitsToFloat(int
bits)
or Double.longBitsToDouble(long bits)
to
create a single-precision 32-bit float
or
double-precision 64-bit double
with the specific bit
patterns, and print their values. For examples,
System.out.println(Float.intBitsToFloat(0x7fffff)); System.out.println(Double.longBitsToDouble(0x1fffffffffffffL));
4.3 IEEE-754 64-bit Double-Precision Floating-Point Numbers
The representation scheme for 64-bit double-precision is similar to the 32-bit single-precision:
S
), with 0 for positive numbers and 1 for negative
numbers.E
).F
).The value (N
) is calculated as follows:
1 ≤ E ≤ 2046, N = (-1)^S × 1.F ×
2^(E-1023)
.E = 0, N = (-1)^S × 0.F ×
2^(-1022)
. These are in the denormalized form.E = 2047
, N
represents special
values, such as ±INF
(infinity), NaN
(not
a number).4.4 More on Floating-Point Representation
There are three parts in the floating-point representation:
S
) is self-explanatory (0
for positive numbers and 1 for negative numbers).E
), a so-called
bias (or excess) is applied so as to represent
both positive and negative exponent. The bias is set at half of the
range. For single precision with an 8-bit exponent, the bias is 127
(or excess-127). For double precision with a 11-bit exponent, the
bias is 1023 (or excess-1023).F
) (also called the
mantissa or significand) is composed of an
implicit leading bit (before the radix point) and the fractional
bits (after the radix point). The leading bit for normalized
numbers is 1; while the leading bit for denormalized numbers is
0.Normalized Floating-Point Numbers
In normalized form, the radix point is placed after the first
non-zero digit, e,g., 9.8765D×10^-23D
,
1.001011B×2^11B
. For binary number, the leading bit is
always 1, and need not be represented explicitly - this saves 1 bit
of storage.
In IEEE 754's normalized form:
1 ≤ E ≤ 254
with excess of
127. Hence, the actual exponent is from -126
to
+127
. Negative exponents are used to represent small
numbers (< 1.0); while positive exponents are used to represent
large numbers (> 1.0).N = (-1)^S × 1.F × 2^(E-127)
1 ≤ E ≤ 2046
with excess of
1023. The actual exponent is from -1022
to
+1023
, andN = (-1)^S × 1.F × 2^(E-1023)
Take note that n-bit pattern has a finite number of
combinations (=2^n
), which could represent
finite distinct numbers. It is not possible to represent
the infinite numbers in the real axis (even a small range
says 0.0 to 1.0 has infinite numbers). That is, not all
floating-point numbers can be accurately represented. Instead, the
closest approximation is used, which leads to loss of
accuracy.
The minimum and maximum normalized floating-point numbers are:
Precision | Normalized N(min) | Normalized N(max) |
---|---|---|
Single | 0080 0000H 0 00000001 00000000000000000000000B E = 1, F = 0 N(min) = 1.0B × 2^-126 (≈1.17549435 × 10^-38) |
7F7F FFFFH 0 11111110 00000000000000000000000B E = 254, F = 0 N(max) = 1.1...1B × 2^127 = (2 - 2^-23) × 2^127 (≈3.4028235 × 10^38) |
Double | 0010 0000 0000 0000H N(min) = 1.0B × 2^-1022 (≈2.2250738585072014 × 10^-308) |
7FEF FFFF FFFF FFFFH N(max) = 1.1...1B × 2^1023 = (2 - 2^-52) × 2^1023 (≈1.7976931348623157 × 10^308) |
Denormalized Floating-Point Numbers
If E = 0
, but the fraction is non-zero, then the
value is in denormalized form, and a leading bit of 0 is assumed,
as follows:
E = 0
,N = (-1)^S × 0.F × 2^(-126)
E = 0
,N = (-1)^S × 0.F × 2^(-1022)
Denormalized form can represent very small numbers closed to zero, and zero, which cannot be represented in normalized form, as shown in the above figure.
The minimum and maximum of denormalized floating-point numbers are:
Precision | Denormalized D(min) | Denormalized D(max) |
---|---|---|
Single | 0000 0001H 0 00000000 00000000000000000000001B E = 0, F = 00000000000000000000001B D(min) = 0.0...1 × 2^-126 = 1 × 2^-23 × 2^-126 = 2^-149 (≈1.4 × 10^-45) |
007F FFFFH 0 00000000 11111111111111111111111B E = 0, F = 11111111111111111111111B D(max) = 0.1...1 × 2^-126 = (1-2^-23)×2^-126 (≈1.1754942 × 10^-38) |
Double | 0000 0000 0000 0001H D(min) = 0.0...1 × 2^-1022 = 1 × 2^-52 × 2^-1022 = 2^-1074 (≈4.9 × 10^-324) |
001F FFFF FFFF FFFFH D(max) = 0.1...1 × 2^-1022 = (1-2^-52)×2^-1022 (≈4.4501477170144023 × 10^-308) |
Special Values
Zero: Zero cannot be represented in the
normalized form, and must be represented in denormalized form with
E=0
and F=0
. There are two
representations for zero: +0
with S=0
and
-0
with S=1
.
Infinity: The value of +infinity (e.g.,
1/0
) and -infinity (e.g., -1/0
) are
represented with an exponent of all 1's (E = 255
for
single-precision and E = 2047
for double-precision),
F=0
, and S=0
(for +INF
) and
S=1
(for -INF
).
Not a Number (NaN): NaN
denotes a
value that cannot be represented as real number (e.g.
0/0
). NaN
is represented with Exponent of
all 1's (E = 255
for single-precision and E =
2047
for double-precision) and any non-zero fraction.
The IEEE 754 standard specifies a binary128 as having:
This gives from 33 to 36 significant decimal digits precision. If a decimal string with at most 33 significant digits is converted to IEEE 754 quadruple-precision representation, and then converted back to a decimal string with the same number of digits, the final result should match the original string. If an IEEE 754 quadruple-precision number is converted to a decimal string with at least 36 significant digits, and then converted back to quadruple-precision representation, the final result must match the original number.[3]
The format is written with an implicit lead bit with value 1 unless the exponent is stored with all zeros. Thus only 112 bits of the significand appear in the memory format, but the total precision is 113 bits (approximately 34 decimal digits: log10(2113) ≈ 34.016). The bits are laid out as:
A binary256 would have a significand precision of 237 bits (approximately 71 decimal digits) and exponent bias 262143.
Exponent encoding[edit]
The quadruple-precision binary floating-point exponent is encoded using an offset binary representation, with the zero offset being 16383; this is also known as exponent bias in the IEEE 754 standard.
Thus, as defined by the offset binary representation, in order to get the true exponent, the offset of 16383 has to be subtracted from the stored exponent.
The stored exponents 000016 and 7FFF16 are interpreted specially.
Exponent | Significand zero | Significand non-zero | Equation |
---|---|---|---|
000016 | 0, −0 | subnormal numbers | (−1)signbit × 2−16382 × 0.significandbits2 |
000116, ..., 7FFE16 | normalized value | (−1)signbit × 2exponentbits2 − 16383 × 1.significandbits2 | |
7FFF16 | ±∞ | NaN (quiet, signalling) |
The minimum strictly positive (subnormal) value is 2−16494 ≈ 10−4965 and has a precision of only one bit. The minimum positive normal value is 2−16382 ≈ 3.3621 × 10−4932 and has a precision of 113 bits, i.e. ±2−16494 as well. The maximum representable value is 216384 − 216271 ≈ 1.1897 × 104932.