In: Computer Science
Question about IEEE 754 and ARM Assembly:
So as we know, numbers with fractional components such as 45.278 are commonly represented with single precision or double precision IEEE 754 Floating point format (also called binary32 or binary64, respectively). How are numbers stored in these formats? and briefly describe:
The fields in the binary32 format (size, purpose, etc)
How to convert a binary32 number to a decimal value (write a formula)
Usually, a real number in binary will be represented in the following format,
ImIm-1…I2I1I0.F1F2…FnFn-1
Where Im and Fn will be either 0 or 1 of integer and fraction parts respectively.
A finite number can also represented by four integers components, a sign (s), a base (b), a significand (m), and an exponent (e). Then the numerical value of the number is evaluated as
(-1)s x m x be ________ Where m < |b|
Depending on base and the number of bits used to encode various components, the IEEE 754 standard defines five basic formats. Among the five formats, the binary32 and the binary64 formats are single precision and double precision formats respectively in which the base is 2.
Single Precision Format (or) Binary32 Format:
As mentioned in Table 1 the single precision format has 23 bits for significand (1 represents implied bit, details below), 8 bits for exponent and 1 bit for sign.
For example, the rational number 9÷2 can be converted to single precision float format as following,
9(10) ÷ 2(10) = 4.5(10) = 100.1(2)
The result said to be normalized, if it is represented with leading 1 bit, i.e. 1.001(2) x 22. (Similarly when the number 0.000000001101(2) x 23 is normalized, it appears as 1.101(2) x 2-6). Omitting this implied 1 on left extreme gives us the mantissa of float number. A normalized number provides more accuracy than corresponding de-normalized number. The implied most significant bit can be used to represent even more accurate significand (23 + 1 = 24 bits) which is called subnormal representation. The floating point numbers are to be represented in normalized form.
The subnormal numbers fall into the category of de-normalized numbers. The subnormal representation slightly reduces the exponent range and can’t be normalized since that would result in an exponent which doesn’t fit in the field. Subnormal numbers are less accurate, i.e. they have less room for nonzero bits in the fraction field, than normalized numbers. Indeed, the accuracy drops as the size of the subnormal number decreases. However, the subnormal representation is useful in filing gaps of floating point scale near zero.
In other words, the above result can be written as (-1)0 x 1.001(2) x 22 which yields the integer components as s = 0, b = 2, significand (m) = 1.001, mantissa = 001 and e = 2. The corresponding single precision floating number can be represented in binary as shown below,
Where the exponent field is supposed to be 2, yet encoded as 129 (127+2) called biased exponent. The exponent field is in plain binary format which also represents negative exponents with an encoding (like sign magnitude, 1’s compliment, 2’s complement, etc.). The biased exponent is used for representation of negative exponents. The biased exponent has advantages over other negative representations in performing bitwise comparing of two floating point numbers for equality.
A bias of (2n-1 – 1), where n is # of bits used in exponent, is added to the exponent (e) to get biased exponent (E). So, the biased exponent (E) of single precision number can be obtained as
E = e + 127
The range of exponent in single precision format is -126 to +127. Other values are used for special symbols.
Note: When we unpack a floating point number the exponent obtained is biased exponent. Subtracting 127 from the biased exponent we can extract unbiased exponent.
Float Scale:
The following figure represents
floating point scale.