Question

Find the 3-bit mantissa floating point representation of the following numbers, both by chopping and rounding, and then calculate the associated respective absolute error and relative error:

(a) 11/16

(b) 2.75

Solutions

Expert Solution

Prerequisite Information

Let us first go through the quantities the question asks for, using the fraction 2/3 as a running example -

1. Converting a fraction or decimal to floating point - Let fl(x) denote the n-digit floating-point number nearest to x, where n is the number of decimal digits kept in the mantissa. The mantissa is normalized so that its first digit after the decimal point is nonzero.

If number of digits n = 4, fl(1/3) = .3333 x 10^0 in 4 decimal digit floating point representation.
If number of digits n = 6, fl(1/5) = .200000 x 10^0 in 6 decimal digit floating point representation.
If number of digits n = 5, fl(4/3) = .13333 x 10^1 in 5 decimal digit floating point representation.
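
The conversion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `fl`, its `mode` argument and the round-half-up tie rule are choices made here, not part of the original solution. Exact fractions are used so that binary floating point does not disturb the decimal digits.

```python
from fractions import Fraction
import math

def fl(x, n, mode="round"):
    """Return (m, e) so that x is approximated by 0.m x 10^e with n digits.

    m is the n-digit mantissa as an integer, e is the decimal exponent.
    `mode` is "round" (round half up, an assumption) or "chop".
    """
    x = Fraction(x)
    if x == 0:
        return 0, 0
    sign = -1 if x < 0 else 1
    a = abs(x)
    e = 0
    # Normalize so that 0.1 <= a < 1.
    while a >= 1:
        a /= 10
        e += 1
    while a < Fraction(1, 10):
        a *= 10
        e -= 1
    scaled = a * 10**n                      # lies in [10^(n-1), 10^n)
    if mode == "chop":
        m = math.floor(scaled)
    else:
        m = math.floor(scaled + Fraction(1, 2))
        if m == 10**n:                      # rounding carried into a new digit
            m //= 10
            e += 1
    return sign * m, e

print(fl(Fraction(1, 3), 4))   # mantissa 3333, exponent 0
print(fl(Fraction(1, 5), 6))   # mantissa 200000, exponent 0
print(fl(Fraction(4, 3), 5))   # mantissa 13333, exponent 1
```

The pair (m, e) stands for the value m x 10^(e - n), so (13333, 1) with n = 5 reads as .13333 x 10^1.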

2. Converting a decimal or fraction to a floating point number with rounding and chopping -

Let fraction number x = 2/3; then the 3 decimal digit floating point representation fl(x) is as follows -

Rounding: fl(2/3) = (.667) x 10^0 rounded
Chopping: fl(2/3) = (.666) x 10^0 chopped

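Let decimal number x = 5.7. In normalized form 5.7 = (.570) x 10^1 fits in 3 decimal digits exactly, so rounding and chopping agree and there is no error. The two differ only if the exponent is forced to 3, which leaves a single significant digit -

Rounding: fl(5.7) = (.006) x 10^3 rounded
Chopping: fl(5.7) = (.005) x 10^3 chopped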
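
As a cross-check of rounding versus chopping, here is a small sketch using Python's standard `decimal` module. It assumes that a context precision of 3 significant digits corresponds to a normalized 3-digit mantissa, that `ROUND_HALF_UP` models rounding and that `ROUND_DOWN` models chopping. The variable names are illustrative only.

```python
from decimal import Decimal, Context, ROUND_DOWN, ROUND_HALF_UP

# Two arithmetic contexts that keep 3 significant decimal digits.
rounding_ctx = Context(prec=3, rounding=ROUND_HALF_UP)
chopping_ctx = Context(prec=3, rounding=ROUND_DOWN)

# 2/3 needs infinitely many digits, so the two modes give different results.
two_thirds_rounded = rounding_ctx.divide(Decimal(2), Decimal(3))
two_thirds_chopped = chopping_ctx.divide(Decimal(2), Decimal(3))
print(two_thirds_rounded, two_thirds_chopped)

# 5.7 already fits in 3 significant digits, so both modes return it unchanged.
x_rounded = rounding_ctx.create_decimal("5.7")
x_chopped = chopping_ctx.create_decimal("5.7")
print(x_rounded, x_chopped)
```

The second pair shows why a normalized 3-digit format introduces no error for 5.7.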

3. Let us find the absolute error and relative error for fl(2/3) -

Let p be the original number and p* be the floating point value after rounding/chopping -

Rounding: fl(2/3) = (.667) x 10^0 rounded

Absolute error = |p-p*| = | (2/3) - 0.667 | = 3.3333 x 10^-4
Relative error = |p-p*| / |p| = | (2/3) - 0.667 | / | (2/3) | = 5.0 x 10^-4

Chopping: fl(2/3) = (.666) x 10^0 chopped

Absolute error = |p-p*| = | (2/3) - 0.666 | = 6.6667 x 10^-4
Relative error = |p-p*| / |p| = | (2/3) - 0.666 | / | (2/3) | = 1.0 x 10^-3
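
The two error formulas can be checked with exact rational arithmetic. The helper names `abs_error` and `rel_error` below are illustrative, not from the original solution. The sketch covers both the rounded and the chopped value of 2/3.

```python
from fractions import Fraction

def abs_error(p, p_star):
    """Absolute error |p - p*|."""
    return abs(p - p_star)

def rel_error(p, p_star):
    """Relative error |p - p*| / |p| (p must be nonzero)."""
    return abs(p - p_star) / abs(p)

p = Fraction(2, 3)
rounded = Fraction(667, 1000)   # (.667) x 10^0
chopped = Fraction(666, 1000)   # (.666) x 10^0

for name, p_star in (("rounded", rounded), ("chopped", chopped)):
    print(name,
          f"abs = {float(abs_error(p, p_star)):.4e}",
          f"rel = {float(rel_error(p, p_star)):.4e}")
```

Chopping 2/3 doubles both errors relative to rounding, since the discarded tail .000666... is more than half a unit in the last kept digit.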

Solution

(a) x = 11/16 = 0.6875, n = 3 (the mantissa is taken as 3 decimal digits, as in the examples above)

Rounding

fl(11/16) = (0.688) x 10^0 rounded
Absolute error = | (11/16) - 0.688 | = 5.0 x 10^-4
Relative error = | (11/16) - 0.688 | / | (11/16) | = 7.2727 x 10^-4

Chopping

fl(11/16) = (0.687) x 10^0 chopped
Absolute error = | (11/16) - 0.687 | = 5.0 x 10^-4
Relative error = | (11/16) - 0.687 | / | (11/16) | = 7.2727 x 10^-4
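
The figures for part (a) can be reproduced with a short script. It is a sketch that assumes a 3 decimal digit mantissa and a round-half-up tie rule (11/16 = 0.6875 lies exactly halfway between 0.687 and 0.688). The variable names are illustrative.

```python
from fractions import Fraction
import math

p = Fraction(11, 16)   # exact value 0.6875, already in [0.1, 1)
scale = 10**3          # keep 3 decimal digits

rounded = Fraction(math.floor(p * scale + Fraction(1, 2)), scale)  # round half up
chopped = Fraction(math.floor(p * scale), scale)                   # drop extra digits

for name, approx in (("rounded", rounded), ("chopped", chopped)):
    abs_err = abs(p - approx)
    rel_err = abs_err / abs(p)
    print(f"{name}: fl = {float(approx):.3f}, "
          f"abs = {float(abs_err):.4e}, rel = {float(rel_err):.4e}")
```

Because 0.6875 is a tie, rounding and chopping land the same distance from the true value, which is why both rows show identical errors.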

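(b) x = 2.75, n = 3

In normalized form 2.75 = (0.275) x 10^1 fits in 3 decimal digits exactly, so rounding and chopping both return 2.75 and the absolute and relative errors are zero. The figures below instead use exponent 2, which leaves only two significant digits in the mantissa.

Rounding

fl(2.75) = (0.028) x 10^2 rounded
Absolute error = | 2.75 - 2.8 | = 5.0 x 10^-2
Relative error = | 2.75 - 2.8 | / | 2.75 | = 1.8182 x 10^-2

Chopping

fl(2.75) = (0.027) x 10^2 chopped
Absolute error = | 2.75 - 2.7 | = 5.0 x 10^-2
Relative error = | 2.75 - 2.7 | / | 2.75 | = 1.8182 x 10^-2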
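
The effect of the exponent choice in part (b) can be seen by keeping a different number of significant digits. The helper `keep_digits` below is hypothetical and written only for this check. It assumes round-half-up for rounding and truncation for chopping.

```python
from fractions import Fraction
import math

def keep_digits(p, exponent, n, mode):
    """Write p as 0.d1...dn x 10^exponent and keep n mantissa digits."""
    scaled = Fraction(p) / Fraction(10)**exponent * 10**n
    if mode == "chop":
        m = math.floor(scaled)
    else:
        m = math.floor(scaled + Fraction(1, 2))   # round half up
    return Fraction(m, 10**n) * Fraction(10)**exponent

p = Fraction(11, 4)   # 2.75

# Normalized 3-digit mantissa: (0.275) x 10^1 is exact.
exact_rounded = keep_digits(p, 1, 3, "round")
exact_chopped = keep_digits(p, 1, 3, "chop")

# Two significant digits, equivalent to (0.028) x 10^2 and (0.027) x 10^2.
two_rounded = keep_digits(p, 1, 2, "round")
two_chopped = keep_digits(p, 1, 2, "chop")

for name, approx in (("3 digits, rounded", exact_rounded),
                     ("3 digits, chopped", exact_chopped),
                     ("2 digits, rounded", two_rounded),
                     ("2 digits, chopped", two_chopped)):
    abs_err = abs(p - approx)
    print(f"{name}: fl = {float(approx)}, abs = {float(abs_err):.4e}, "
          f"rel = {float(abs_err / p):.4e}")
```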

Please let me know in case you face any issues in the above mentioned calculations.

