Question

In: Computer Science

Find the 3-bit mantissa floating point representation of the following numbers, both by chopping and rounding, and then calculate the associated respective absolute error and relative error:

(a) 11/16

(b) 2.75

Solutions

Expert Solution

Prerequisite Information

Let us first go through the quantities the question asks for, using the fraction 2/3 as a running example -

1. Converting a fraction or decimal to floating point - Let fl(x) denote the n-digit floating-point number nearest to x, where n is the number of decimal digits kept in the mantissa. The mantissa is normalized so that its first digit after the decimal point is non-zero.

If number of digits n = 4, fl(1/3) = .3333 x 10^0 in 4 decimal digit floating point representation.
If number of digits n = 6, fl(1/5) = .200000 x 10^0 in 6 decimal digit floating point representation.
If number of digits n = 5, fl(4/3) = .13333 x 10^1 in 5 decimal digit floating point representation.
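The normalization step can be sketched in Python. This is a minimal illustration, not part of the original answer: the helper name `normalize` is hypothetical, it handles positive inputs only, and it assumes round-half-up when the mantissa is cut to n digits. Exact rational arithmetic (`fractions.Fraction`) is used so that binary floating point does not disturb the digits.

```python
from fractions import Fraction
import math


def normalize(x, n):
    """Return (digits, exponent) such that fl(x) = 0.digits x 10^exponent.

    Hypothetical helper: positive x only, round half up to n digits.
    """
    x = Fraction(x)
    if x <= 0:
        raise ValueError("this sketch handles positive numbers only")
    exponent = 0
    # Shift the decimal point until 0.1 <= x < 1.
    while x >= 1:
        x /= 10
        exponent += 1
    while x < Fraction(1, 10):
        x *= 10
        exponent -= 1
    # Keep n mantissa digits, rounding half up.
    digits = math.floor(x * 10**n + Fraction(1, 2))
    if digits == 10**n:  # rounding carried over, e.g. .9996 -> 1.000
        digits //= 10
        exponent += 1
    return digits, exponent


for x, n in [(Fraction(1, 3), 4), (Fraction(1, 5), 6), (Fraction(4, 3), 5)]:
    digits, exponent = normalize(x, n)
    print(f"fl({x}) = .{digits:0{n}d} x 10^{exponent}")
```

Run as written, this prints `.3333 x 10^0`, `.200000 x 10^0` and `.13333 x 10^1` for the three inputs.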

2. Rounding and chopping - Rounding keeps the n-digit number nearest to x; chopping simply discards every digit after the n-th. Let us convert a fraction and a decimal both ways -

Let x = 2/3 = 0.66666...; then the 3-digit decimal floating point representation fl(x) is as follows -

Rounding: fl(2/3) = (.667) x 10^0 rounded
Chopping: fl(2/3) = (.666) x 10^0 chopped

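Let x = 5.7; then the 3-digit decimal floating point representation fl(x) is as follows -

Rounding: fl(5.7) = (.570) x 10^1 rounded
Chopping: fl(5.7) = (.570) x 10^1 chopped

Here both results agree, because 5.7 has only two significant digits and fits in a 3-digit mantissa without losing anything.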
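Rounding and chopping can also be tried with Python's standard `decimal` module, whose context precision counts significant digits in the same way as the n-digit mantissa used here. The sketch below is an illustration only; the helper name `fl` is hypothetical, and it assumes `ROUND_HALF_UP` for rounding and `ROUND_DOWN` (truncation toward zero) for chopping.

```python
from decimal import Context, Decimal, ROUND_DOWN, ROUND_HALF_UP


def fl(numerator, denominator, n, rounding):
    """n-significant-digit value of numerator/denominator (hypothetical helper)."""
    ctx = Context(prec=n, rounding=rounding)
    # The context rounds the exact quotient to n significant digits.
    return ctx.divide(Decimal(numerator), Decimal(denominator))


print(fl(2, 3, 3, ROUND_HALF_UP))   # rounding 2/3 to 3 digits
print(fl(2, 3, 3, ROUND_DOWN))      # chopping 2/3 to 3 digits
print(fl(57, 10, 3, ROUND_HALF_UP)) # 5.7 already fits in 3 digits
print(fl(57, 10, 3, ROUND_DOWN))
```

The first two lines print 0.667 and 0.666; for 5.7 both modes return 5.7, since no digit has to be dropped.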

3. Let us find the absolute error and relative error for fl(2/3) -

Let p be the original number and p* be the floating point value after rounding/chopping -

Rounding: fl(2/3) = (.667) x 10^0 rounded

Absolute error = |p-p*| = | (2/3) - 0.667 | = 3.3333 x 10^-4
Relative error = |p-p*| / |p| = | (2/3) - 0.667 | / | (2/3) | = 5.0 x 10^-4

Chopping: fl(2/3) = (.666) x 10^0 chopped

Absolute error = |p-p*| = | (2/3) - 0.666 | = 6.6667 x 10^-4
Relative error = |p-p*| / |p| = | (2/3) - 0.666 | / | (2/3) | = 1.0 x 10^-3
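The two error formulas can be checked with a few lines of Python. This is a small sketch under the definitions above (p is the exact value, p* the stored value); the function name `errors` is made up for illustration, and `Fraction` is used so the differences are computed exactly before being printed.

```python
from fractions import Fraction


def errors(p, p_star):
    """Return (absolute error, relative error) of p_star as an approximation of p."""
    p, p_star = Fraction(p), Fraction(p_star)
    absolute = abs(p - p_star)
    relative = absolute / abs(p)
    return absolute, relative


p = Fraction(2, 3)
for label, p_star in [("rounded", "0.667"), ("chopped", "0.666")]:
    absolute, relative = errors(p, p_star)
    print(f"{label}: absolute = {float(absolute):.4e}, relative = {float(relative):.4e}")
```

For the rounded value this gives an absolute error of 1/3000 (about 3.3333 x 10^-4) and a relative error of 1/2000 (5.0 x 10^-4); for the chopped value, 1/1500 and 1/1000.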

Solution

(a) x = 11/16 = 0.6875, n = 3

Rounding

fl(11/16) = (0.688) x 10^0 rounded
Absolute error = | (11/16) - 0.688 | = 5.0 x 10^-4
Relative error = | (11/16) - 0.688 | / | (11/16) | = 7.2727 x 10^-4

Chopping

fl(11/16) = (0.687) x 10^0 chopped
Absolute error = | (11/16) - 0.687 | = 5.0 x 10^-4
Relative error = | (11/16) - 0.687 | / | (11/16) | = 7.2727 x 10^-4
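Part (a) can be verified numerically. The following is a rough check rather than a definitive implementation: it assumes a 3-digit decimal mantissa, `ROUND_HALF_UP` for rounding and `ROUND_DOWN` for chopping, and the variable names are chosen only for this example.

```python
from decimal import Context, Decimal, ROUND_DOWN, ROUND_HALF_UP
from fractions import Fraction

x = Fraction(11, 16)  # exact value 0.6875
results = {}
for label, mode in [("rounded", ROUND_HALF_UP), ("chopped", ROUND_DOWN)]:
    ctx = Context(prec=3, rounding=mode)
    approx = ctx.divide(Decimal(11), Decimal(16))  # 3 significant digits
    absolute = abs(x - Fraction(approx))           # |p - p*|
    relative = absolute / x                        # |p - p*| / |p|
    results[label] = (approx, absolute, relative)
    print(f"{label}: fl = {approx}, absolute = {float(absolute):.4e}, "
          f"relative = {float(relative):.4e}")
```

Because 0.6875 lies exactly halfway between 0.687 and 0.688, both modes end up with the same absolute error (1/2000) and the same relative error (1/1375).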

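(b) x = 2.75, n = 3

2.75 = (0.275) x 10^1 has only three significant digits, so a 3-digit mantissa stores it exactly.

Rounding

fl(2.75) = (0.275) x 10^1 rounded
Absolute error = | 2.75 - 2.75 | = 0
Relative error = | 2.75 - 2.75 | / | 2.75 | = 0

Chopping

fl(2.75) = (0.275) x 10^1 chopped
Absolute error = | 2.75 - 2.75 | = 0
Relative error = | 2.75 - 2.75 | / | 2.75 | = 0

Rounding and chopping differ only when a digit has to be dropped. If the mantissa is limited to two digits (n = 2), we get -

Rounding

fl(2.75) = (0.28) x 10^1 rounded
Absolute error = | 2.75 - 2.8 | = 5.0 x 10^-2
Relative error = | 2.75 - 2.8 | / | 2.75 | = 1.8182 x 10^-2

Chopping

fl(2.75) = (0.27) x 10^1 chopped
Absolute error = | 2.75 - 2.7 | = 5.0 x 10^-2
Relative error = | 2.75 - 2.7 | / | 2.75 | = 1.8182 x 10^-2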
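As a sanity check on part (b), the sketch below recomputes the stored value and both errors for a chosen number of mantissa digits n. It is an illustration under stated assumptions: the helper `check` is hypothetical, rounding is taken as `ROUND_HALF_UP`, and chopping as `ROUND_DOWN`. It shows that 2.75 is stored exactly when n = 3, while the values 2.8 and 2.7 appear when only two digits are kept.

```python
from decimal import Context, Decimal, ROUND_DOWN, ROUND_HALF_UP
from fractions import Fraction


def check(value, n):
    """Return {mode: (fl(value), absolute error, relative error)} for n digits.

    `value` is a decimal string such as "2.75" (hypothetical helper).
    """
    exact = Fraction(value)
    out = {}
    for label, mode in (("rounded", ROUND_HALF_UP), ("chopped", ROUND_DOWN)):
        # Context.plus applies the context's precision and rounding mode.
        approx = Context(prec=n, rounding=mode).plus(Decimal(value))
        absolute = abs(exact - Fraction(approx))
        out[label] = (approx, absolute, absolute / abs(exact))
    return out


for n in (3, 2):
    for label, (approx, absolute, relative) in check("2.75", n).items():
        print(f"n={n} {label}: fl = {approx}, absolute = {float(absolute):.4e}, "
              f"relative = {float(relative):.4e}")
```

With n = 2 the absolute error is 1/20 (5.0 x 10^-2) and the relative error is 1/55 (about 1.82 x 10^-2) for both modes; with n = 3 both errors are zero.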

Please let me know in case you face any issues in the above mentioned calculations.

