Question

In: Computer Science

Find the 3-bit mantissa floating point representation of the following numbers, both by chopping and rounding, and then calculate the associated respective absolute error and relative error:

(a) 11/16

(b) 2.75

Expert Solution

Prerequisite Information

Let us first go through the elements of the question that we need to calculate, using an example (we'll use the fraction 2/3 to illustrate the concept) -

1. Converting a fraction or decimal to floating point - To do this, let fl(x) be the n-digit floating-point number nearest to x, where n is the number of decimal digits kept in the mantissa.

If number of digits n = 4, fl (1/3) = .3333 x 10⁰ in 4 decimal digit floating point representation.
If number of digits n = 6, fl (1/5) = .200000 x 10⁰ in 6 decimal digit floating point representation.
If number of digits n = 5, fl (4/3) = .13333 x 10¹ in 5 decimal digit floating point representation.
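A minimal sketch of this normalization in Python, assuming the mantissa convention used above (0.d1d2...dn x 10^e with d1 nonzero) and round-half-up for ties. The helper name fl_round is hypothetical and not part of the original answer.

```python
from fractions import Fraction

def fl_round(x: Fraction, n: int) -> tuple[int, int]:
    """Return (digits, e) such that x is approximately 0.digits x 10**e,
    keeping n decimal digits and rounding half up (an assumed tie rule)."""
    if x == 0:
        return 0, 0
    sign = -1 if x < 0 else 1
    x = abs(x)
    e = 0
    # Shift the decimal point until 0.1 <= x < 1.
    while x >= 1:
        x /= 10
        e += 1
    while x < Fraction(1, 10):
        x *= 10
        e -= 1
    digits = int(x * 10**n + Fraction(1, 2))  # round half up
    if digits == 10**n:  # rounding carried over, e.g. 0.9996 -> 1.000
        digits //= 10
        e += 1
    return sign * digits, e

for frac, n in [(Fraction(1, 3), 4), (Fraction(1, 5), 6), (Fraction(4, 3), 5)]:
    digits, e = fl_round(frac, n)
    print(f"fl({frac}) = .{digits:0{n}d} x 10^{e}")
```

Exact fractions are used instead of floats so that the binary rounding of the host machine does not leak into the decimal digits.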

2. Let us take examples and convert a decimal/fraction to a floating point number with rounding and chopping -

Let x = 2/3; then the 3-digit decimal floating point representation fl(x) is as follows -

Rounding: fl(2/3) = (.667) x 10⁰ rounded
Chopping: fl(2/3) = (.666) x 10⁰ chopped

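Let x = 5.7. In normalized form 5.7 = (.570) x 10¹, which is exact with 3 digits, so rounding and chopping agree. They differ only if the exponent is held at 3, where 5.7 = .0057 x 10³ and a single significant digit survives -

Rounding: fl(5.7) = (.006) x 10³ rounded
Chopping: fl(5.7) = (.005) x 10³ chopped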
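The difference between rounding and chopping can be reproduced with Python's standard decimal module. This is only a sketch: it assumes that "3 decimal digits" means three significant digits, with ROUND_DOWN standing in for chopping and ROUND_HALF_UP for rounding.

```python
from decimal import Context, Decimal, ROUND_DOWN, ROUND_HALF_UP

# Three significant decimal digits under the two truncation rules.
chop = Context(prec=3, rounding=ROUND_DOWN)
round_half_up = Context(prec=3, rounding=ROUND_HALF_UP)

two_thirds_chopped = chop.divide(Decimal(2), Decimal(3))
two_thirds_rounded = round_half_up.divide(Decimal(2), Decimal(3))
print(two_thirds_chopped, two_thirds_rounded)

# 5.7 already fits in three significant digits, so both rules leave it unchanged.
five_seven_chopped = chop.create_decimal("5.7")
five_seven_rounded = round_half_up.create_decimal("5.7")
print(five_seven_chopped, five_seven_rounded)
```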

3. Let us find the absolute error and relative error for fl(2/3) -

Let p be the original number and p* be the floating value after rounding/chopping -

Rounding: fl(2/3) = (.667) x 10⁰ rounded

Absolute error = |p-p*| = | (2/3) - 0.667 | = 3.3333 x 10^-4
Relative error = |p-p*| / |p| = | (2/3) - 0.667 | / | (2/3) | = 5.0 x 10^-4

Similarly, for chopping, p* = 0.666 gives an absolute error of 6.6667 x 10^-4 and a relative error of 1.0 x 10^-3.
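The two error formulas can be checked with exact rational arithmetic. In this short sketch the function name errors is a hypothetical helper, and p and p_star mirror the p and p* defined above.

```python
from fractions import Fraction

def errors(p, p_star):
    """Return (absolute error, relative error) of the approximation p_star to p != 0."""
    abs_err = abs(p - p_star)
    return abs_err, abs_err / abs(p)

p = Fraction(2, 3)
abs_r, rel_r = errors(p, Fraction(667, 1000))  # rounded value 0.667
abs_c, rel_c = errors(p, Fraction(666, 1000))  # chopped value 0.666
print("rounded:", float(abs_r), float(rel_r))
print("chopped:", float(abs_c), float(rel_c))
```

Here chopping leaves twice the error of rounding, because 2/3 lies closer to 0.667 than to 0.666.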

Solution

(a) x = 11/16 = 0.6875, n = 3

Rounding

fl(11/16) = (0.688) x 10⁰ rounded
Absolute error = | (11/16) - 0.688 | = 5.0 x 10^-4
Relative error = | (11/16) - 0.688 | / | (11/16) | = 7.2727 x 10^-4

Chopping

fl(11/16) = (0.687) x 10⁰ chopped
Absolute error = | (11/16) - 0.687 | = 5.0 x 10^-4
Relative error = | (11/16) - 0.687 | / | (11/16) | = 7.2727 x 10^-4

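(b) x = 2.75, n = 3. In normalized form 2.75 = (0.275) x 10¹, which three digits hold exactly, so rounding and chopping both return 2.75 with zero absolute and relative error. The working below instead keeps the exponent at 2, where 2.75 = 0.0275 x 10² and only two significant digits survive.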

Rounding

fl(2.75) = (0.028) x 10² rounded
Absolute error = | 2.75 - 2.8 | = 5.0 x 10^-2
Relative error = | 2.75 - 2.8 | / | 2.75 | = 1.8182 x 10^-2

Chopping

fl(2.75) = (0.027) x 10² chopped
Absolute error = | 2.75 - 2.7 | = 5.0 x 10^-2
Relative error = | 2.75 - 2.7 | / | 2.75 | = 1.8182 x 10^-2

Please let me know if you run into any issues with the calculations above.

venereology answered 1 year ago

Consider the following 32-bit floating point representation based on the IEEE floating point standard: There is a sign bit in the most significant bit. The next eight bits are the exponent, and the exponent bias is 2^(8-1) - 1 = 127. The last 23 bits are the fraction bits. The representation encodes numbers of the form V = (-1)^S x M x 2^E, where S is the sign, M is the significand, and E is the biased exponent. The rules for the...

These questions concern the following 16-bit floating point representation: The first bit is the sign of the number (0 = +, 1 = -), the next nine bits are the mantissa, the next bit is the sign of the exponent, and the last five bits are the magnitude of the exponent. All numbers are normalized, i.e. the first bit of the mantissa is one, except for zero which is all zeros. a) What is the largest number? (in both 16-bit...

Concern the following 16-bit floating point representation: The first bit is the sign of the number (0 = +, 1 = -), the next nine bits are the mantissa, the next bit is the sign of the exponent, and the last five bits are the magnitude of the exponent. All numbers are normalized, i.e. the first bit of the mantissa is one, except for zero which is all zeros. 1. What's the smallest difference between two consecutive or adjacent numbers?...

c) Using the 32-bit binary representation for floating point numbers, represent the number 1011100110011 as a 32 bit floating point number. i) A digital camera processes images in the real world and stores them in binary form. Using the principles of digital signal processing, practically explain how this phenomenon occurs.

Convert the following decimal numbers into their 32-bit floating point representation (IEEE single precision). You may use a calculator to do the required multiplications, but you must show your work, not just the solution. 1. -59.75 (ANSW: 11000010011011110000000000000000) 2. 0.3 (ANSW: 00111110100110011001100110011010 (rounded) 00111110100110011001100110011001 (truncated; either answer is fine)) Please show all work

3. IEEE Floating Point Representation What decimal number does the 32-bit IEEE floating point number 0xC27F0000 represent? Fill in the requested information in the blanks below. What is the sign of the number (say positive or negative): What is the exponent in decimal format: What is the significand in binary: What is the value of the stored decimal number in decimal (final answer): Credit will be given for your final answer in the blanks and the work shown below.

Write a program that converts a given floating point binary number with a 24-bit normalized mantissa and an 8-bit exponent to its decimal (i.e. base 10) equivalent. For the mantissa, use the representation that has a hidden bit, and for the exponent use a bias of 127 instead of a sign bit. Of course, you need to take care of negative numbers in the mantissa also. Use your program to answer the following questions: (a) Mantissa: 11110010 11000101 01101010, exponent:...

Determine the IEEE single and double floating point representation of the following numbers: a) -26.25 b) 15/2

Q1: In the addition of floating-point numbers, how do we adjust the representation of numbers with different exponents? Q2: Answer the following questions: What binary operation can be used to set bits? What bit pattern should the mask have? What binary operation can be used to unset bits? What bit pattern should the mask have? What binary operation can be used to flip bits? What bit pattern should the mask have?

Determine the IEEE single and double floating point representation of the following numbers: a) (15/2) x 2^50 b) - (15/2) x 2^-50 c) 1/5