In: Computer Science
Messages comprising seven different characters, A through G, are
to be transmitted over a data link. Analysis has shown that the
related frequency of occurrence of each character is A 0.10, B
0.25, C 0.05, D 0.32, E 0.01, F 0.07, G 0.2
i) Derive the entropy of the messages
ii) Use static Huffman coding to derive a suitable set of
codewords.
iii) Derive the average number of bits per codeword for your
codeword set to transmit a message and compare this with both the
fixed-length binary and ASCII codewords.
iv) State the prefix property of Huffman Coding and hence show
that your codeword set derived in the exercise satisfied
this.
v) Derive a flowchart for an algorithm to decode a received bit
string encoded using your codeword set.
vi) Give an example of the decoding operation assuming the received
bit string comprises a mix of seven characters
Note: I assume you know Huffman Encoding, if have any doubt or find difficulty in understanding anything, leave a comment.
So, we are given the following characters with their probability of occurrence as follows:
A 0.10
B 0.25
C 0.05
D 0.32
E 0.01
F 0.07
G 0.20
Ans i)
Entropy =
where pi is the probability of the chracter
Character | pi | log(1/pi) |
A | 0.1 | 3.32 |
B | 0.25 | 2 |
C | 0.05 | 4.32 |
D | 0.32 | 1.64 |
E | 0.01 | 6.64 |
F | 0.07 | 3.83 |
G | 0.2 | 2.32 |
Entropy = (0.1 * 3.32) + (0.25 * 2) + (0.05 * 4.32) + (0.32 * 1.64) + (0.01 * 6.64) + (0.07 * 3.83) + (0.2 * 2.32)
=> 0.332 + 0.5 + 0.216 + 0.524 + 0.066 + 0.268 + 0.622
=> 2.525 bits
So, the entropy of the message is 2.525 bits
Ans ii)
Huffman Tree is shown below:
Codewords:
A -- 010
B -- 10
C -- 01101
D -- 11
E -- 01100
F -- 0111
G -- 00
Ans iii)
Average no. of bits per crowdword (Huffman Encoding)
Sum of (Bit taken by character * Given probability) / 1 (Total sum of probabilities)
=> ( ( 0.1 * 3 ) + ( 0.25 * 2 ) + ( 0.05 * 5 ) + ( 0.32 * 2 ) + ( 0.01 * 5 ) + ( 0.07 * 4 ) + ( 0.20 * 2 ) ) / (0.1 + 0.25 + 0.05 + 0.32 + 0.01 + 0.07 + 0.20)
=> 2.42 / 1 => 2.42
Average no. of bits per crowdword (Fixed-length binary)
We are given 7 characters, so minimum bits require to represent each character uniquely
2n > 7
gives n = 3
So, we require 3 bits to represent each character uniquely.
So, (0.1 + 0.25 + 0.05 + 0.32 + 0.01 + 0.07 + 0.20) * 3 = 3
Average no. of bits per crowdword (ASCII codewords)
One character is represented by 8 bits in ASCII codeword
So, average no. of bits = 8 * 1 = 8
Message length reduced by:
Compared with Fixed-length binary:
(3 - 2.42) / 3 = 0.19 = 19%
Compared with ASCII codewords:
(8 - 2.42) / 8 = 0.69 = 69%
Ans iv) Prefix Property of Huffman Coding: It states that no codeword is prefix to another codeword. or It states that when we form a message using above formed codewords, we must be able to uniquely identify each character. So, if we use Huffman coding, we'll always be able to uniquely identify each character, wherever it has been used in the string.
So, lets check this for our solution:
We can clearly see that no combination of the bits is prefix to any codeword.
Let, we start with the most significant bit and let it be 0, now if second most significant bit is 0 means, G is encountered, and it second most significant bit is 1 it can be either from A, C, E, or F.
Let's move to the third most significant bit, if it is 0 means A is encountered, else in case of 1, we can have C, E or F. So like this we can tally for all the characters and we will always find that no codeword is prefix to any other codeword.
Let take and example: 0000011000001001101
We can break it into 00 00 01100 00 010 01101
i.e G G C G A C = GGCGAC
Ans v) Refer to the following flowchart.
If we find a character, we stop and start with the c0, in search of new character.
I haven't used that box representation, as it could made difficult in understanding due to various lines.
Ans vi)
Eg. "011001100011011001011"
Using the above algorithm/ flowchart break it into the characters.
Let's go for first character
c0 = 0 (0)
c1 = 1 (01)
c2 = 1 (011)
c3 = 0 (0110)
c4 = 0 (01100) (E found)
Check for all the characters in the same manner.
So input will break in 011001100011011001011
01100 11 00 01101 10 010 11
E D G C B A D
String is EDGCBAD