CS301 Data Structures: Assignment 7-Huffman Encoding

Instructor: Sohail Aslam
Assigned: January 15, 2004
Due Date: Phase 1: January 23, 2004
                   Phase 2: January 28, 2004

Goal of the Assignment

Work with Huffman Encoding for text compression, work with priority queue ADT, work with STL (Standard Template Libaray) bitset ADT.

Files Provided

  1. Heap.h, heap.cpp: files for the Heap ADT.
  2. TreeNode.cpp: TreeNode class for nodes in the Huffman tree. The node stores a character and its frequency.
  3. bitset.cpp: demonstrates use of the STL bitset class and command line processing.

Assignment Tasks  (70 points total) 

Write a program to create a Huffman code for a message. The program should read in an ASCII datafile specified on the command line. The file will contain the message to be encoded. For example

                             c:\> huffman input.dat

The assignment is to be completed in two phases.

Phase 1 Tasks

  1. (5 pts) Read the message to be encoded and compute character frequencies for each character in the message (include the EOF character). Store these frequencies in an array indexed by each character's ASCII code. Print out each character with its frequency.
  2. (20 pts) Create a priority queue prioritized by the smallest frequency for the characters in the message. The priority queue is to be implemented using the heap ADT. The code has been covered in the lectures. Using the priority queue, create a Huffman tree for the message. Your program should print the tree out. Printing a binary horizontally was done in the last practice assignement.
  3. (15 pts) Using the generated Huffman tree, traverse it and create and print out a code table for the given input message below: 

    Manny's mom and dad are Huffman encoding experts.

    You must include the end of line character. Your table would be different; the output format should be similar to the table below.

                       CODE TABLE
    CHARACTER ASCII FREQUENCY BIT PATTERN CODELENGTH (# of bits)
    ------------------------------------------------------------
    NL 10 1 110011 6
    32 7 100 3
    ' 39 1 101111 6
    . 46 1 111000 6
    H 72 1 00101 5
    M 77 1 110110 6
    a 97 5 010 3
    c 99 1 101100 6
    d 100 4 1010 4
    e 101 4 1111 4
    f 102 2 11101 5
    g 103 1 00010 5
    i 105 1 110111 6
    m 109 3 0011 4
    n 110 6 011 3
    o 111 2 11000 5
    p 112 1 101101 6
    r 114 2 11010 5
    s 115 2 0000 4
    t 116 1 111001 6
    u 117 1 00100 5
    x 120 1 101110 6
    y 121 1 00011 5

Phase 2 Tasks

  1. (15 pts) Using the code table, write an encode function which encodes the message in a bitset (a class in the STL which provides for bit processing) and print the encoded message out as a bit pattern.

    50 characters encoded in 216 bits, 4.320000 bits per char. (your's may be different)

  2. (15 pts) Write a decode function that decodes the bit patterns back into the original characters using the Huffman tree:
    DECODED MESSAGE: Manny's mom and dad are Huffman encoding experts.

Huffman Binary Tree

The Huffman tree is a binary tree (not a binary search tree). The leaf nodes contain the characters that make up the message; each leaf node contains one character. The node also contains the frequency of the character. The internal nodes contain only frequencies. The frequency in an internal node is equal to the sum of the frequencies of its left and right children nodes. The class TreeNode provided with this assignment has all the necessary mechanism to generate tree nodes, both leaf and internal, for the Huffman binary tree.

The tree nodes, leaf and internal, will be placed in a priority queue as the tree building algorithm proceeds. The priority queue will be implemented using the heap ADT. Code for heap ADT has been provided. The heap ADT in the files provided can contain pointer to tree nodes, i.e.,

  // create a tree node with the ith character of the message and its frequency
  TreeNode* node = new TreeNode(msg[i], freq[i]);
  heap.insert( freq[i], node );
   ....
   ....
  // insert a new internal node
  TreeNode* n1 = heap.removeMin();
  TreeNode* n2 = heap.removeMin();
  int sumOfFrequencies = n1->getFrequency()+n2->getFrequency();
  TreeNode* internalNode = new TreeNode(n1->getCharacter(), sumOfFrequencies);
  internalNode->setLeft( n1);
  internalNode->setRight( n2);
  heap.insert( sumOfFrequencies, internalNode );
Eventually, there will be only one tree node in the priority queue (heap). This will be the root of the Huffman tree. Standard tree traversal methods can be used to visit the tree.

STL bitset

The Huffman codes for each character in the message will be series of 1's and 0's. These can be stored in a character array but this way each 1 or 0 will take up 8 bits; a character is 8 bits (byte) long. This would be wasteful. In order to store 1's and 0's, we can use the bitset class provided by the Standard Template Library (STL).

The STL class std::bitset provides a container for working with bits. It can be thought of as an array where each element is one bit long. Another way to think of bitset is as an integer, each of whose bits you can access individually — except it’s an integer that isn’t restricted to the size of a long. The size of a bitset is determined at compile time (the number of bits is a template parameter), but there is no upper limit: a bitset<32> is 32 bits long, and a bitset<1000> is 1000 bits.

All of the integer bitwise operations that you’re already used to are still available with bitset, and, for the sake of convenience, a few more as well. So, for example, you can perform bitwise-xor by writing b1 ^ b2 (at least if b1 and b2 are the same length). There are two different interfaces for manipulating individual bits: you can set the nth bit with b.set(n), clear it with b.reset(n), and test it with if (b.test(n)); or, almost equivalently, you can think of a bitset as an array and write these same operations as b[n] = true, b[n] = false, and if (b[n]). (“Almost,” because there is one small difference: the array version isn’t range checked, but the set/reset/test version is. If the argument you pass to set, reset, or test is too large, you’ll get an out_of_range exception.)

The zeroth bit is defined to be the least significant bit, so, for example, if you write:

std::bitset<4> b(0xA);

the bits that are set are b[1] and b[3].

bitset provides the usual I/O operators. This program,

#include <bitset>
#include <iostream>

int main() {
std::bitset<12> b(3432);
std::cout << "3432 in binary is "
<< b << std::endl;
}
gives the obvious result:

       3432 in binary is 110101101000.

The input operator works the same way: it reads a string of ones and zeros, converting them to a bitset.

What to turn in

Turn in your DevCpp project: the .dev file and all .cpp and .h files. We will build the .exe and run it on test data to see if it works correctly.