Huffman Encoding

About

Huffman encoding is an algorithm devised by David A. Huffman of MIT in 1952 for compressing text data to make a file occupy a smaller number of bytes. This relatively simple compression algorithm is powerful enough that variations of it are still used today in computer networks, fax machines, modems, HDTV, and other areas.

Normally text data is stored in a standard format of 8 bits per character using an encoding called ASCII that maps every character to a binary integer value from 0-255. (ASCII encoding table) The idea of Huffman encoding is to abandon the rigid 8-bits-per-character requirement and use different-length binary encodings for different characters. The advantage of doing this is that if a character occurs frequently in the file, such as the common letter 'e', it could be given a shorter encoding (fewer bits), making the file smaller. The tradeoff is that some characters may need to use encodings that are longer than 8 bits, but this is reserved for characters that occur infrequently, so the extra cost is worth it.

The four steps involved in Huffman encoding a given text source file into a destination compressed file are:

  1. count character frequencies (buildFrequencyTable): Examine a source file's contents and count the number of occurrences of each character.
  2. build a Huffman encoding tree (buildEncodingTree): Build a binary tree with a particular structure, where each node represents a character and its count of occurrences in the file. A priority queue is used to help build the tree along the way.
  3. build a character encoding map (buildEncodingMap): Traverse the binary tree to discover the binary encodings of each character.
  4. encode the file's data (encodeData): Re-examine the source file's contents, and for each character, output the encoded binary version of that character to the destination file.

Technologies & Concepts

  • C++
  • Binary trees
  • Tree traversal
  • Binary I/O streams
    • Function Declarations
    • Console+File I/O
    • Strings
    • Loops
    • Basic Data Structures: Vector, Grid