3_3_8 Data compression

You should be able to:

  • Explain what data compression is.

  • Understand why data may be compressed and that there are different ways to compress data.

  • Explain how data can be compressed using Huffman coding.

  • Be able to interpret/create Huffman trees.

  • Be able to calculate the number of bits required to store a piece of data compressed using Huffman coding.

  • Be able to calculate the number of bits required to store a piece of uncompressed data in ASCII.

  • Explain how data can be compressed using run length encoding (RLE).

  • Represent data in RLE frequency/data pairs.

REVISE:

This video is a very comprehensive overview of compression. Remember that you only need to know about RLE and Huffman for your examination.

What is data compression and why is it needed?

Data compression means to reduce the file size. We need to compress files to make it easier to store data and transfer it across devices or networks.

We can use Lossy or Lossless compression.

Lossy - removes data to reduce the file size, the removed data is then lost.

Lossless - shrinks the whole file whilst maintaining the quality, the file can then be restored to its original state.

A common example of lossy compression is taking a Bitmap image and using jpeg compression.

A bitmap image records each individual pixel.

A jpeg records each group of similar pixels.

A jpeg is lossy because it compresses groups of similar pixels. If there is a subtle shade of grey that is very close to black then it will be grouped with the other black colours. The image below shows an example of a bitmap (left) after it has had jpeg compression (right) applied.

Jpeg compression is typically very subtle and undetectable to the human eye.

Run Length Encoding

Run Length Encoding (RLE) is a lossless compression algorithm.

If we take some data as an example:

111000111111100000111

RLE will create data pairs to reduce the file size and avoid repetition.

The data pairs for the above data would look like this:

3 1 3 0 7 1 5 0 3 1

So we are saying... 3 1's, 3 0's, 7 1's, 5 0's and 3 1's.

RLE is simple to do but it is most effective when there is lots of repetition of data.

Huffman Coding

If we had a piece of text that was created using the ASCII character set and contained 4000 characters, what would the file size be?

file size in bits = (7 bits per character * 4000 characters)

28,000 bits or 3,500 bytes or 3.5 kilobytes.

This isn't too bad but if we had to store the characters written on Wikipedia it would take up a lot of storage space!

There is a lot of repetition in a text file because we repeat letters from the alphabet, spaces and other symbols quite a lot. There are also more popular letters like the vowels (a,e,i,o,u).

David Huffman noticed this pattern and created the Huffman Coding Algorithm.

Watch the video below to see how it works!

(remember that you can pause, stop, rewind and #panic at any point!)

TEST:

  1. Download and print the test paper here: https://drive.google.com/open?id=0B5fLtQ0Xgr2PczlTUVFseGs3MGc

  2. Try the mock test yourself.

  3. Use the 3.3.8 Walking Talking Mock below to guide you through answering the questions.

SOURCE RECOGNITION - PLEASE NOTE: The examination examples used in these walking talking mocks are samples from AQA from their non-confidential section of the public site. They also contain questions designed by TeachIT for AQA as part of the publicly available lesson materials.