How File Compression Works
Posted :: Mar 1, 2005 by Impact

Introduction:
While decompressing a few files I downloaded last night, I began to wonder a bit about how file compression works? As we all know, "zipped" files are normally smaller than their original counterparts, but why and how? I did a bit of research and decided to share my findings with the rest of you.

To ruin all of the surprise, file compression works by eliminating excess information. It finds repeating patterns of some kind and indexing them. The best way to explain this is to look at an example. Although actual compression algorithms are far more complex than what is presented here (more to come on actual algorithms later), the example should give you an idea of the concept.

Basic Compression:
Take the following quote:
"As far as the laws of mathematics refer to reality, they are not certain; and as far as they are certain, they do not refer to reality." - Albert Einstein

If you take the quote at face value it has 27 words, which are comprised of 135 characters which include letters, spaces, punctuation marks. By assuming each character takes up one unit of memory, this sentence would take up 135 units of memory. The next step is to look for repetition. If capitalization is ignored, the following words are repeated (with their number of repetitions in parenthesis): as (4), far (2), refer (2), to (2), reality (2), they (3), are (2), not (2), certain (2).

Take a look at what happens to the sentence if the repeated words are indexed by numbers, and then replaced? Our index would look something like:
1: as
2: far
3: refer
4: to
5: reality
6: they
7: are
8: not
9: certain

After indexing, the sentence would look something like: "1 2 1 the laws of mathematics 3 4 5, 6 7 8 9; and 1 2 1 6 7 9, 6 do 8 3 4 5."

Based on the previous assumption that every character takes one unit of memory, our new (compressed) sentence only needs 76 units. This is a pretty big improvement from the original 135 (56% of the original size). However, the index must be included with the new compressed sentence. Taking into account the numbers and words of the index into the file size, it would take 45 units of memory just to hold the index which bumps the total to 76 + 45 = 121 memory units. The new compressed file size is 121 memory units which is slightly smaller than the original 135, but only a 10% compression has been achieved.

By now you should have a good idea of the basic process of compression. The compressor will look through a file to find common sections, index those sections, and then fill in the original with the indexed version. Imagine if the input sentence from the above example had been a paragraph which would probably have many more repeated words, and therefore more repetition, which usually results in more the compression.


Comments :: 3
View Comments
Slightly More Advanced   Next Page
ATI Catalyst 4.8 & 4.9, Omega,...
Movie Review :: Van Helsing
PC Game Review :: 3d Striptease Demo
CoolerMaster AquaGate-Mini Review
Rumor: WoW Expansion Details Leaked
  Hottest Threads:
Ask Naniipo

  Most Posts Today:
  Most Posts Ever: Anonymous

Copyright 2004-2005 Absolute Insight