CSc 220 Assignment 3

You Don't Need to Repeat Yourself

Assigned: Sep 19
Due: Oct 8
90 pts

In this project you will write a program that provides a slight bit of compression of an English text file. The technique is simple: repeated words are replaced with short codes to produce another text file. It can provide some modest compression, on the order of 10%.

For instance, consider this input:
It was a good day for snails. I saw snails everywhere, especially on the banks of the river. The river bank was muddy and apparently filled with whatever snails eat. An especially slimy day at the river.
The lame compression program translates this to:
It was a good day for snails. I saw #1 everywhere, especially on the banks of @% river. The #5 bank @! muddy and apparently filled with whatever #1 eat. An #3 slimy @" at @% #5.
Obviously, the program uses @ and # as special characters, and we assume the original file does not contain either of them. Our fancy algorithm compresses the 207 bytes of data into 181, saving a whopping 12%. The program remembers certain words when they are first seen, and outputs their replacement wherever they appear again. The word snails appears, is remembered, and is replaced with #1 when it appears later. The word the is seen, then later replaced with @%.
The input is divided into groups of characters, alternating alphanumeric and non-alphanumeric. For instance, the first few groups in the input file are:
It was a good day for snails . I saw snails everywhere , especially
and so forth. Where there is a line break, the line break becomes part of a non-alphanumeric group, along with any adjacent non-alphanumeric characters. The rules for substitution are:
  1. Non-alphanumeric groups are simply echoed and not substituted.
  2. Alphanumeric groups of fewer than three characters are simply echoed and not substituted.
  3. Alphanumeric groups of more than three characters are replaced with a #n construct, where n is a counter starting from zero. That is, the first such group encountered is assigned #0, the second #1, etc.
  4. Alphanumeric groups of three characters are replaced with two-character substitutions of the form @x, where x is any printable ASCII character. Characters are chosen in order from '!' to '~', that is, from ASCII value 33 to 126. This allows the assignment of 94 replacements; any new three-character alphanumeric group that appears after the first 94 have been assigned is simply echoed.
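
As a rough sketch of how the codes in rules 3 and 4 could be generated, consider the two small functions below. The names longCode and shortCode are made up for illustration and are not part of the supplied programs. Returning an empty string to mean "no code left" is also just an assumption of this sketch.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: code for the n-th distinct group of more than
// three characters, counting from zero (rule 3).
std::string longCode(std::size_t n) {
    return "#" + std::to_string(n);
}

// Hypothetical helper: code for the k-th distinct three-character group,
// counting from zero (rule 4). Only 94 codes exist, '!' (33) through
// '~' (126). An empty string means the codes are used up, so the group
// should simply be echoed.
std::string shortCode(std::size_t k) {
    const std::size_t available = '~' - '!' + 1;  // 94 printable characters
    if (k >= available) {
        return "";
    }
    return std::string("@") + static_cast<char>('!' + k);
}
```

In the snails example, was is the first three-character group seen and the is the fifth, which matches shortCode(0) giving @! and shortCode(4) giving @%.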

You are given two programs. One is the decompress program, which will restore the original file from the slightly compressed one. The other is readnext.cpp, which is a good starting place for the project. It contains a reader function which will break up the input into groups, and some code to read the file. Presently, it just prints the groups of the input file.
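
The reader in readnext.cpp is not reproduced here. Purely to illustrate what "breaking the input into groups" means, here is a minimal sketch that works on a string already in memory. The function name splitGroups is hypothetical, and the supplied reader may well be organized differently (for instance, reading straight from a stream).

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a group reader: splits text into maximal runs
// of alphanumeric and non-alphanumeric characters, in their original order.
std::vector<std::string> splitGroups(const std::string& text) {
    std::vector<std::string> groups;
    std::size_t i = 0;
    while (i < text.size()) {
        // Cast to unsigned char: std::isalnum is undefined for negative values.
        const bool alnum = std::isalnum(static_cast<unsigned char>(text[i])) != 0;
        std::size_t j = i;
        // Extend the run while characters stay in the same class.
        while (j < text.size() &&
               (std::isalnum(static_cast<unsigned char>(text[j])) != 0) == alnum) {
            ++j;
        }
        groups.push_back(text.substr(i, j - i));
        i = j;
    }
    return groups;
}
```

Given the text "saw snails.\nI", this yields five groups: saw, a single space, snails, a period followed by the line break, and I. The line break lands in the same group as the period, as described above.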

You will want to use two C++ maps to keep track of the substitutions. As you see each replaceable unit, see if you have a substitution for it. If so, print the substitution. If not, echo the word, and enter its future substitution into the appropriate map.
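
The look-up-or-remember step might be sketched as follows, shown only for the map of longer words. The function name emitLong and the choice to store the finished "#n" string as the map value are assumptions of this sketch, not requirements. The three-character map would follow the same pattern with its own codes and its 94-entry limit.

```cpp
#include <map>
#include <string>

// Hypothetical helper: returns what to print for one group of more than
// three characters, and updates the table of remembered substitutions.
std::string emitLong(std::map<std::string, std::string>& table,
                     const std::string& word) {
    auto it = table.find(word);
    if (it != table.end()) {
        return it->second;  // seen before: print its substitution
    }
    // First sighting: build the next #n before inserting, so n equals the
    // number of words remembered so far (0 for the first one).
    const std::string code = "#" + std::to_string(table.size());
    table.emplace(word, code);
    return word;  // echo the word itself this time
}
```

Starting from an empty map, passing good, then snails, then snails again returns good, snails, and #1, in that order.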

One More Input Thing

Here is a short story by O. Henry which can serve as a longer input. Our fancy scheme provides something like 9.7% compression of it.

Submission

Don't forget to comment your code.

When your program works well and looks nice, submit over the web here.