You Don't Need to Repeat Yourself
The project is a program that provides a slight bit of
compression of an English text file. The simple technique
replaces repeated words with short codes to produce another text file.
It can provide some modest compression, on the order of 10%.
For instance, consider this input:
It was a good day for snails. I saw snails everywhere, especially on the
banks of the river. The river bank was muddy and apparently filled with
whatever snails eat. An especially slimy day at the river.
The lame compression program translates this to:
It was a good day for snails. I saw #1 everywhere, especially on the
banks of @% river. The #5 bank @! muddy and apparently filled with
whatever #1 eat. An #3 slimy @" at @% #5.
Obviously, the program uses @ and # as special characters, and
we assume the original file does not contain either of them. Our
fancy algorithm compresses the data from 207 bytes to 181, saving a
whopping 12.6%. The program remembers certain words when they are first seen,
and outputs their replacement wherever they appear again. The word
snails appears, is remembered, and is replaced with #1
when it appears later. The word the is seen, then later
replaced with @%.
The input is divided into groups of characters, alternating
alphanumeric and non-alphanumeric. For instance, the first
few groups in the input file are:
It was
a good day
for snails . I
saw snails
everywhere , especially
and so forth. Where there is a line break, the line break becomes part
of a non-alphanumeric group, along with any adjacent non-alphanumeric characters.
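To make the grouping concrete, here is a minimal sketch of one way to split a string into alternating groups. The name splitGroups is made up for illustration; it is not the reader function from readnext.cpp, which may work differently (for instance, by reading directly from a stream).

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not the reader from readnext.cpp): split text into
// maximal runs of alphanumeric and non-alphanumeric characters.
std::vector<std::string> splitGroups(const std::string& text) {
    std::vector<std::string> groups;
    std::size_t start = 0;
    while (start < text.size()) {
        // The class of the first character decides the class of the whole group.
        bool alnum = std::isalnum(static_cast<unsigned char>(text[start])) != 0;
        std::size_t end = start + 1;
        // Extend the group while the following characters are of the same class.
        while (end < text.size() &&
               (std::isalnum(static_cast<unsigned char>(text[end])) != 0) == alnum) {
            ++end;
        }
        assert(end > start);  // every group holds at least one character
        groups.push_back(text.substr(start, end - start));
        start = end;
    }
    return groups;
}
```

Because a newline is just another non-alphanumeric character here, a period followed by a line break comes out as a single group, as described above.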
The rules for substitution are:
- Non-alphanumeric groups are simply echoed and not substituted.
- Alphanumeric groups of fewer than three characters
are simply echoed and not substituted.
- Alphanumeric groups of more than three characters are replaced with
a #n construct, where n is a counter starting from zero. That is,
the first such group encountered is #0, the second #1, etc.
- Alphanumeric groups of exactly three characters are replaced with
two-character substitutions of the form @x, where x is any
printable ASCII character. Characters are chosen in order from
'!' to '~', that is, from ASCII value 33 to 126. This
allows the assignment of 94 replacements; any new three-character
alphanumeric group first seen after those 94 are assigned is simply echoed.
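The two code formats in the rules above come down to a little arithmetic. The sketch below shows one way to build them; longCode and shortCode are hypothetical names chosen for illustration, not part of the provided programs.

```cpp
#include <cassert>
#include <string>

// Hypothetical helpers illustrating the two code formats described above.

// Code for the n-th distinct long word (more than three characters),
// counting from zero: #0, #1, #2, ...
std::string longCode(int n) {
    assert(n >= 0);
    return "#" + std::to_string(n);
}

// Code for the k-th distinct three-character word, counting from zero:
// '@' followed by the ASCII character with value 33 + k.
// Only k = 0..93 is valid, since 33 + 93 = 126 is '~', the last one allowed.
std::string shortCode(int k) {
    assert(k >= 0 && k < 94);
    return std::string("@") + static_cast<char>(33 + k);
}
```

In the sample input, the three-character words first appear in the order was, day, for, saw, the. So the is number 4 counting from zero, and 33 + 4 = 37 is '%', which is why it turns into @%.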
You are given two programs. One is
the decompress program, which will
restore the original file from the slightly compressed one.
The other is readnext.cpp, which
is a good starting place for the project. It contains a reader function
which will break up the input into groups, and some
code to read the file. Presently, it just prints the groups of the
input file.
You will want to use two C++ maps to keep track of the substitutions.
As you see each replaceable unit, see if you have a substitution for it.
If so, print the substitution. If not, echo the word, and enter its
future substitution into the appropriate map.
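Here is a rough sketch of that lookup-then-remember loop. It assumes the groups have already been collected into a vector of strings, which is not how readnext.cpp hands them to you, and compressGroups is a made-up name; treat it as an outline of the idea rather than the expected solution.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch: `groups` is assumed to hold the groups of the input
// file, in order, alternating alphanumeric and non-alphanumeric.
std::string compressGroups(const std::vector<std::string>& groups) {
    std::map<std::string, std::string> shortCodes;  // three-character words -> @x
    std::map<std::string, std::string> longCodes;   // longer words -> #n
    std::string out;
    for (const std::string& g : groups) {
        assert(!g.empty());  // assumed: the reader never yields an empty group
        bool alnum = std::isalnum(static_cast<unsigned char>(g[0])) != 0;
        if (!alnum || g.size() < 3) {
            out += g;  // separators and very short words are echoed
            continue;
        }
        // Pick the map that matches the length of the word.
        auto& codes = (g.size() == 3) ? shortCodes : longCodes;
        auto found = codes.find(g);
        if (found != codes.end()) {
            out += found->second;  // seen before: print the substitution
            continue;
        }
        out += g;  // first sighting: echo the word, remember its future code
        if (g.size() > 3) {
            std::string code = "#" + std::to_string(codes.size());
            codes[g] = code;
        } else if (codes.size() < 94) {
            std::string code = std::string("@") + static_cast<char>(33 + codes.size());
            codes[g] = code;
        }
        // Otherwise all 94 short codes are taken, so the word is never remembered.
    }
    return out;
}
```

The size of each map doubles as the counter for the next code, so no separate counters are needed. Each code is built before the map entry is created so that the size is read before it changes.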
One More Input Thing
Here is a short story by O. Henry which can serve as a longer
input. Our fancy scheme provides something like 9.7% compression of it.
Submission
Don't forget to comment your code.
When your program works well and looks nice,
submit over the web here.