KWIC Index
A KWIC (Keyword In Context) index is a simple way to display a list of
titles by the words they contain. For instance, given the input:
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb
Pirates of the Caribbean: The Curse of the Black Pearl
The Curse of the Werewolf
the program should produce:
1. Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb
2. Pirates of the Caribbean: The Curse of the Black Pearl
3. The Curse of the Werewolf
   2|e Caribbean: The Curse of the Black Pearl
   1|to Stop Worrying and Love the Bomb
   2|               Pirates of the Caribbean: The Curse of the Black Pearl
   2|Pirates of the Caribbean: The Curse of the Black Pearl
   3|                          The Curse of the Werewolf
   1|          Dr. Strangelove or: How I Learned to Stop Worrying and Love th
   1|    Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb
   1| Learned to Stop Worrying and Love the Bomb
   2|bbean: The Curse of the Black Pearl
   2|                              Pirates of the Caribbean: The Curse of the
   1|angelove or: How I Learned to Stop Worrying and Love the Bomb
   1|                          Dr. Strangelove or: How I Learned to Stop Worr
   3|             The Curse of the Werewolf
   1|ove or: How I Learned to Stop Worrying and Love the Bomb
The program echoes the input list of titles, with index numbers from 1,
then produces the index you see: one line per key word, with each title
shifted so that the key word starts in the same column near the center,
and the lines ordered alphabetically by key word.
Specifically, a word here is made of alphabetic characters, and
we ignore words two characters long or less, and also ignore “the” and
“and”. The words are sorted case-insensitively, though this example does
not particularly show that.
Details
Your program should:
- Read titles as lines of input from standard input.
- Echo these lines with numbers starting from 1.
- Generate a KWIC index in which each line consists of a title number
and the title, with the title shifted so that one of the key words
starts at the index position.
- The index position is 35, zero-based counting from the left.
- The number should take four positions (0 to 3), with a
separator character in position 4.
- The title should be located in positions 5 to 76. Any characters
which would fall left or right of these will be omitted.
- The titles should be ordered alphabetically by the word in the
index column. Order should be case-insensitive.
- A word is a maximal contiguous group of alphabetic characters.
- Generate an index line for each word of each title, except words
two characters or shorter and the words “and” and “the”.
Approaches
There are several ways to write this program. My solution follows these
broad steps:
- Loop through the lines of input to read each title, and add them
to a container. I used a vector.
- Loop through each title. For each title, find each word (see more
below) and place each word found into an index. The index must hold each
word, and a record saying which title contains it and the position
within the title.
- Print the numbered list of titles.
- Print the contents of the index in proper order, aligning each
title as described above.
The first two steps can be two successive loops, or they can be
done in a single loop.
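As a rough sketch of the first and third steps, assuming the titles are
kept in a vector: the function names read_titles and numbered_titles are
made up for illustration, and the "1. Title" form is copied from the
sample output above.

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: read every line of `in` as one title.
std::vector<std::string> read_titles(std::istream &in) {
    std::vector<std::string> titles;
    std::string line;
    while (std::getline(in, line)) {
        titles.push_back(line);
    }
    return titles;
}

// Hypothetical helper: build the numbered echo of the titles,
// one "N. title" line per title, with N counted from 1.
std::string numbered_titles(const std::vector<std::string> &titles) {
    std::ostringstream out;
    for (std::size_t i = 0; i < titles.size(); ++i) {
        out << (i + 1) << ". " << titles[i] << '\n';
    }
    return out.str();
}
```

In a real program you would probably call read_titles(std::cin) and then
send the result of numbered_titles to std::cout.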
If you keep a vector of the titles, your
index can be a map from each word to a list of its locations, like this:
map<string,list<pair<int,int>>>
The key string is the word in lower case, and the data is a list of pairs
of integers.
The first of each pair is the subscript of the titles vector, and the
second is the location of the word inside the title.
For instance, in the above example, the entry for the key black
is a singleton list containing the pair 1, 43. For the key curse,
there is a list of two pairs: 1, 30 and 2, 4.
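One possible way to fill such a map is sketched here. It is only an
illustration under the rules in the Details section, not the only way to
do it; the names Index, is_letter and build_index are invented for the
example.

```cpp
#include <cctype>
#include <list>
#include <map>
#include <string>
#include <utility>
#include <vector>

// word (lower case) -> list of (title subscript, position in title)
using Index = std::map<std::string, std::list<std::pair<int, int>>>;

// Hypothetical helper: true for the characters that make up a word.
bool is_letter(char c) {
    return std::isalpha(static_cast<unsigned char>(c)) != 0;
}

// Hypothetical helper: scan every title and record each key word.
Index build_index(const std::vector<std::string> &titles) {
    Index index;
    for (int t = 0; t < static_cast<int>(titles.size()); ++t) {
        const std::string &title = titles[t];
        const int n = static_cast<int>(title.size());
        int i = 0;
        while (i < n) {
            if (!is_letter(title[i])) { ++i; continue; }
            const int start = i;          // first letter of this word
            std::string word;
            while (i < n && is_letter(title[i])) {
                word += static_cast<char>(
                    std::tolower(static_cast<unsigned char>(title[i])));
                ++i;
            }
            // Skip short words and the two excluded words.
            if (word.size() > 2 && word != "and" && word != "the") {
                // operator[] creates an empty list the first time.
                index[word].push_back(std::make_pair(t, start));
            }
        }
    }
    return index;
}
```

Because the key is already lower case, the map's normal ordering gives
the case-insensitive order the assignment asks for.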
After filling the index map, go through it using a double loop,
outer for the entries (words), then inner through the list associated with
each key. The loop body then prints the title line number (subscript
plus 1), and then formats the title to line up the word position
correctly. The Word Counter example shows how to use the
IO manipulators to print a number in a specified width. To align the
title, I used substring to separate it into the left part and the
right part, then used substring again to print the needed portion, or
generated blank padding as required.
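The alignment step might look something like the following sketch. The
column constants come from the Details section and the '|' separator
from the sample output; format_line and render_index are hypothetical
names, and the code assumes the word position lies inside the title.

```cpp
#include <iomanip>
#include <list>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using Index = std::map<std::string, std::list<std::pair<int, int>>>;

const int NUMBER_WIDTH = 4;   // number occupies positions 0 to 3
const int TITLE_START  = 5;   // first title position
const int TITLE_END    = 76;  // last title position (inclusive)
const int INDEX_POS    = 35;  // column where the key word starts

// Hypothetical helper: one index line for the word at word_pos.
std::string format_line(int number, const std::string &title, int word_pos) {
    const std::size_t left_width  = INDEX_POS - TITLE_START;     // 30
    const std::size_t right_width = TITLE_END - INDEX_POS + 1;   // 42

    std::string left  = title.substr(0, word_pos);
    std::string right = title.substr(word_pos, right_width);  // clipped right

    if (left.size() > left_width) {
        left = left.substr(left.size() - left_width);         // clipped left
    } else {
        left = std::string(left_width - left.size(), ' ') + left;  // padded
    }

    std::ostringstream out;
    out << std::setw(NUMBER_WIDTH) << number << '|' << left << right;
    return out.str();
}

// The double loop: outer over words, inner over each word's locations.
std::vector<std::string> render_index(const std::vector<std::string> &titles,
                                      const Index &index) {
    std::vector<std::string> lines;
    for (const auto &entry : index) {
        for (const auto &loc : entry.second) {
            lines.push_back(format_line(loc.first + 1, titles[loc.first],
                                        loc.second));
        }
    }
    return lines;
}
```

For the Pirates title, format_line(2, title, 0) pads the left side with
30 blanks so that Pirates starts in column 35, and cuts the title off
after "of the" at column 76.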
Some alternatives use a different sort of index. For instance, I also
solved the problem with a C++ multimap from a string to a pair instead of
the map from string to list of pairs.
multimap<string, pair<int,int>>
Since multimap allows keys to repeat, you just insert them multiple times
instead of needing the list. The printing now needs only a single loop.
Multimap is a bit different, though. Since keys are not
unique, it does not have subscripting or an at method.
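A small sketch of working with such a multimap follows. The helper names
are invented for the example. It uses emplace to insert, a single
range-for loop to walk the index, and equal_range as one way to look up
a key without subscripting.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

using MultiIndex = std::multimap<std::string, std::pair<int, int>>;

// Insert one occurrence; a repeated key is simply stored again.
void add_occurrence(MultiIndex &index, const std::string &word,
                    int title, int position) {
    index.emplace(word, std::make_pair(title, position));
}

// With no operator[] or at(), equal_range gives the run of entries
// that share one key.
std::vector<std::pair<int, int>> occurrences_of(const MultiIndex &index,
                                                const std::string &word) {
    std::vector<std::pair<int, int>> found;
    auto range = index.equal_range(word);
    for (auto it = range.first; it != range.second; ++it) {
        found.push_back(it->second);
    }
    return found;
}

// A single loop visits every entry in key order.
std::vector<std::string> words_in_order(const MultiIndex &index) {
    std::vector<std::string> words;
    for (const auto &entry : index) {
        words.push_back(entry.first);
    }
    return words;
}
```

Since C++11, entries with equal keys stay in the order they were
inserted, so titles indexed in order come back out in order.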
The map or multimap is an ordered structure, so it puts the keys in order
for you. I was also able to solve the problem using a vector with one
entry for each word found, which needed to be sorted before printing.
This is easily handled by the standard sort function. The items must hold
the word, title subscript, and word location. This could be done with a
pair combining a string and a pair of integers, but I used a C++
tuple, which is much like a pair, but can have any number of entries.
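A sketch of the tuple version, with the caveat that the Entry alias and
the accessor names are made up here. It relies on the fact that tuples
compare element by element, so std::sort with no extra arguments orders
entries by word, then title, then position.

```cpp
#include <algorithm>
#include <string>
#include <tuple>
#include <vector>

// (lower-case word, title subscript, position of the word in the title)
using Entry = std::tuple<std::string, int, int>;

// Sort by word first; ties fall back to title subscript, then position.
void sort_entries(std::vector<Entry> &entries) {
    std::sort(entries.begin(), entries.end());
}

// Hypothetical accessors: std::get<N> picks out the N-th element.
const std::string &word_of(const Entry &e) { return std::get<0>(e); }
int title_of(const Entry &e)               { return std::get<1>(e); }
int position_of(const Entry &e)            { return std::get<2>(e); }
```

Storing the word in lower case is what makes the plain comparison
case-insensitive.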
The titles could alternatively be put into a list, and the index could
retain an iterator, but you would probably also have to retain the number
of the title, since you can't find it out from the iterator, as you can
from the vector subscript.
General Advice
The program reads in titles one per line, and must find the
words in the title. Here is a small program with a function to do that.
The find_word function takes a string and returns a list of pairs giving
the positions of the start of each word and of the character following
it. Run it to see.
find_word uses the C++ standard pair utility object. This is
the same thing that a map contains: just two data items creatively
called first and second. As you can see in that code, the
easiest way to create one is to use the make_pair function, which
simply takes the needed values and returns the pair. Then, when you
have one, simply refer to the first and second attributes
to recover each part.
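For reference, a function matching that description could be written
roughly as follows. This is a sketch of the idea, not necessarily the
same code as the program mentioned above.

```cpp
#include <cctype>
#include <list>
#include <string>
#include <utility>

// Return, for each word in s, the pair (start position, position of
// the character just after the word). A word is a maximal run of
// alphabetic characters.
std::list<std::pair<int, int>> find_word(const std::string &s) {
    std::list<std::pair<int, int>> words;
    const int n = static_cast<int>(s.size());
    int i = 0;
    while (i < n) {
        // Skip anything that is not a letter.
        while (i < n && !std::isalpha(static_cast<unsigned char>(s[i]))) {
            ++i;
        }
        if (i >= n) break;
        const int start = i;
        // Advance to the end of the word.
        while (i < n && std::isalpha(static_cast<unsigned char>(s[i]))) {
            ++i;
        }
        words.push_back(std::make_pair(start, i));
    }
    return words;
}
```

Given a pair p from the list, s.substr(p.first, p.second - p.first)
recovers the word itself.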
C++ has a tolower that returns the lower-case version of
a character, but it does not (for some reason) have one to operate on
a string. However, you can down-case a string (call it s) using
this one simple trick:
for(auto &c: s) { c = tolower(c); }
It won't work if you forget the &.
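If you would rather have a function that returns a lower-case copy, the
same loop can be wrapped up as in this sketch. The name to_lower is made
up, and the casts are an addition: std::tolower expects a value that
fits in an unsigned char, so the cast keeps it safe for characters
outside plain ASCII.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: s is passed by value, so the caller's string
// is left alone and the modified copy is returned.
std::string to_lower(std::string s) {
    for (auto &c : s) {   // the & makes c refer to the character in s
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return s;
}
```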
When your program works, and is well-commented and properly indented,
submit it over the web here.