CSc 220 Assignment 2

KWIC Index

Assigned: Feb 17
Due: Mar 4
80 pts

A KWIC (Keyword In Context) index is a simple way to display a list of titles by the words they contain. For instance, given the input:
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb
Pirates of the Caribbean: The Curse of the Black Pearl
The Curse of the Werewolf
the program should produce:
1. Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb
2. Pirates of the Caribbean: The Curse of the Black Pearl
3. The Curse of the Werewolf
   2|e Caribbean: The Curse of the Black Pearl
   1|to Stop Worrying and Love the Bomb
   2|               Pirates of the Caribbean: The Curse of the Black Pearl
   2|Pirates of the Caribbean: The Curse of the Black Pearl
   3|                          The Curse of the Werewolf
   1|          Dr. Strangelove or: How I Learned to Stop Worrying and Love th
   1|    Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb
   1| Learned to Stop Worrying and Love the Bomb
   2|bbean: The Curse of the Black Pearl
   2|                              Pirates of the Caribbean: The Curse of the
   1|angelove or: How I Learned to Stop Worrying and Love the Bomb
   1|                          Dr. Strangelove or: How I Learned to Stop Worr
   3|             The Curse of the Werewolf
   1|ove or: How I Learned to Stop Worrying and Love the Bomb
The program echoes the input list of titles, numbered from 1, then produces the index you see: one line per key word, with each title shifted so that the key word starts in the same central column, and with the lines ordered alphabetically by key word. Specifically, a word here is made of alphabetic characters; we ignore words of two characters or fewer, and also ignore “the” and “and”. The words are sorted case-insensitively, though this example does not particularly show that.

Details

Your program should:
  • Read titles as lines of input from standard input.
  • Echo these lines with numbers starting from 1.
  • Generate a KWIC index in which each line consists of a title number and the title, with the title shifted so that one of the key words starts in the index position.
  • The index position is 35, zero-based counting from the left.
  • The number should take four positions (0 to 3), with a separator character in position 4.
  • The title should be located in positions 5 to 76. Any characters which would fall left or right of these will be omitted.
  • The titles should be ordered alphabetically by the word in the index column. Order should be case-insensitive.
  • A word is a maximal contiguous group of alphabetic characters.
  • Generate an index line for each word of each title, except words of two characters or fewer and the words “and” and “the”.
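The last rule above can be written as a small predicate. This is only a sketch: the name is_key_word is made up for illustration, and it assumes the word has already been converted to lower case, so that “The” and “the” are treated alike.

```cpp
#include <string>

// Hypothetical helper: decide whether a word gets its own index line.
// Assumes lower_word has already been down-cased by the caller.
bool is_key_word(const std::string &lower_word) {
    return lower_word.size() > 2      // skip words of two characters or fewer
        && lower_word != "and"        // skip the two excluded words
        && lower_word != "the";
}
```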

Approaches

There are several ways to write this program. My solution follows these broad steps:
  1. Loop through the lines of input to read each title, and add them to a container. I used a vector.
  2. Loop through each title. For each title, find each word (see more below) and place it into an index. The index must hold each word, along with a record of which title contains it and the word's position within the title.
  3. Print the numbered list of titles.
  4. Print the contents of the index in proper order, aligning each title as described above.
The first two steps can be two successive loops, or they can be done in a single loop.

If you keep a vector of the titles, your index can be a map from each word to a list of its locations, like this:

map<string,list<pair<int,int>>>
The key string is the word in lower case, and the data is a list of pairs of integers. The first of each pair is the subscript in the titles vector, and the second is the location of the word inside the title. For instance, in the above example, the entry for the key black is a singleton list containing the pair 1, 43. For the key curse, there is a list of two pairs: 1, 30 and 2, 4.
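Filling such a map can be quite short, because subscripting a map creates an empty list the first time a key is seen. The alias Index and the function add_entry below are names chosen for this sketch only.

```cpp
#include <list>
#include <map>
#include <string>
#include <utility>

// Lower-cased word -> list of (title subscript, position in title).
using Index = std::map<std::string, std::list<std::pair<int, int>>>;

// Record one occurrence of a word. The word is assumed to be lower case already.
void add_entry(Index &index, const std::string &word, int title, int pos) {
    // index[word] creates an empty list on first use, then we append to it.
    index[word].push_back(std::make_pair(title, pos));
}
```

Because the map keeps its keys sorted, iterating over it later visits the words in alphabetical order.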

After filling the index map, go through it using a double loop, outer for the entries (words), then inner through the list associated with each key. The loop body then prints the title line number (subscript plus 1), and then formats the title to line up the word position correctly. The Word Counter example shows how to use the IO manipulators to print a number in a specified width. To align the title, I used substring to separate it into the left part and the right part, then used substring again to print the needed portion, or generated blank padding as required.
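The alignment step can be sketched as a function that builds one index line. The name index_line and the constant names are hypothetical, the '|' separator is taken from the sample output, and the sketch assumes pos is a valid offset into the title. It is one way to do the substring arithmetic, not necessarily the way my solution does it.

```cpp
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Layout constants taken from the Details section.
const int NUM_WIDTH = 4;     // number occupies positions 0 to 3
const int TITLE_START = 5;   // first column available to the title
const int INDEX_POS = 35;    // column where the key word starts
const int TITLE_END = 76;    // last column available to the title

// Build one index line for the word starting at offset pos in title.
std::string index_line(int number, const std::string &title, int pos) {
    const int left_width = INDEX_POS - TITLE_START;      // 30 columns
    const int right_width = TITLE_END - INDEX_POS + 1;   // 42 columns

    std::string left;
    if (pos >= left_width) {
        // Too much text before the word: keep only the last left_width characters.
        left = title.substr(pos - left_width, left_width);
    } else {
        // Not enough text before the word: pad with blanks on the left.
        left = std::string(static_cast<std::size_t>(left_width - pos), ' ')
             + title.substr(0, pos);
    }
    // substr stops at the end of the title if fewer than right_width remain.
    const std::string right = title.substr(pos, right_width);

    std::ostringstream out;
    out << std::setw(NUM_WIDTH) << number << '|' << left << right;
    return out.str();
}
```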

There are alternatives that use a different sort of index. For example, I also solved the problem with a C++ multimap from a string to a pair, instead of the map from string to list of pairs.
multimap<string, pair<int,int>>
Since multimap allows keys to repeat, you just insert them multiple times instead of needing the list. The printing now needs only a single loop. Multimap is a bit different, though. Since keys are not unique, it does not have subscripting or an at method.
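A rough sketch of the multimap version follows. The names MultiIndex, add_multi_entry, and describe are invented here, and describe just builds a plain "word number position" string per entry to show the single loop; the real program would format an index line instead.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Lower-cased word -> (title subscript, position in title); keys may repeat.
using MultiIndex = std::multimap<std::string, std::pair<int, int>>;

// No subscripting on a multimap, so each occurrence is inserted as a key/value pair.
void add_multi_entry(MultiIndex &index, const std::string &word, int title, int pos) {
    index.insert(std::make_pair(word, std::make_pair(title, pos)));
}

// One loop visits every entry in key order.
std::vector<std::string> describe(const MultiIndex &index) {
    std::vector<std::string> lines;
    for (const auto &entry : index) {
        lines.push_back(entry.first + " "
                        + std::to_string(entry.second.first + 1) + " "   // title number
                        + std::to_string(entry.second.second));          // word position
    }
    return lines;
}
```

Since C++11, entries with equal keys stay in the order they were inserted, so repeated words come out in title order if the titles are processed in order.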

The map or multimap is an ordered structure, so it puts the keys in order for you. I was also able to solve the problem using a vector holding an entry for each word, which needed to be sorted before printing. This is easily handled by the standard sort function. Each entry must hold the word, the title subscript, and the word location. This could be done with a pair combining a string and a pair of integers, but I used a C++ tuple, which is much like a pair but can have any number of entries.
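The vector approach might be sketched like this. Entry and sort_entries are names made up for the example. Tuples compare member by member from left to right, so putting the lower-cased word first makes the default comparison sort by word, then by title subscript, then by position.

```cpp
#include <algorithm>
#include <string>
#include <tuple>
#include <vector>

// (lower-cased word, title subscript, position in title)
using Entry = std::tuple<std::string, int, int>;

// Sort entries into index order using the tuple's built-in comparison.
void sort_entries(std::vector<Entry> &entries) {
    std::sort(entries.begin(), entries.end());
}
```

To read the parts back out, use std::get<0>(e), std::get<1>(e), and std::get<2>(e) on an entry e.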

The titles could alternatively be put into a list, and the index could retain an iterator, but you would probably have to also retain the number of the title, since you can't find it out from the iterator, as you can with the vector subscript.

General Advice

The program reads in titles one per line, and must find the words in the title. Here is a small program with a function to do that. The find_word function takes a string and returns a list of pairs giving the position of the start of each word and of the character following it. Run it to see.
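The linked program is not reproduced in this text, so the following is only a guess at what a function with that description could look like. It treats a word as a maximal run of alphabetic characters, as the Details section requires, and uses int positions to match the pair<int,int> shown earlier.

```cpp
#include <cctype>
#include <list>
#include <string>
#include <utility>

// Return (start, one-past-end) positions for each maximal run of letters in s.
std::list<std::pair<int, int>> find_word(const std::string &s) {
    std::list<std::pair<int, int>> words;
    const int n = static_cast<int>(s.size());
    int i = 0;
    while (i < n) {
        // Skip anything that is not a letter.
        while (i < n && !std::isalpha(static_cast<unsigned char>(s[i]))) ++i;
        if (i == n) break;
        const int start = i;
        // Advance to the first character past the word.
        while (i < n && std::isalpha(static_cast<unsigned char>(s[i]))) ++i;
        words.push_back(std::make_pair(start, i));
    }
    return words;
}
```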

find_word uses the C++ standard pair utility object. This is the same thing that a map contains: just two data items creatively called first and second. As you can see in that code, the easiest way to create one is to use the make_pair function, which simply takes the needed values and returns the pair. Then, when you have one, simply refer to the first and second attributes to recover each part.
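As a small illustration of using a pair, here is a hypothetical helper, word_at, that takes a (start, one-past-end) pair of the kind described above and pulls the word out of the title.

```cpp
#include <string>
#include <utility>

// Extract the word described by a (start, one-past-end) pair.
std::string word_at(const std::string &title, const std::pair<int, int> &span) {
    // first is where the word begins; second - first is its length.
    return title.substr(span.first, span.second - span.first);
}
```

For example, word_at("The Curse of the Werewolf", std::make_pair(4, 9)) yields "Curse".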

C++ has a tolower that returns the lower-case version of a character, but it does not (for some reason) have one to operate on a string. However, you can down-case a string (call it s) using this one simple trick:
for(auto &c: s) { c = tolower(c); }
It won't work if you forget the &: without it, c is a copy of each character, and the string itself is left unchanged.
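To see the difference, compare the two loops side by side. The function names are invented for the sketch. The casts through unsigned char are an extra precaution of mine, not part of the one-liner above: tolower is only defined for values that fit in an unsigned char (or EOF), which matters if the input contains non-ASCII bytes.

```cpp
#include <cctype>
#include <string>

// Lower-case s in place; the & makes c refer to each character of s.
void down_case(std::string &s) {
    for (auto &c : s) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
}

// The same loop without the &: c is a copy, so s is left unchanged.
void down_case_broken(std::string &s) {
    for (auto c : s) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        (void)c;  // the modified copy is simply discarded
    }
}
```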

When your program works and is well commented and properly indented, submit it over the web here.