CSc 422 Assignment 3

Thread On In

Assigned: Oct 20
Due: Nov 8
85 pts

Note: if you need additional or different load libraries (-l option), look at the LDFLAGS setting at the top of Makefile. You can add any needed -l options here.

This project is a small threading exercise. We'll start with a simple program to guess a password by exhaustive enumeration, and convert it to use threads. This can make the search faster when multiple cores are available. We'll be using C++ 2011 standard threading. It's a nice interface which retains the pattern of the low-level operating system interface, but makes good use of the C++ template and type systems to make it much cleaner than either pthreads or the native win32 interface. We'll use the Unix-style crypt method to encrypt passwords. It will be available on Unix and Linux, and probably on Macs (but see below).

Getting Started

The starting download contains two complete programs: passfile maintains a file of account and encrypted password pairs, and scan1 is a non-threaded password guesser that reads the same format. The file format is simply a series of lines, each containing an account and an encrypted password separated by a colon, the same as the first two fields of a Unix passwd or shadow file. Like this:

    orvel:$y$j9T$AYyr5mapeWr6e5WsqXGvF/$oSsFFFehNzicc0mnyEfha7RvTbedQA8DPRKflVQQet8
    allen:$y$j9T$Exh6Kh7rEYGP1FlqiOnTg1$gYz/UsKLDbulV6TDICH.HxayLVsQ6uYoPdqwA8nDc.8
    subil:$y$j9T$h9eVd48qzyuYITtCaN6q/.$RZGHSCMJdG2T7zVvOcY/5ABpFrYSBF.pyFcHV7KEx85
    frank:$y$j9T$d/Zf7lFBS/DRJ22UIhH3G1$EZPYdw34uvj5nS9bMr3shPdh2UN8VqBxZMoaXRssmF1
    alvin20:$y$j9T$v6LSdjSm7c0KBn7ROyOox0$8w4hIRCledecQ7Eg6K4iXAeh8QM9VBReyUY1lvZiVT2

The guesser will, in fact, operate on an actual Unix password file, though it won't find any reasonably secure passwords in any reasonable amount of time. We'll be working with other files, which can be created with passfile.
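To make the file format concrete, here is a minimal sketch of splitting one such line into its two fields. The function name split_entry is hypothetical and is not taken from the download; passfile and scan1 have their own code for this. The sketch stops the hash at a second colon, if there is one, so that a longer passwd or shadow line would also split sensibly.

```cpp
#include <string>
#include <utility>

// Hypothetical helper (not from the download): split one "account:hash"
// line. Returns {account, hash}. The hash is empty if there is no colon,
// and stops at a second colon if the line has more fields.
std::pair<std::string, std::string> split_entry(const std::string &line)
{
    std::string::size_type colon = line.find(':');
    if (colon == std::string::npos)
        return {line, ""};

    std::string::size_type end = line.find(':', colon + 1);
    std::string::size_type len =
        (end == std::string::npos) ? std::string::npos : end - colon - 1;
    return {line.substr(0, colon), line.substr(colon + 1, len)};
}
```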

The starting distro will build on Linux and (probably) Mac. Windows folks will have to bear with some Unix for this one. You may use Sandbox remotely, fire up WSL, use a VM, install Linux on a flash drive, or any of several other things. It should also be possible to build libxcrypt or some other crypt on Windows, though I have not tried.

Download the code here: pwdthread.zip or pwdthread.tgz (same content). The code compiles on Linux, and has a Makefile:

    [bennet@localhost tmp]$ wget http://sandbox.mc.edu/~bennet/cs422b/asst/pwdthread.tgz
    [bennet@localhost tmp]$ tar zfx pwdthread.tgz
    [bennet@localhost tmp]$ cd pwdthread/
    [bennet@localhost pwdthread]$ ls
    Makefile  enum.h    misc.txt      pins.pwd  scan1.cpp
    enum.cpp  misc.pwd  passfile.cpp  pins.txt
    [bennet@localhost pwdthread]$ make
    g++ -g -lcrypt passfile.cpp -o passfile
    g++ -g -c -o scan1.o scan1.cpp
    g++ -g -c -o enum.o enum.cpp
    g++ -o scan1 scan1.o enum.o -lcrypt
    [bennet@localhost pwdthread]$ ls
    Makefile  enum.h  misc.pwd  passfile      pins.pwd  scan1      scan1.o
    enum.cpp  enum.o  misc.txt  passfile.cpp  pins.txt  scan1.cpp

The above-mentioned executables, passfile and scan1, should build. The download includes two password files, misc.pwd and pins.pwd, and files that can rebuild them, misc.txt and pins.txt. The .pwd files will probably work for you on Linux. Check like this:

    [bennet@localhost pwdthread]$ ./passfile misc.pwd check wombat underwater
    Passwords match.
    [bennet@localhost pwdthread]$ ./passfile misc.pwd check wombat wrongone
    Passwords do not match.
    [bennet@localhost pwdthread]$ ./passfile misc.pwd check smith2 '$nowf1ak3'
    Passwords match.
    [bennet@localhost pwdthread]$ ./passfile misc.pwd check smith2 underwater
    Passwords do not match.

If not, you have a different crypt and can regenerate them with source misc.txt or source pins.txt, which will use whatever crypt the build found. Do that, and repeat the above test. You should get the same results.

The other executable in the download is scan1, which guesses passwords. You might try running it on pins.pwd, which contains only short passwords of digits, so they can be guessed fairly quickly. Still, this will take at least a few seconds to find anything:

    [bennet@localhost pwdthread]$ ./scan1 pins.pwd DIGITS 1 10
    alvin20: 81
    swa: 93
    phil: 76
    ellen: 221
    subil: 381
    ^C

Here, I killed it with control-c after it found five of the passwords in the file.

What To Do

The project is simply to use threading to make scan1 faster. To wit:

    [bennet@localhost threadpass]$ time ./scan1 pins.pwd DIGITS 1 2
    alvin20: 81
    swa: 93
    phil: 76

    real    0m28.826s
    user    0m19.879s
    sys     0m8.737s
    [bennet@localhost threadpass]$ time ./scanN pins.pwd 3 DIGITS 1 2
    alvin20: 81
    swa: 93
    phil: 76

    real    0m10.645s
    user    0m20.685s
    sys     0m9.236s

The time command measures the running time of a program. Here I have run scan1 to guess digit passwords of length one to two, and it takes about 28.8 seconds to find the three which are encoded in that file. Then, I ran my threaded solution with 3 worker threads, and it was able to find those keys in a bit over 10 seconds.
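To put a number on that, speedup is the serial real time divided by the threaded real time. For the runs above that is 28.826 / 10.645, or about 2.7 with 3 threads, which is roughly 90% of perfect scaling. Note that user plus sys is nearly the same in both runs (about 28.6 s versus 29.9 s): the threads do the same total work, just side by side. The small sketch below does the arithmetic; the function names are mine, purely for illustration.

```cpp
// Speedup: serial wall-clock ("real") time over parallel wall-clock time.
double speedup(double serial_sec, double parallel_sec)
{
    return serial_sec / parallel_sec;
}

// Efficiency: speedup per thread. 1.0 would be perfect scaling.
double efficiency(double serial_sec, double parallel_sec, int nthread)
{
    return speedup(serial_sec, parallel_sec) / nthread;
}
```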

Evolving To Threads

The existing main function contains a loop (near the bottom) that gets all of the possible password guesses, each of which it sends to the scan function to check if it matches any existing password. This loop uses an enumerator object, part of the download, which generates all possible passwords in the specified character set and range. The code is there for your interest, but for this assignment we can just use it.

First change: move this loop into a separate function (this will become your thread function). I'll call it scandrive for this discussion, but you can call it what you like. You'll need to send in the password list and the enumerator object. Send the password list by const reference just as it is sent to scan. You might pass the enumerator by reference as well, or you could use a global.

Since a plain C++ thread discards whatever its function returns, make your scandrive return void. Scandrive should now contain the loop which runs through all the passwords provided by the enumerator and tests each. You also need to return a boolean to the caller that tells if any password was found, so it knows whether to print the final No passwords matched. message. This is a pain since the function needs to be void, so you will need to resort to an additional reference parameter or global to return this value to main. Replace the loop in main with a call to scandrive, and have it collect the success boolean. After all this work, you should have a program that does just the same thing as the one you started with :-). Test it and see that it is.
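The shape of this step might look like the sketch below. Be warned that ToyEnumerator and this version of scan are hypothetical stand-ins I made up so the sketch is complete on its own; the real enumerator in enum.h and the real scan in scan1.cpp have their own interfaces, and the real scan compares crypt output rather than plain text. The point is only the void function with a reference parameter carrying the result back.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the enumerator: hands out "0" through "9".
class ToyEnumerator {
public:
    // Put the next guess in out; return false when the guesses run out.
    bool next(std::string &out)
    {
        if (m_next > 9) return false;
        out = std::to_string(m_next++);
        return true;
    }
private:
    int m_next = 0;
};

// Hypothetical stand-in for scan: true if the guess matches any entry.
bool scan(const std::vector<std::string> &passwords, const std::string &guess)
{
    for (const std::string &p : passwords)
        if (p == guess) return true;
    return false;
}

// The loop lifted out of main. It returns void, so the result comes
// back through the found reference instead.
void scandrive(const std::vector<std::string> &passwords,
               ToyEnumerator &en, bool &found)
{
    std::string guess;
    while (en.next(guess))
        if (scan(passwords, guess)) found = true;
}
```

In main, you would declare a bool set to false, call scandrive with it, and then test it to decide whether to print the final message.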

Next, replace the simple call of scandrive in main with a single thread execution. Here is an example of calling a function as a thread. Start by including the header thread. Change the scandrive call to be a thread creation, then immediately run join to wait for the thread to finish. Now, for some technical reasons which I'll attempt to explain if you foolishly ask, you can't directly send a non-const reference to a thread function. The simplest thing is to use a utility provided for this: replace any parameter x sent by non-const reference with std::ref(x). (Alternatively, you can use the & operator and send pointers to the objects.) If you don't take care of any reference parameters this way, you will get one of the most opaque error messages the C++ compiler is able to create. And that's saying something.
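As a minimal sketch of the create-then-join pattern with std::ref, here is a toy thread function that reports through a non-const reference, much as scandrive reports its success flag. The names count_to and run_one_thread are hypothetical, chosen only for this illustration.

```cpp
#include <functional>   // std::ref
#include <thread>

// Hypothetical thread function: counts up to limit and reports the
// count through a non-const reference parameter.
void count_to(int limit, int &result)
{
    int n = 0;
    for (int i = 0; i < limit; ++i) ++n;
    result = n;
}

// Run count_to on one thread, wait for it, and return what it reported.
int run_one_thread(int limit)
{
    int result = 0;
    // Passing plain `result` here would not compile; std::ref wraps it
    // so the thread receives a real reference to our variable.
    std::thread t(count_to, limit, std::ref(result));
    t.join();   // wait for the thread before reading result
    return result;
}
```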

Compile this version and verify that it works. You will still have a program that does exactly the same thing, now with a bit more overhead and no speedup, because you're only running one thread.

We want to change the program to use multiple threads in order to search faster. Before actually doing that, change the parameter passing logic at the top of main to collect a number of threads after the file name. The first if in main checks that there are five words on the command line. Change that to six, since we're going to add one. Look just below the if where the file name and character set are collected into string variables, and between those collect an integer into a new variable called nthread. The min and max are collected just below, so that's how it's done. Also change the Usage help message (inside the if block) to indicate that the number of threads should appear after the file name. Your solution will now accept the command arguments shown in the threaded execution above, taking a thread count after the file name, which is then ignored. So you still haven't sped anything up.
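The new argument layout can be sketched as follows. This is not the code from scan1.cpp: the Options struct, the parse_args function, and the use of std::atoi are all my assumptions for illustration, and your main should collect the integer the same way it already collects min and max.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical holder for the command-line values.
struct Options {
    std::string file;
    int nthread;
    std::string charset;
    int minlen;
    int maxlen;
};

// Expects six words: program, file, thread count, character set, min, max.
// Returns false if the count is wrong or the thread count is not positive.
bool parse_args(int argc, const char *const *argv, Options &opt)
{
    if (argc != 6) return false;        // was five before nthread was added
    opt.file    = argv[1];
    opt.nthread = std::atoi(argv[2]);   // the new argument, after the file
    opt.charset = argv[3];
    opt.minlen  = std::atoi(argv[4]);
    opt.maxlen  = std::atoi(argv[5]);
    return opt.nthread > 0;
}
```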

Since we will be running multiple copies of scandrive at once, we will need to synchronize the data it will share. That would be the enumerator object and (probably) the success flag. The enumerator object will need to be locked when its next method is called. The example linked above shows how to create a mutex object and use lock and unlock to place the data operation in a critical section. There are several ways to apply that here. The example declares the mutex as a global. You can do that, but if you send the enumerator by reference, it is probably not a good idea to let the mutex be global. The two are closely related, and should stay together. You can do both global, or declare the mutex in the main and send both it and the enumerator by reference, but the cleanest way is probably to add it to the enumerator class, so that its next method acquires the lock before getting the next guess, then releases it before returning. (Be careful not to let a return statement miss the unlock.) Alternatively, you could add a new method which calls the existing next under lock and returns the result.
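Here is a minimal sketch of the "mutex inside the class" option. LockedEnumerator is a hypothetical toy that just counts, not the enumerator from enum.h. Instead of explicit lock and unlock calls, it uses std::lock_guard, a standard helper that locks in its constructor and unlocks in its destructor, so no return statement can miss the unlock.

```cpp
#include <mutex>
#include <string>

// Hypothetical toy enumerator: hands out "0", "1", ... up to count - 1.
class LockedEnumerator {
public:
    explicit LockedEnumerator(int count) : m_count(count) {}

    // Put the next guess in out; return false when the guesses run out.
    // Safe to call from several threads at once.
    bool next(std::string &out)
    {
        // Locks here; unlocks automatically on every path out of next.
        std::lock_guard<std::mutex> guard(m_lock);
        if (m_next >= m_count) return false;
        out = std::to_string(m_next++);
        return true;
    }

private:
    std::mutex m_lock;   // lives with the data it protects
    int m_next = 0;
    int m_count;
};
```

One side effect of this design: a class holding a std::mutex can no longer be copied, so such an object has to be passed around by reference or pointer.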

If your success flag is shared, it must be synchronized as well. You can use an additional mutex, or check out std::atomic. If you have a global flag, it will be shared, likewise if you have a single flag in the main which you send by reference. Sharing can be avoided, but is probably not worth the trouble.
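For the std::atomic route, a sketch under some assumptions: search_range and search_two_threads are hypothetical names, and the "search" is a trivial integer comparison standing in for the real password check. Two threads share one std::atomic<bool>, and either may set it without a mutex.

```cpp
#include <atomic>
#include <functional>
#include <thread>

// Hypothetical worker: sets the shared flag if target lies in [lo, hi).
void search_range(int lo, int hi, int target, std::atomic<bool> &found)
{
    for (int i = lo; i < hi; ++i)
        if (i == target) found = true;   // atomic store, no mutex needed
}

// Search [0, 1000) with two threads sharing one atomic flag.
bool search_two_threads(int target)
{
    std::atomic<bool> found(false);
    std::thread a(search_range, 0, 500, target, std::ref(found));
    std::thread b(search_range, 500, 1000, target, std::ref(found));
    a.join();
    b.join();
    return found;                        // atomic load
}
```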

After adding the synchronization, you might make sure your program still compiles and runs, though it still won't be any faster since you still haven't created any extra threads.

Now, time for that. In place of the single thread creation in main, make a loop, and store the threads in some container. Here is a page containing an example (the second code block), which creates 20 simple threads and stores them in a vector. (The & before the name of the function in the thread call is not required, and doesn't change the meaning.) Pay attention and create the number of threads specified in the nthread parameter, not the 20 from the example. Note also the second loop, which iterates through the vector of threads, and performs a join on each one. (The ampersand there has a different meaning, and is needed. Ah, C++!) Make sure you do not write a loop that creates a thread then waits within each iteration. That's not creating any parallelism. Now you should have a threaded program.
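The two-loop pattern can be sketched like this. The worker function is a hypothetical stand-in for scandrive that claims items from a shared atomic counter; only the structure of run_workers matters here: one loop that starts every thread, then a separate loop that joins them.

```cpp
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for scandrive: claims work items from a shared
// counter until they run out, counting each one it handles.
void worker(std::atomic<int> &next_item, int nitems, std::atomic<int> &handled)
{
    while (next_item.fetch_add(1) < nitems)
        ++handled;
}

// Start nthread workers, wait for all of them, and return how many
// items were handled in total.
int run_workers(int nthread, int nitems)
{
    std::atomic<int> next_item(0), handled(0);
    std::vector<std::thread> threads;

    // First loop: start every thread. No join in here, or the threads
    // would run one after another.
    for (int i = 0; i < nthread; ++i)
        threads.emplace_back(worker, std::ref(next_item), nitems,
                             std::ref(handled));

    // Second loop: wait for each one. The & is required, since
    // std::thread objects cannot be copied.
    for (std::thread &t : threads)
        t.join();

    return handled;
}
```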

If you want to avoid sharing the found flag, you must send a separate one to each thread. This involves creating a vector or array of booleans so you can send one to each thread. (And don't extend such a vector in the body of the thread starting loop, since that may invalidate your references.) Then, after the join, the main must loop through to see if any password was found. This loop must be done after all the joins complete, so only the main is using the values.
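A sketch of the unshared version, with the usual caveats: search_slice and search_private_flags are hypothetical names, and the integer search stands in for the real password check. One detail to watch: std::vector<bool> is a bit-packed special case that cannot hand out a bool& to each element, so this sketch uses a vector of char instead (a plain array of bool would also do).

```cpp
#include <functional>
#include <thread>
#include <vector>

// Hypothetical worker: sets its own private flag if target is in [lo, hi).
void search_slice(int lo, int hi, int target, char &found)
{
    for (int i = lo; i < hi; ++i)
        if (i == target) found = 1;
}

// Each of nthread workers searches a slice of 100 numbers and reports
// through its own flag.
bool search_private_flags(int nthread, int target)
{
    // Sized before any thread starts, and never resized afterward, so
    // the references handed to the threads stay valid.
    std::vector<char> flags(nthread, 0);
    std::vector<std::thread> threads;

    for (int i = 0; i < nthread; ++i)
        threads.emplace_back(search_slice, i * 100, (i + 1) * 100, target,
                             std::ref(flags[i]));
    for (std::thread &t : threads)
        t.join();

    // All joins are done, so only main touches the flags now.
    for (char f : flags)
        if (f) return true;
    return false;
}
```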

When you get your program working, you should be able to see good speedup with multiple threads. Use the time command to find out. Speedup will depend on the number of cores on the underlying machine. You probably will get a good return on each additional thread, up to the number of cores, and very little after. (VMs may give screwy results, since that will depend on how they map VM threads to hardware cores.)
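If you are not sure how many cores you have to play with, the standard library can usually tell you. The sketch below wraps std::thread::hardware_concurrency(), which returns 0 when the value cannot be determined; the wrapper name suggested_threads and its fallback parameter are hypothetical. The assignment still takes the thread count from the command line, so this is only a way to pick sensible values to try.

```cpp
#include <thread>

// Number of hardware threads the library reports, or fallback when the
// library cannot tell (it then reports 0).
unsigned suggested_threads(unsigned fallback)
{
    unsigned n = std::thread::hardware_concurrency();
    return n != 0 ? n : fallback;
}
```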

Submission

When your program works correctly and looks nice, submit it here. Send your threaded scan, and any other file you changed or added.

Crypt on Mac

Here's what I think I know about running on a Mac. This is just from Googling, since I don't have one.

The original Unix crypt is quite obsolete because of the improvement in hardware over the years. Macs are built on a Unix base, and Apple apparently keeps the original crypt function around for compliance with the Posix standard, which requires one to exist. But they don't use it in their software, since it is obsolete. (Posix just requires the function to exist in the API, but doesn't say much about how it works. Apparently, one that always returns an error and does nothing else would be compliant.)

Linux systems generally use an updated crypt, which is actually used to store system credentials. The specific choice would be distro-dependent, but the usual one seems to be something called libxcrypt. For Mac, that library seems available here. I wouldn't be able to tell you how to install it, or how to link your code to it.

But, you should be able to build on a Mac just using the standard crypt. This should, in fact, let your project find keys much faster, since the encryption algorithm is now way too easy. That is sufficient for this assignment, since the threading, and resulting speedup, is of most interest.