CSc 231 Assignment 2

70 pts

Bringing You All The Hits

Due: Mar 9

Write a program to compute and print statistics from the Apache web server log on Sandbox. Here's a run of mine:

bennet 1002%webstats.pl /var/log/httpd/access_log
Hits: 58964
By Request type:
   GET: 58499
   HEAD: 48
   OPTIONS: 2
   POST: 415
By Response Code:
   200: 52444
   206: 54
   301: 78
   304: 4139
   400: 2
   401: 9
   403: 56
   404: 2179
   500: 3
764.0 Hits/Hour
1812 unique visitors.

The Apache web server records entries in its access log that look like the following. I've added line breaks for display, but each entry is actually recorded on a single line:
61.229.83.7 - - [18/Feb/2007:05:35:15 -0600] 
   "GET /~bennet/cs110/textbook/61p12.gif HTTP/1.1" 200 265
   "http://sandbox.mc.edu/~bennet/cs110/textbook/module6_1.html" 
   "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-TW; rv:1.8.0.9) Gecko/20061206 Firefox/1.5.0.9"
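The fields, in order, are: the client IP address, the remote identity and the authenticated user name (both just - in this example), the time stamp in square brackets, the quoted request line, the response code, the size of the response in bytes, the quoted referring URL, and the quoted browser identification string.

Your program should take any number of file names on the command line, and report statistics aggregated from all the files. Report the following, as in the run above: the total number of hits, the number of hits by request type, the number of hits by response code, the hits per hour, and the number of unique visitors.

The request type is the first word of the request. It is GET in the above example, which is by far the most common.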
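
Here is one way a pattern might pull the interesting fields out of a record. It is only a sketch: parse_line is a name I made up, and the pattern assumes the layout of the sample record above, so a request containing an embedded quote, for instance, would need more care.

```perl
use strict;
use warnings;

# The sample record from above, joined back onto a single line.
my $line = '61.229.83.7 - - [18/Feb/2007:05:35:15 -0600] '
    . '"GET /~bennet/cs110/textbook/61p12.gif HTTP/1.1" 200 265 '
    . '"http://sandbox.mc.edu/~bennet/cs110/textbook/module6_1.html" '
    . '"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-TW; rv:1.8.0.9) '
    . 'Gecko/20061206 Firefox/1.5.0.9"';

# parse_line is a hypothetical helper, not part of any library.
# It returns (client IP, time stamp, request type, response code),
# or an empty list if the record does not match the expected layout.
sub parse_line {
    my ($rec) = @_;
    if ($rec =~ /^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+)[^"]*" (\d{3}) /) {
        return ($1, $2, $3, $4);
    }
    return;
}

my ($ip, $stamp, $type, $code) = parse_line($line);
print "ip=$ip type=$type code=$code stamp=$stamp\n";
```

The request type is captured as the first run of non-blank characters inside the quoted request, which matches the "first word" rule above.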

The number of unique visitors is just the number of unique client IP addresses. Use a hash to keep track of the ones you've seen before.

For the hits per hour, compute the time period for each file as the difference between the last and first time stamp. Compute that time in hours for each file, total the times, and divide the total hits by the total time.
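
The arithmetic can be sketched as follows. The helper name hits_per_hour and the per-file figures are made up for illustration, and the time stamps are assumed to have already been converted to Unix seconds.

```perl
use strict;
use warnings;

# hits_per_hour is a hypothetical helper. Each argument is a hash
# reference describing one file: its hit count and its first and last
# time stamps in Unix seconds.
sub hits_per_hour {
    my @files = @_;
    my ($hits, $seconds) = (0, 0);
    for my $f (@files) {
        $hits    += $f->{hits};
        $seconds += $f->{last} - $f->{first};   # time covered by this file
    }
    return if $seconds == 0;                    # avoid dividing by zero
    return $hits / ($seconds / 3600);           # total hits / total hours
}

# Made-up figures: 300 hits over 2 hours, and 100 hits over 30 minutes.
my $rate = hits_per_hour(
    { hits => 300, first => 0,     last => 7200  },
    { hits => 100, first => 10000, last => 11800 },
);
printf "%.1f Hits/Hour\n", $rate;   # 400 hits / 2.5 hours = 160.0
```

Note that the gap between the two files (from 7200 to 10000 here) is not counted, since each file's period is measured separately.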

For computing the time from the timestamp in the file, first extract the parts. You can ignore the last part, which is the time zone, as our server does not relocate during operation. A pattern works well for this, or you can use split. When you get the parts, you can use timelocal to convert the time to a Unix time stamp in seconds. The difference between the first and last times will be the length of time covered by the file in seconds. On Sandbox, say man Time::Local to find out how to use this beastie. You may also want to run man localtime and see the description of the fields in struct tm. These are relevant to the ranges of the values.
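
A minimal sketch of the conversion might look like this. The helper name stamp_to_seconds and the second time stamp are my own inventions; timelocal itself comes from Time::Local and takes seconds, minutes, hours, day of month, month (0 to 11), and year.

```perl
use strict;
use warnings;
use Time::Local;

# Map month names from the log to the 0..11 range timelocal expects.
my %month = (
    Jan => 0, Feb => 1, Mar => 2, Apr => 3, May => 4,  Jun => 5,
    Jul => 6, Aug => 7, Sep => 8, Oct => 9, Nov => 10, Dec => 11,
);

# stamp_to_seconds is a hypothetical helper. It takes the bracketed
# part of a record, such as "18/Feb/2007:05:35:15 -0600", ignores the
# time zone, and returns a Unix time stamp in seconds (or nothing if
# the text does not look like a time stamp).
sub stamp_to_seconds {
    my ($stamp) = @_;
    my ($day, $mon, $year, $hour, $min, $sec) =
        $stamp =~ m{^(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+)}
        or return;
    return unless exists $month{$mon};
    return timelocal($sec, $min, $hour, $day, $month{$mon}, $year);
}

my $first = stamp_to_seconds('18/Feb/2007:05:35:15 -0600');   # from the sample record
my $last  = stamp_to_seconds('18/Feb/2007:07:05:15 -0600');   # made up for illustration
my $span  = $last - $first;
printf "%d seconds = %.2f hours\n", $span, $span / 3600;
```

The two stamps are ninety minutes apart, so the span should come out to 5400 seconds unless your local time zone happens to change its clocks in between.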

Generally, hashes are your friends. They're very useful for keeping track of which visitors you have already seen, and for keeping counts by request type or code. You may also find one useful to map month names from the log file to numbers for timelocal.
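
As a small illustration of the counting idiom, here is a sketch that tallies a few made-up hits. The data and the variable names are invented; in your program the values would come from the log records.

```perl
use strict;
use warnings;

# Made-up (client IP, request type, response code) triples standing in
# for values parsed from log records.
my @hits = (
    ['10.0.0.1', 'GET',  200],
    ['10.0.0.2', 'GET',  404],
    ['10.0.0.1', 'POST', 200],
    ['10.0.0.1', 'GET',  304],
);

my (%seen, %by_type, %by_code);
for my $hit (@hits) {
    my ($ip, $type, $code) = @$hit;
    $seen{$ip} = 1;        # a repeated address just overwrites its entry
    $by_type{$type}++;     # a missing key starts at zero automatically
    $by_code{$code}++;
}

print "Hits: ", scalar(@hits), "\n";
print "By Request type:\n";
print "   $_: $by_type{$_}\n" for sort keys %by_type;
print "By Response Code:\n";
print "   $_: $by_code{$_}\n" for sort keys %by_code;
print scalar(keys %seen), " unique visitors.\n";
```

The number of keys in %seen is the number of unique visitors, because a hash holds each key only once no matter how many times it is stored.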

You may want to study the example program which also reads the log files. It is rather different from this assignment, but has some things in common, and some code you might want to swipe.

The log files on Sandbox are located in /var/log/httpd/. The current one, which the server is writing, is access_log, and older ones are access_log.1, access_log.2, etc. On Sandbox, they are set to be publicly readable. If you want to test somewhere other than Sandbox, you can copy a log file, or a portion of one, to another computer. In any case, you should create at least one test file containing just a few lines from the log file. This will allow you to compute the correct answers by hand in order to test your program.

Submission

When your program works and is properly indented and commented, submit it over the web here.