CSc 220 Assignment 4

I Know I Left That Page Somewhere

Assigned
Due

Mar 20
90 pts
Apr 4
In this project, you will create a class URL to create objects representing Internet URLs. A URL object will be initialized with a URL string, which the object will parse to extract its parts. Our URLs will have four: A protocol, a host name, a port number, and a path. This is a simplification, as full URLs may have several other parts, and we are limiting the protocols to http and https. For instance, this code
URL url("http://www.somesite.com/this/place.html"); cout << url.to_string() << endl; cout << url.get_host() << " " << url.get_path() << endl; url.set_proto("https"); cout << url.to_string() << endl;
will create the object url containing the given location. It then prints it out again using the to_string method, then prints separately the host, www.somesite.com, and the path /this/place.html. Finally, it changes the protocol to https, and prints https://www.somesite.com/this/place.html.

Once the url is created, there are methods for getting and setting its parts, and for other sorts of changes.

Requirements

Your class should implement the interface following, and satisfy the other requirements in this section. The assignment is to implement the class, not to build code that uses it. There is a test driver attached to exercise your class.

Interface

URL u(urlstring)
Your URL class should have a single constructor which takes a string representing a URL. It initializes the object u with the protocol, host name, port number and path given therein. Details of the argument string format are given below. If its format is not correct, throw the exception std::invalid_argument.
u.get_proto()
u.get_host()
u.get_port()
u.get_path()
For any URL object u, there is a getter for each of the parts. Each method returns a string, except for get_port(), which returns an integer. Each of these methods should be declared const, meaning it does not change u. For instance, get_proto() should be declared something like this:
string get_proto() const { ... }
u.set_proto(protostring)
Set the protocol in URL object u to protostring. The argument is type string, which must be either "http" or "https". If you send any other string, the method should throw std::invalid_argument.
u.set_host(hostname)
Set the host name to the given string.
u.set_port(portno)
Set the port number to the given integer portno. The value of the port number must be in the range 1 to 65535, or the program should throw the std::invalid_argument exception.
u.set_path(path)
Set the path. Argument is a string. The path should start with a slash; go ahead and add one if there is none. If a non-empty path ends with a slash, you may discard that.
u.to_string()
Return a string representation of the URL. This will generally look like the constructor argument, after any changes made by the setters or other manipulating methods. The port number should appear only when it differs from the default port corresponding to the protocol. This method should also be declared const.
u.up()
Remove the last component of the path, if there is one. If the path is empty, does nothing.
u.up(n)
Runs the first up n times. If n is negative, does nothing.
u.rel(newurl)
This takes a string newurl and applies it to u. This is similar to what happens when you click on a link in a browser. The newurl is a link you click on, and u is the location of the page you are leaving. This method changes u to the new location. There are three cases.
  • If newurl starts with a protocol name and a colon, this replaces the contents of the u under the same rules as the constructor.
  • If newurl starts with a slash, this method behaves just like set_path.

  • Otherwise, this method uses newurl to extend the existing path: the last component of u (if any) is removed (much like up()), then the components of newurl are added to the end of it. For instance,
    URL u("https://www.over.here/come/see/this.html"); u.rel("the/elephant.html");
    The resulting location is https://www.over.here/come/see/the/elephant.html.
    Note also that leading dot components in the extension can remove components from the original path. (See the discussion on paths below.) So
    URL u("https://www.over.here/come/see/this.html"); u.rel("../look/here.html");
    gives https://www.over.here/come/look/here.html.

URL Syntax

The constructor accepts a string representing a URL having the following syntax:
protocol⟩://⟨hostname⟩[:⟨port⟩][⟨path⟩]
where
protocolThe protocol name. For this project, we consider only http or https.
hostnameThe server host name.
portThe port number, which is optional and usually omitted along with the colon marker. (The brackets in the syntax indicate that the colon and port number are optional, they are not part of the string.) The port is a number used to identify the service. If the port is omitted, the default depends on the protocol. For http, it is 80; for For https, 443. If the port is given, it must be in the range 1 through 65535.
pathThe url path. This may also be omitted, in which case a single slash is assumed. If the path is not empty, it must start with a slash. If it is non-empty and ends with a slash, you may ignore the trailing slash.

When building the return value of to_string, construct a string of the same format. Omit the port number unless it differs from the default value for the protocol. The path should always start with a slash; if it is empty, represent it as a single slash.

About URL Paths

The path of the URL is series of components, separated by slashes in the input. When processing a string representing a URL, if it has empty components, they are discarded. (Another way to say that is that a group of slashes together is equivalent to a single slash.) A component of a single dot (period) is also discarded. A component of two dots is special. It is not only discarded itself, but it removes the component before it (if there is one). The following all denote the same URL:
http://host.com/this/path/here http://host.com/this//path/./here http://host.com/../this/other/../path/here/also/.. http://host.com/this//path/and/././the/stuff/../../../here
You must resolve empty, . and .. components on input. Do not generate such components on output.

The strangeness of dots in URLs is a design feature from Unix file system paths which was adopted into the URL format. So that whom you blame.

Additional Requirements and Details

Some Advice

You might want to start by looking over the test driver. It may clarify how some of the methods operate. You might want to start by creating a URL class with stubs so you can get a running program. Stubs are just versions of the methods which have the correct signature, but have empty bodies or just return zero or the empty string.

You will need to create four private data fields to store the four parts (protocol, host name, port and path). It's quite possible to keep the path as a string and scan for slashes whenever it is needed to exact an individual component. I found it easier to store the path as a std::list of strings, where each string is a component. I split the path up when it is created, then don't have to worry about scanning for slashes again. I can also resolve dot components and discard empty ones when I make the list, and not worry about them again.

Your methods will never need to print anything. They provide service to some client, which will do any required printing. You may well find it useful use temporary debugging prints, but if you you think you need prints to implement any of the methods, you've missed something.

I wrote several private helper methods. One thing you should consider is creating a private method that does all the work of the constructor. The actual constructor body then becomes a single-statement call to this private method. The reason for this is the first case of the rel method. It is a work-around for the fact that C++ will not allow you call the constructor as a regular method.

Since I used a list of strings to store my path, I found it useful to have a method that splits a string by its slashes and puts them into a list. I use this in the constructor implementation, set_path, and rel, and it's very nice not to have to write it three times.

You may find other code repeated between the constructor and setters. Some small private functions can make this easier, and improve the quality of the code.

Feel free to combine the two versions of up using a default parameter value.

What We Are Ignoring

This assignment simplifies URLs in a number of ways. For instance, most any protocol can be used, rather than the two allowed. URLs also allow for embedded credentials, form data, and certainly other things. We're not doing that, either.

URLs use a fairly limited character set, with a coding scheme to represent characters not otherwise allowed. Host names are have even more limited character sets (unless internationalized names are used, with its own separate coding scheme). We are just sort of assuming that the caller is not violating any character set rules. A more sophisticated URL class would check these things and/or perform appropriate recoding.

Another detail is that we discard the slash at the end of a non-empty path. This is usually safe, but ignores some subtleties.

Submission

When your program works and is properly formatted and commented, submit it online here.