Eventually, we want to be able to crawl the web and parse web pages. As such, we must be able to find the links embedded in the HTML code of the web pages. This assignment is a step towards this functionality. Specifically, you must implement the ArrayList<URL> listLinks(URL base, String html)
method to return a list of HTTP or HTTPS links from the HTML a
anchor tag href
attribute. This should not include links in the href
attribute of the link
tag!
You have been provided a URL clean(URL url)
method that will make sure URLs are in a semi-consistent properly-encoded form. You will need to figure out how to deal with relative vs absolute links, however. See below for more.
We do not explicitly cover HTML in this class. However, there are many helpful resources on the web for more about this simple markup language. Some resources include:
You will need to be familiar with the anchor tag <a>
for this assignment. Some resources include:
- MDN - : The Anchor element
- MDN - What is a URL
- The Java Tutorials - Creating a URL
The anchor tag is used to create links on web pages. For example:
<a href="https://www.cs.usfca.edu/">USF CS</a>
The above code will generate the link USF CS, where the link text is USF CS
and the link destination is https://www.cs.usfca.edu/
. The link will always be placed in the href
attribute of the a
tag, but not all a
tags will have this attribute. For example, this is a valid a
tag without the href
attribute:
<a name="home" class="bookmark">Home</a>
And, the href
attribute may appear in other tags. For example, this is a valid link
tag to include a style sheet:
<link rel="stylesheet" type="text/css" href="style.css">
The majority of URLs on webpages are relative (i.e. specified relative to the current webpage URL). You will need to convert those relative URLs into an absolute URL. For this, you may use the java.net.URL
class. For example, consider the following:
URL base = new URL("http://www.cs.usfca.edu/~sjengle/cs212/");
URL absolute = new URL(base, "../index.html");
// outputs http://www.cs.usfca.edu/~sjengle/index.html
System.out.println(absolute);
This works even if the provided string was already absolute. For example:
URL base = new URL("http://www.cs.usfca.edu/~sjengle/cs212/");
URL absolute = new URL(base, "http://www.example.com/");
// outputs http://www.example.com/
System.out.println(absolute);
Because of this, you do not need to test if a link was relative or absolute. You can simply always use the above code.
The official name of this homework is LinkParser
. This should be the name you use for your Eclipse Java project and the name you use when running the homework test script.
See the Homework Guides for additional details on homework requirements and submission.