Web Crawling

Today we're going to learn how to crawl the web. The goal of today's lab is to learn which elements websites contain and how to extract structured information from them. You are free to choose the programming language for this lab. However, we suggest you use Java (or Python).

Before you start, you may want to read up on the basics of HTML. Additionally, a useful resource that deals with crawling structured content from a website can be found here. This specific guide was written for Python, but similar tools exist for other programming languages as well. In Java you can use jsoup to fetch and analyze the web pages. The jsoup documentation explains how you can navigate a document.

We suggest you use Eclipse to program in Java. It is already installed on the computers. You can start it by typing eclipse4 & in the terminal.

Exercise 1 - Crawl Academy Awards for Best Actor/Actress

We've prepared a website that shows a table with all actors and actresses who have been nominated for an Academy Award. Familiarize yourself with the page's source code by using the source inspector of your browser, then solve the following exercises by writing a program/script. If you need help and cannot find a solution online, feel free to ask the assistants.

  1. Extract all relevant entries from the Academy Award nominees table on the linked website.
  2. Generate a text file with the information from the website, each entry on a new line.
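
If you work in Python, the two steps above could be sketched roughly as follows, using only the standard library. This is a minimal sketch, not the reference solution: it assumes the nominees are stored in ordinary `<tr>`/`<td>` table rows, and `SAMPLE_HTML`, `TableRowParser`, `extract_rows`, `format_row` and `write_rows` are made-up names for illustration. Check the real page in your source inspector, since its markup may differ.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def _close_cell(self):
        if self._cell is not None and self._row is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()   # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()


def extract_rows(html):
    """Return all table rows found in the HTML as lists of cell texts."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def format_row(row):
    """One entry per line: cells separated by tabs."""
    return "\t".join(row)


def write_rows(rows, path):
    with open(path, "w", encoding="utf-8") as out:
        for row in rows:
            out.write(format_row(row) + "\n")


# Invented stand-in for the real page, only to show the expected shape.
SAMPLE_HTML = """
<table>
  <tr><th>Year</th><th>Name</th><th>Film</th></tr>
  <tr><td>1999</td><td>Example <a href="#">Person</a></td><td>Example Film</td></tr>
</table>
"""

rows = extract_rows(SAMPLE_HTML)
```

Calling `write_rows(rows, "nominees.txt")` would then produce the text file, one entry per line. Note that the header row is kept here; whether it counts as a "relevant entry" is up to you.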

Hints:

Exercise 2 - Crawl Rotten Tomatoes

To keep web traffic low and reduce the risk of being blacklisted, we have cloned some Rotten Tomatoes pages and are hosting them locally. Each movie's detail page has a unique URL. Combine the year and movie title like this: http://10.0.0.1/m/year/title to access the local clone of the movie detail page. (Transform the movie title to lower case, remove any apostrophe characters ('), and replace spaces and slashes (/) with underscore characters (_).)
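
In Python, the URL rule above might be written like this. It is a small sketch: the function name `movie_url` and the constant `BASE_URL` are our own choices, and the titles in the usage note only illustrate the transformation, they say nothing about which movies exist on the local server.

```python
BASE_URL = "http://10.0.0.1/m"


def movie_url(year, title):
    """Build the local detail-page URL from a year and a movie title."""
    slug = title.lower()                              # lower case
    slug = slug.replace("'", "")                      # drop apostrophes
    slug = slug.replace(" ", "_").replace("/", "_")   # spaces and slashes -> _
    return f"{BASE_URL}/{year}/{slug}"
```

For example, `movie_url(1993, "Schindler's List")` gives `http://10.0.0.1/m/1993/schindlers_list`, and `movie_url(1997, "Face/Off")` gives `http://10.0.0.1/m/1997/face_off`.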

  1. Visit any of the local movie pages. Which element contains the tomatometer score of the movie? Which element contains the audience score?
  2. Access each of the cloned pages and extract the tomatometer and the audience score. Some movies are missing on our local server. Also, occasionally, you'll see movies that don't have a tomatometer score. Think about how you want to handle such a missing movie page or tomatometer score.
  3. Additionally, extract the genre and the runtime for each movie from the cloned websites.
  4. Write the information about the movies into a text file, each movie on a new line.
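
One possible way to deal with missing pages and missing scores in Python is sketched below. It is only one design among several: `fetch_page`, `format_movie_line` and the `"N/A"` placeholder are names and conventions we made up for illustration, and the actual score extraction is deliberately left out, since finding the right elements is part of the exercise.

```python
import urllib.error
import urllib.request

MISSING = "N/A"  # placeholder written for any value we could not extract


def fetch_page(url, timeout=10):
    """Return the page HTML, or None if the page is missing or unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError):
        # HTTP errors such as 404 and connection problems both end up here.
        return None


def format_movie_line(year, title, tomatometer=None, audience=None,
                      genre=None, runtime=None):
    """One movie per line, tab-separated; None becomes the placeholder."""
    fields = [year, title, tomatometer, audience, genre, runtime]
    return "\t".join(MISSING if f is None else str(f) for f in fields)
```

With this approach, a movie whose page is missing still gets a line in the output file (with every extracted field set to `N/A`), so the file stays aligned with the nominees list. Skipping such movies entirely, or logging them to a separate file, would be equally reasonable choices.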

Bonus - Crawl Again

Scrape any information from a website of your choosing.