Today we're going to learn how to crawl the web. The goal of today's lab is to learn which elements make up a web page and how to extract structured information from them. You are free to use any programming language for this lab, but we suggest Java or Python.
Before you start, you may want to read up on the basics of HTML. Additionally, a useful resource that deals with crawling structured content from a website can be found here. This specific guide was written for Python, but similar tools exist for other programming languages as well. In Java you can use jsoup to fetch and analyze the web pages. The jsoup documentation explains how you can navigate a document.
We suggest you use Eclipse to program in Java. It is already installed on the computers. You can start it by typing eclipse4 & in the terminal.
We've prepared a website that shows a table with all actors and actresses who have been nominated for an Academy Award. Familiarize yourself with the page's source code by using the source inspector of your browser and solve the following exercises by writing a program/script. If you need help and cannot google a solution, feel free to ask the assistants.
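As a starting point, here is a minimal Python sketch of how the rows of an HTML table can be pulled out using only the standard library's html.parser module. The sample markup, the table id and the column names are made up for illustration and will not match the prepared page exactly, so inspect the real source first and adapt the sketch accordingly.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_table(html):
    """Return all table rows found in the given HTML string."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical markup; the real page's structure is an assumption here.
SAMPLE_HTML = """
<table id="nominees">
  <tr><th>Year</th><th>Name</th><th>Movie</th></tr>
  <tr><td>1999</td><td>Example Actor</td><td><a href="#">Example Movie</a></td></tr>
</table>
"""

rows = parse_table(SAMPLE_HTML)
header, records = rows[0], rows[1:]
print(header)
print(records)
```

To run this against a live page, you could read the HTML with urllib.request.urlopen(url).read().decode() and pass the resulting string to parse_table. In Java, jsoup gives you the same information with far less code.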
Hints:
- Press Ctrl + Shift + C in Firefox to open the inspector.
- Methods such as getElementById, getElementsByTag, children, select, etc. help you navigate the document.

To keep web traffic low and reduce the risk of being blacklisted, we have cloned some Rotten Tomatoes pages and are hosting them locally. Each movie detail page has a unique URL. Combine the year and movie title like this: http://10.0.0.1/m/year/title to access the local clone of the movie detail page. Transform the movie title to lower case, remove any apostrophe characters ('), and replace spaces and slashes (/) with underscore characters (_).
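The URL rule above can be sketched in a few lines of Python. The function names movie_slug and detail_url and the example titles are made up for illustration. The sketch handles only the plain ASCII apostrophe ('), as the rule states, and leaves all other punctuation untouched.

```python
BASE_URL = "http://10.0.0.1/m"  # local clone, as given in the lab description


def movie_slug(title):
    """Apply the title rule: lower case, drop apostrophes, spaces and slashes become '_'."""
    slug = title.lower()
    slug = slug.replace("'", "")
    slug = slug.replace(" ", "_").replace("/", "_")
    return slug


def detail_url(year, title):
    """Build the URL of the locally hosted movie detail page."""
    return f"{BASE_URL}/{year}/{movie_slug(title)}"


# Hypothetical titles, chosen only to exercise each part of the rule.
print(detail_url(2001, "An Example Movie"))
print(detail_url(2002, "Dr. Example's Day/Night"))
```

Keeping the title transformation in its own function makes it easy to check against a few known pages before you start fetching detail pages in a loop.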
Scrape any information from a website of your choosing.