labs:crawling

Differences

This shows you the differences between two versions of the page.

labs:crawling [2017/01/19 17:35] sitanne [Exercise 2 - Crawl Rottentomatoes]
labs:crawling [2020/08/31 21:03] (current)
Line 1: Line 1:
-====== Crawling ======
+====== Web Crawling ======
 Today we're going to learn how to crawl the web.
 The goal of today's lab is for you to learn which elements websites contain and how to extract this structured information.
-It's absolutely up to you which programming language you want to use for this lab. However, we suggest you use Python due to it's simplicity and because of the readily available crawling and database connection libraries.
+It's absolutely up to you which programming language you want to use for this lab. However, we suggest you use Java (or Python).
  
 Before you start, you may want to read up on the [[http://www.w3schools.com/html/html_basic.asp|basics of HTML]]. Additionally, a useful resource that deals with crawling structured content from a website can be found [[http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html|here]]. This specific guide was written for Python, but similar tools exist for other programming languages as well. In Java you can use jsoup to fetch and analyze the web pages. The [[https://jsoup.org/cookbook/extracting-data/dom-navigation|jsoup documentation]] explains how you can navigate a document.
+
+We suggest you use Eclipse to program in Java. It is already installed on the computers. You can start it by typing ''eclipse4 &'' in the terminal.
  
 ===== Exercise 1 - Crawl Academy Awards for Best Actor/Actress =====
  
-We've prepared a [[http://pc-10129.ethz.ch/academyawardnominees/|website]] that shows a table with all actors and actresses who have been nominated for an Academy Award. Familiarize yourself with the page's source code by using the source inspector of your browser and solve the following exercises. If you need help and cannot google a solution, feel free to ask the assistants.
+We've prepared a [[http://pc-10129.ethz.ch/academyawardnominees/|website]] that shows a table with all actors and actresses who have been nominated for an Academy Award. Familiarize yourself with the page's source code by using the source inspector of your browser and solve the following exercises by writing a program/script. If you need help and cannot google a solution, feel free to ask the assistants.
  
   - Extract all relevant entries from the academy award nominees table on the linked website.
-  - Design a database schema and create a sqlite database with the collected data. Make sure you give your database the columns: **year**, **event**, **movie**, **actor**, **role** and **won**. Be careful to [[https://docs.python.org/2/library/sqlite3.html|escape]] all strings when adding rows to the database.
+  - Generate a text file with the information from the website, each entry on a new line.
  
-===== Exercise 2 - Crawl Rottentomatoes =====
+Hints:
+  * Press ''Ctrl'' + ''Shift'' + ''C'' in Firefox to open the inspector.
+  * Have a look at the available methods in jsoup to select elements: ''getElementById'', ''getElementsByTag'', ''children'', ''select'', etc.
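
The extraction step in Exercise 1 can be sketched as follows. This is a minimal sketch in Python, one of the two suggested languages, and it uses only the standard library's ''html.parser'' module instead of jsoup or Beautiful Soup so that it runs without extra packages. The names ''TableRowParser'', ''extract_rows'' and ''rows_to_lines'' are hypothetical helpers, and the sketch assumes a single, non-nested table whose cells are ''td'' or ''th'' elements. The real page may need a more specific selection.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside an open cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html):
    """Return all table rows found in the HTML as lists of cell strings."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_lines(rows, sep="\t"):
    """Turn each row into one line of text; the tab separator is an assumption."""
    return [sep.join(row) for row in rows]
```

Calling ''rows_to_lines(extract_rows(html))'' on the downloaded page source gives one tab-separated string per table row, which can then be written to the text file line by line.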
+===== Exercise 2 - Crawl Rotten Tomatoes =====
  
-To keep web traffic low and reduce the risk of being blacklisted, we have cloned some rottentomatoes pages and are hosting them locally. You can access the detail page through a unique url. Combine the year and movie title like this: http://10.0.0.1/m/year/title to access the local clone of the movie detail page. (Transform the movie title to lower case. Remove any apostrophe characters (') and replace spaces and backslashes (/) with underline characters (_)).
+To keep web traffic low and reduce the risk of being blacklisted, we have cloned some Rotten Tomatoes pages and are hosting them locally. You can access the detail page through a unique URL. Combine the year and movie title like this: http://10.0.0.1/m/year/title to access the local clone of the movie detail page. (Transform the movie title to lowercase. Remove any apostrophe characters (') and replace spaces and slashes (/) with underscore characters (_).)
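
The URL rule described above can be sketched as a small Python helper. The function name ''movie_url'' and the constant ''BASE_URL'' are hypothetical. The sketch applies only the transformations the text lists (lowercase, drop apostrophes, replace spaces and slashes with underscores) and makes no assumption about how other special characters are handled on the server.

```python
BASE_URL = "http://10.0.0.1/m"  # local clone address given in the exercise text


def movie_url(year, title):
    """Build the local detail-page URL from a year and a movie title."""
    slug = title.lower()            # transform the title to lowercase
    slug = slug.replace("'", "")    # remove apostrophes
    slug = slug.replace(" ", "_")   # spaces become underscores
    slug = slug.replace("/", "_")   # slashes become underscores
    return f"{BASE_URL}/{year}/{slug}"
```

For a made-up title such as ''A Movie's Title/Part Two'' from the year 2000, the helper returns ''http://10.0.0.1/m/2000/a_movies_title_part_two''.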
  
     - Visit any of the local movie sites. Which element contains the [[https://en.wikipedia.org/wiki/Rotten_Tomatoes#Tomatometer_critic_aggregate_score|tomatometer]] score of the movie? Which element contains the audience score?
-    - Access each of the cloned websites, extract the tomatometer and the audience score and insert them into the previously created Academy Awards database as additional columns **tomatometer** and **audience_score**. Some movies are missing on our local server. Also, occasionally, you'll see movies that don't have a tomatometer score. Think about how you want to handle such a missing movie page or tomatometer score.
+    - Access each of the cloned websites and extract the tomatometer and the audience score. Some movies are missing on our local server. Also, occasionally, you'll see movies that don't have a tomatometer score. Think about how you want to handle such a missing movie page or tomatometer score.
-    - Extract the first four actors for each movie from the the cloned websites.
+    - Additionally, extract the genre and the runtime for each movie from the cloned websites.
-    - Write the information about the movie into a text file, each movie on a new line.
+    - Write the information about the movies into a text file, each movie on a new line.
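
One possible way to deal with missing pages and missing scores is sketched below in Python. It is only one option among several, not the intended solution: a page that the server reports as HTTP 404 is returned as ''None'', and any missing value is written as the placeholder ''NA''. The helper names ''fetch_page'', ''format_movie_line'' and ''write_movies'', the placeholder and the tab separator are all assumptions made for this sketch. Extracting the scores themselves is left out because the lab asks you to find the relevant elements yourself.

```python
import urllib.error
import urllib.request

MISSING = "NA"  # placeholder chosen for this sketch, not prescribed by the lab


def fetch_page(url, opener=urllib.request.urlopen):
    """Return the page's HTML, or None if the server answers with HTTP 404.

    The opener parameter exists only so the function can be tried without
    network access; other HTTP errors and connection errors are re-raised.
    """
    try:
        with opener(url) as response:
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


def format_movie_line(fields, sep="\t"):
    """Join one movie's fields into a single line, marking None as missing."""
    return sep.join(MISSING if f is None else str(f) for f in fields)


def write_movies(path, movies):
    """Write one line per movie; movies is an iterable of field sequences."""
    with open(path, "w", encoding="utf-8") as out:
        for fields in movies:
            out.write(format_movie_line(fields) + "\n")
```

With this layout a movie whose page is missing still gets its own line, so the output file keeps one line per movie and the gaps stay visible.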
  
 ===== Bonus - Crawl Again =====
labs/crawling.1484843725.txt.gz · Last modified: 2020/08/31 21:03 (external edit)