labs:crawling

Differences

This shows you the differences between two versions of the page.

labs:crawling [2017/01/19 15:22]
sitanne
labs:crawling [2020/08/31 21:03]
Line 1: Line 1:
-====== Crawling ====== 
-Today we're going to learn how to crawl the web. 
-The goal of today's lab is to learn which elements make up a web page and how to extract structured information from them. 
-Which programming language you use for this lab is entirely up to you. However, we suggest Python because of its simplicity and its readily available crawling and database connection libraries. 
- 
-Before you start, you may want to read up on the [[http://www.w3schools.com/html/html_basic.asp|basics of HTML]]. Additionally, a useful resource that deals with crawling structured content from a website can be found [[http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html|here]]. This specific guide was written for Python, but similar tools exist for other programming languages as well. 
- 
-===== Exercise 1 - Crawl Academy Awards for Best Actor/Actress ===== 
- 
-We've prepared a [[http://pc-10129.ethz.ch/academyawardnominees/|website]] that shows a table with all actors and actresses who have been nominated for an Academy Award. Familiarize yourself with the page's source code by using the source inspector of your browser and solve the following exercises. If you need help and cannot find a solution online, feel free to ask the assistants. 
- 
-  - Extract all relevant entries from the Academy Award nominees table on the linked website. 
-  - Design a database schema and create a SQLite database with the collected data. Make sure your table has the columns **year**, **event**, **movie**, **actor**, **role** and **won**. Be careful to [[https://docs.python.org/2/library/sqlite3.html|escape]] all strings when adding rows to the database, for example by passing values as query parameters instead of building the SQL string by hand. 
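The two steps above can be sketched as follows. This is only a minimal illustration using the standard library (''html.parser'' and ''sqlite3''); the sample markup, the made-up rows, the table name ''nominees'' and the helper names ''TableParser'' and ''store_nominees'' are assumptions for the example, and the real page's table layout may well differ.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical sample markup with made-up rows; inspect the real page
# to find out how its table is actually structured.
SAMPLE_HTML = """
<table>
  <tr><th>Year</th><th>Event</th><th>Movie</th><th>Actor</th><th>Role</th><th>Won</th></tr>
  <tr><td>1995</td><td>Best Actor</td><td>Example's Movie</td><td>Jane Doe</td><td>Lead</td><td>yes</td></tr>
  <tr><td>1995</td><td>Best Actress</td><td>Another Film</td><td>John Roe</td><td>Support</td><td>no</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows only contain <th>, so they stay empty
                self.rows.append(tuple(self._row))
            self._row = None


def store_nominees(conn, rows):
    """Create the table if needed and insert rows via '?' placeholders."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS nominees ("
        "year INTEGER, event TEXT, movie TEXT, actor TEXT, role TEXT, won INTEGER)"
    )
    # Placeholders let sqlite3 handle quoting, so the apostrophe in
    # "Example's Movie" cannot break the statement.
    conn.executemany(
        "INSERT INTO nominees (year, event, movie, actor, role, won) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        [(int(y), e, m, a, r, 1 if w.lower() == "yes" else 0)
         for y, e, m, a, r, w in rows],
    )
    conn.commit()


parser = TableParser()
parser.feed(SAMPLE_HTML)
conn = sqlite3.connect(":memory:")  # use a file name for a persistent database
store_nominees(conn, parser.rows)
```

A library such as Beautiful Soup would shorten the parsing part considerably; the standard-library parser is used here only to keep the sketch free of extra dependencies.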
- 
-===== Exercise 2 - Crawl Rotten Tomatoes ===== 
- 
-To keep web traffic low and reduce the risk of being blacklisted, we have cloned some Rotten Tomatoes pages and are hosting them locally. You can access each movie's detail page through a unique URL. Combine the year and movie title like this: http://10.0.0.1/m/year/title to access the local clone of the movie detail page. (Transform the movie title to lower case, remove any apostrophe characters (') and replace spaces and slashes (/) with underscores (_).) 
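The URL rule described above could be written as a small helper. This is a sketch only: the function name ''movie_url'' and the ''base'' parameter are assumptions for illustration, and the title in the usage line is made up.

```python
def movie_url(year, title, base="http://10.0.0.1/m"):
    """Build the local detail-page URL for a movie from its year and title."""
    slug = title.lower()             # lower-case the title
    slug = slug.replace("'", "")     # drop apostrophes
    slug = slug.replace(" ", "_")    # spaces become underscores
    slug = slug.replace("/", "_")    # slashes become underscores
    return f"{base}/{year}/{slug}"


# Made-up title that exercises every rule at once.
example = movie_url(2001, "An Example's Title/Part 2")
```

Here ''example'' evaluates to ''http://10.0.0.1/m/2001/an_examples_title_part_2''.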
- 
-    - Visit any of the local movie pages. Which element contains the [[https://en.wikipedia.org/wiki/Rotten_Tomatoes#Tomatometer_critic_aggregate_score|tomatometer]] score of the movie? Which element contains the audience score? 
-    - Access each of the cloned pages, extract the tomatometer and the audience score and insert them into the previously created Academy Awards database as additional columns **tomatometer** and **audience_score**. Some movies are missing on our local server. Also, you'll occasionally see movies that don't have a tomatometer score. Think about how you want to handle such a missing movie page or tomatometer score. 
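One possible way to handle missing pages and missing scores is to store ''NULL'' for anything that could not be retrieved. The sketch below assumes a table named ''nominees'' with **year** and **movie** columns; the helper names ''fetch_page'', ''add_score_columns'' and ''save_scores'' are hypothetical, and extracting the scores from the page is deliberately left out because finding the right elements is part of the exercise.

```python
import sqlite3
import urllib.request


def fetch_page(url, timeout=10):
    """Return the page body as text, or None if the page cannot be retrieved."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError:
        # urllib's URLError and HTTPError (e.g. 404) are OSError subclasses,
        # as are socket timeouts, so a missing page ends up here.
        return None


def add_score_columns(conn):
    """Add the two score columns unless they already exist."""
    existing = {row[1] for row in conn.execute("PRAGMA table_info(nominees)")}
    for column in ("tomatometer", "audience_score"):
        if column not in existing:
            conn.execute(f"ALTER TABLE nominees ADD COLUMN {column} INTEGER")


def save_scores(conn, year, movie, tomatometer, audience_score):
    """Store the scores; passing None writes NULL for a missing value."""
    conn.execute(
        "UPDATE nominees SET tomatometer = ?, audience_score = ? "
        "WHERE year = ? AND movie = ?",
        (tomatometer, audience_score, year, movie),
    )
    conn.commit()


# Small in-memory demonstration with a made-up row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nominees (year INTEGER, movie TEXT)")
conn.execute("INSERT INTO nominees VALUES (2001, 'Example Film')")
add_score_columns(conn)
save_scores(conn, 2001, "Example Film", None, 87)  # no tomatometer score
```

Storing ''NULL'' keeps missing values distinguishable from a genuine score of 0; skipping such rows or logging them separately would be equally valid choices.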
- 
-===== Bonus - Crawl Again ===== 
- 
-Scrape any information from a website of your choosing. 
  
labs/crawling.txt · Last modified: 2020/08/31 21:03 (external edit)