labs:crawling

Differences

This shows you the differences between two versions of the page.

labs:crawling [2017/04/04 11:59]
sitanne
labs:crawling [2020/08/31 21:03] (current)
Line 1:
-====== Crawling ======
+====== Web Crawling ======
 Today we're going to learn how to crawl the web.
 The goal of today's lab is that you learn which elements are contained within websites and how to extract this structured information.
Line 15:
   - Generate a text file with the information from the website, each entry on a new line.
  
-===== Exercise 2 - Crawl Rottentomatoes =====
+Hints:
+  * Press ''Ctrl'' + ''Shift'' + ''C'' in Firefox to open the inspector.
+  * Have a look at the available methods in jsoup to select elements: ''getElementById'', ''getElementsByTag'', ''children'', ''select'', etc.
+===== Exercise 2 - Crawl Rotten Tomatoes =====
  
-To keep web traffic low and reduce the risk of being blacklisted, we have cloned some rottentomatoes pages and are hosting them locally. You can access the detail page through a unique url. Combine the year and movie title like this: http://10.0.0.1/m/year/title to access the local clone of the movie detail page. (Transform the movie title to lower case. Remove any apostrophe characters (') and replace spaces and slashes (/) with underline characters (_)).
+To keep web traffic low and reduce the risk of being blacklisted, we have cloned some Rotten Tomatoes pages and are hosting them locally. You can access the detail page through a unique URL. Combine the year and movie title like this: http://10.0.0.1/m/year/title to access the local clone of the movie detail page. (Transform the movie title to lower case. Remove any apostrophe characters (') and replace spaces and slashes (/) with underline characters (_)).
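
The title-to-URL rule described above can be sketched in Java as follows. This is a minimal sketch, not the lab's reference solution: the names ''MovieUrl'', ''slug'' and ''detailUrl'' are made up for illustration, and the year and title used in ''main'' are only example values.

```java
import java.util.Locale;

class MovieUrl {
    // Base address of the local clone, as given in the lab text.
    static final String BASE = "http://10.0.0.1/m/";

    // Applies the stated rule: lower case, drop apostrophes,
    // replace spaces and slashes with underscores.
    static String slug(String title) {
        return title.toLowerCase(Locale.ROOT)
                .replace("'", "")
                .replace(' ', '_')
                .replace('/', '_');
    }

    // Combines year and transformed title into the detail page URL.
    static String detailUrl(int year, String title) {
        return BASE + year + "/" + slug(title);
    }

    public static void main(String[] args) {
        // Example input only; use the year and title from your own data.
        System.out.println(detailUrl(2010, "The King's Speech"));
    }
}
```

Keeping the transformation in its own method makes it easy to check a few tricky titles (apostrophes, slashes) before requesting any pages.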
  
     - Visit any of the local movie sites. Which element contains the [[https://en.wikipedia.org/wiki/Rotten_Tomatoes#Tomatometer_critic_aggregate_score|tomatometer]] score of the movie? Which element contains the audience score?
-    - Access each of the cloned websites, extract the tomatometer and the audience score and insert them into the previously created Academy Awards database as additional columns **tomatometer** and ** audience_score **. Some movies are missing on our local server. Also, occasionally, you'll see movies that don't have a tomatometer score. Think about how you want to handle such a missing movie page or tomatometer score.
+    - Access each of the cloned websites and extract the tomatometer and the audience score. Some movies are missing on our local server. Also, occasionally, you'll see movies that don't have a tomatometer score. Think about how you want to handle such a missing movie page or tomatometer score.
     - Additionally, extract the genre and the runtime for each movie from the cloned websites.
     - Write the information about the movies into a text file, each movie on a new line.
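
One possible way to handle missing pages or scores and still write one line per movie is sketched below. It rests on assumptions that are not part of the lab: the class name ''MovieLines'', the placeholder ''NA'', the tab-separated layout, the file name ''movies.txt'' and the sample values are all hypothetical. How the raw strings are obtained from the page (for example with jsoup) is left out.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class MovieLines {
    // Placeholder for anything that could not be extracted (an assumption).
    static final String MISSING = "NA";

    // Turns text such as "94%" into a number; returns null if the
    // score is absent or not numeric.
    static Integer parseScore(String raw) {
        if (raw == null) return null;
        String digits = raw.trim().replace("%", "");
        if (digits.isEmpty()) return null;
        try {
            return Integer.valueOf(digits);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // Replaces null or blank values with the placeholder.
    static String orMissing(Object value) {
        if (value == null) return MISSING;
        String text = value.toString().trim();
        return text.isEmpty() ? MISSING : text;
    }

    // Builds one tab-separated line for a movie.
    static String toLine(String title, Integer tomatometer, Integer audience,
                         String genre, String runtime) {
        return String.join("\t", orMissing(title), orMissing(tomatometer),
                orMissing(audience), orMissing(genre), orMissing(runtime));
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = new ArrayList<>();
        // Made-up sample values, not real scores.
        lines.add(toLine("Example Movie", parseScore("94%"), parseScore("88%"), "Drama", "118 minutes"));
        // A movie whose page was missing: only the title is known.
        lines.add(toLine("Missing Movie", null, null, null, null));
        Files.write(Paths.get("movies.txt"), lines, StandardCharsets.UTF_8);
    }
}
```

Writing a placeholder instead of skipping the movie keeps the output aligned with the input list, so missing entries remain visible; dropping such movies entirely is an equally valid choice.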
labs/crawling.1491299963.txt.gz · Last modified: 2020/08/31 21:03 (external edit)