labs:crawling

Differences

This shows you the differences between two versions of the page.

labs:crawling [2017/04/04 12:00]
sitanne
labs:crawling [2020/08/31 21:03] (current)
Line 1: Line 1:
-====== Crawling ======
+====== Web Crawling ======
 Today we're going to learn how to crawl the web.
 The goal of today's lab is that you learn which elements are contained within websites and how to extract this structured information.
Line 15: Line 15:
   - Generate a text file with the information from the website, each entry on a new line.
  
+Hints:
+  * Press ''Ctrl'' + ''Shift'' + ''C'' in Firefox to open the inspector.
+  * Have a look at the available methods in jsoup to select elements: ''getElementById'', ''getElementsByTag'', ''children'', ''select'', etc.
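
The text-file step of Exercise 1 can be sketched with the Java standard library alone. This is a minimal sketch, not the lab's reference solution: the class name `EntryWriter`, the file name `entries.txt`, and the sample entries are hypothetical, and the sketch assumes the jsoup extraction has already produced a list of strings.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: writes already-extracted entries to a text file, one per line.
final class EntryWriter {

    // Files.write terminates every element with the platform line separator
    // and creates the file, or truncates an existing one.
    static void writeEntries(Path target, List<String> entries) throws IOException {
        Files.write(target, entries, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder data standing in for the values extracted from the page.
        List<String> entries = Arrays.asList("first entry", "second entry");
        writeEntries(Paths.get("entries.txt"), entries);
    }
}
```

Passing the whole list to `Files.write` keeps the "one entry per line" rule in a single place, so the extraction code only has to collect strings.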
 ===== Exercise 2 - Crawl Rotten Tomatoes =====
  
Line 20: Line 23:
  
     - Visit any of the local movie sites. Which element contains the [[https://en.wikipedia.org/wiki/Rotten_Tomatoes#Tomatometer_critic_aggregate_score|tomatometer]] score of the movie? Which element contains the audience score?
-    - Access each of the cloned websites, extract the tomatometer and the audience score and insert them into the previously created Academy Awards database as additional columns **tomatometer** and ** audience_score **. Some movies are missing on our local server. Also, occasionally, you'll see movies that don't have a tomatometer score. Think about how you want to handle such a missing movie page or tomatometer score.
+    - Access each of the cloned websites and extract the tomatometer and the audience score. Some movies are missing on our local server. Also, occasionally, you'll see movies that don't have a tomatometer score. Think about how you want to handle such a missing movie page or tomatometer score.
     - Additionally, extract the genre and the runtime for each movie from the cloned websites.
     - Write the information about the movies into a text file, each movie on a new line.
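
One possible way to handle a missing tomatometer or audience score when building the output lines is sketched below. It is only an illustration under stated assumptions: the score text is assumed to look like `85%`, the `NA` placeholder and the tab-separated layout are arbitrary choices, and the class and method names (`MovieLine`, `parseScore`, `format`) are hypothetical. Genre and runtime could be appended as further columns in the same way.

```java
import java.util.Optional;

// Hypothetical helper: turns raw score text into one output line per movie.
final class MovieLine {

    // Parses a score such as "85%" into an integer.
    // Returns an empty Optional when the text is null, blank, or not numeric.
    static Optional<Integer> parseScore(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        String digits = raw.trim().replace("%", "");
        if (digits.isEmpty()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Integer.parseInt(digits));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    // Builds a tab-separated line; a missing score is written as "NA".
    static String format(String title, String tomatometer, String audience) {
        String t = parseScore(tomatometer).map(String::valueOf).orElse("NA");
        String a = parseScore(audience).map(String::valueOf).orElse("NA");
        return title + "\t" + t + "\t" + a;
    }

    public static void main(String[] args) {
        // Placeholder title and scores, not real data.
        System.out.println(format("Some Movie", "97%", null));
    }
}
```

Writing an explicit placeholder keeps every line at the same number of columns, which makes the file easier to load later; skipping movies without a score is an equally valid choice.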
labs/crawling.1491300031.txt.gz · Last modified: 2020/08/31 21:03 (external edit)