Differences

This shows you the differences between two versions of the page.

--- labs:crawling [2017/04/04 11:59] sitanne
+++ labs:crawling [2020/08/31 21:03] (current)
@@ Line 1: / Line 1: @@
-====== Crawling ======
+====== Web Crawling ======
 Today we're going to learn how to crawl the web.
 The goal of today's lab is to learn which elements websites contain and how to extract this structured information.
@@ Line 15: / Line 15: @@
   - Generate a text file with the information from the website, each entry on a new line.
-===== Exercise 2 - Crawl Rottentomatoes =====
+Hints:
+  * Press ''Ctrl'' + ''Shift'' + ''C'' in Firefox to open the inspector.
+  * Have a look at the available methods in jsoup to select elements: ''getElementById'', ''getElementsByTag'', ''children'', ''select'', etc.
+===== Exercise 2 - Crawl Rotten Tomatoes =====
-To keep web traffic low and reduce the risk of being blacklisted, we have cloned some rottentomatoes pages and are hosting them locally. You can access the detail page through a unique url. Combine the year and movie title like this: http://10.0.0.1/m/year/title to access the local clone of the movie detail page. (Transform the movie title to lower case. Remove any apostrophe characters (') and replace spaces and slashes (/) with underline characters (_)).
+To keep web traffic low and reduce the risk of being blacklisted, we have cloned some Rotten Tomatoes pages and are hosting them locally. You can access the detail page through a unique URL. Combine the year and movie title like this: http://10.0.0.1/m/year/title to access the local clone of the movie detail page. (Transform the movie title to lower case. Remove any apostrophe characters (') and replace spaces and slashes (/) with underline characters (_)).
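The title transformation described above can be sketched in plain Java. This is only a minimal illustration, not part of the lab material: the class name ''MovieUrlSketch'', its method names, and the two example titles are assumptions. It applies exactly the three stated rules and nothing else, so titles with other special characters (colons, accents) may need extra handling.

```java
import java.util.Locale;

// Hypothetical helper for illustration; the names are not from the lab.
class MovieUrlSketch {
    // Base address of the local clone, as given in the exercise text.
    static final String BASE = "http://10.0.0.1/m/";

    // Lower-case the title, drop apostrophes, and turn spaces and
    // slashes into underscores.
    static String slug(String title) {
        return title.toLowerCase(Locale.ROOT)
                .replace("'", "")
                .replace(' ', '_')
                .replace('/', '_');
    }

    // Combine year and transformed title into the detail page URL.
    static String detailUrl(int year, String title) {
        return BASE + year + "/" + slug(title);
    }

    public static void main(String[] args) {
        // Example titles chosen only to exercise the rules.
        System.out.println(detailUrl(1993, "Schindler's List"));
        System.out.println(detailUrl(1997, "Face/Off"));
    }
}
```

With these inputs the sketch prints ''http://10.0.0.1/m/1993/schindlers_list'' and ''http://10.0.0.1/m/1997/face_off''. Whether a given page actually exists on the local server is a separate question.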
     - Visit any of the local movie sites. Which element contains the [[https://en.wikipedia.org/wiki/Rotten_Tomatoes#Tomatometer_critic_aggregate_score|tomatometer]] score of the movie? Which element contains the audience score?
-    - Access each of the cloned websites, extract the tomatometer and the audience score and insert them into the previously created Academy Awards database as additional columns **tomatometer** and ** audience_score **. Some movies are missing on our local server. Also, occasionally, you'll see movies that don't have a tomatometer score. Think about how you want to handle such a missing movie page or tomatometer score.
+    - Access each of the cloned websites and extract the tomatometer and the audience score. Some movies are missing on our local server. Also, occasionally, you'll see movies that don't have a tomatometer score. Think about how you want to handle such a missing movie page or tomatometer score.
     - Additionally, extract the genre and the runtime for each movie from the cloned websites.
     - Write the information about the movies into a text file, each movie on a new line.
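One possible way to handle missing scores and to write one movie per line is sketched below. It is an assumption-laden example rather than the expected solution: the class ''MovieLineSketch'', the tab separator, the ''NA'' placeholder, the file name ''movies.txt'', and the sample rows are all made up for illustration. Genre and runtime could be appended as further fields in the same way.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalInt;

// Hypothetical sketch; names, separator and placeholder are assumptions.
class MovieLineSketch {
    // Written in place of a score that could not be extracted.
    static final String MISSING = "NA";

    // Render a possibly absent score as text.
    static String scoreField(OptionalInt score) {
        return score.isPresent() ? Integer.toString(score.getAsInt()) : MISSING;
    }

    // Build one tab-separated line for a movie.
    static String toLine(String title, int year,
                         OptionalInt tomatometer, OptionalInt audience) {
        return String.join("\t", title, Integer.toString(year),
                scoreField(tomatometer), scoreField(audience));
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = new ArrayList<>();
        // Made-up sample rows: one complete, one without a tomatometer score.
        lines.add(toLine("Example Movie", 2000, OptionalInt.of(87), OptionalInt.of(91)));
        lines.add(toLine("Movie Without Score", 2001, OptionalInt.empty(), OptionalInt.of(55)));
        // Files.write puts each list element on its own line (UTF-8).
        Files.write(Paths.get("movies.txt"), lines);
    }
}
```

Using ''OptionalInt'' makes the "no score" case explicit in the method signature, so it cannot be confused with a real score of 0. A movie whose page is missing entirely could be written with both fields set to the placeholder, or skipped; either choice should be made deliberately.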

TIK Fachpraktikum
