This shows you the differences between two versions of the page.
| labs:crawling [2017/04/04 11:59] sitanne | labs:crawling [2020/08/31 21:03] (current) | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| - | ====== Crawling ====== | + | ====== Web Crawling ====== |
| Today we're going to learn how to crawl the web. | Today we're going to learn how to crawl the web. | ||
| The goal of today's lab is that you learn which elements are contained within websites and how to extract this structured information. | The goal of today's lab is that you learn which elements are contained within websites and how to extract this structured information. | ||
| Line 15: | Line 15: | ||
| - Generate a text file with the information from the website, each entry on a new line. | - Generate a text file with the information from the website, each entry on a new line. | ||
| - | ===== Exercise 2 - Crawl Rottentomatoes ===== | + | Hints: |
| + | * Press ''Ctrl'' + ''Shift'' + ''C'' in Firefox to open the inspector. | ||
| + | * Have a look at the available methods in jsoup to select elements: ''getElementById'', ''getElementsByTag'', ''children'', ''select'', etc. | ||
| + | ===== Exercise 2 - Crawl Rotten Tomatoes ===== | ||
| - | To keep web traffic low and reduce the risk of being blacklisted, we have cloned some rottentomatoes pages and are hosting them locally. You can access the detail page through a unique url. Combine the year and movie title like this: http://10.0.0.1/m/year/title to access the local clone of the movie detail page. (Transform the movie title to lower case. Remove any apostrophe characters (') and replace spaces and slashes (/) with underline characters (_)). | + | To keep web traffic low and reduce the risk of being blacklisted, we have cloned some Rotten Tomatoes pages and are hosting them locally. You can access each detail page through a unique URL. Combine the year and movie title like this: http://10.0.0.1/m/year/title to access the local clone of the movie detail page. (Transform the movie title to lowercase, remove any apostrophe characters ('), and replace spaces and slashes (/) with underscores (_).) |
| - Visit any of the local movie sites. Which element contains the [[https://en.wikipedia.org/wiki/Rotten_Tomatoes#Tomatometer_critic_aggregate_score|tomatometer]] score of the movie? Which element contains the audience score? | - Visit any of the local movie sites. Which element contains the [[https://en.wikipedia.org/wiki/Rotten_Tomatoes#Tomatometer_critic_aggregate_score|tomatometer]] score of the movie? Which element contains the audience score? | ||
| - | - Access each of the cloned websites, extract the tomatometer and the audience score and insert them into the previously created Academy Awards database as additional columns **tomatometer** and ** audience_score **. Some movies are missing on our local server. Also, occasionally, you'll see movies that don't have a tomatometer score. Think about how you want to handle such a missing movie page or tomatometer score. | + | - Access each of the cloned websites and extract the tomatometer and the audience score. Some movies are missing on our local server. Also, occasionally, you'll see movies that don't have a tomatometer score. Think about how you want to handle such a missing movie page or tomatometer score. |
| - Additionally, extract the genre and the runtime for each movie from the cloned websites. | - Additionally, extract the genre and the runtime for each movie from the cloned websites. | ||
| - Write the information about the movies into a text file, each movie on a new line. | - Write the information about the movies into a text file, each movie on a new line. | ||
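
The URL rule from Exercise 2 (lowercase the title, drop apostrophes, turn spaces and slashes into underscores, then append it to the year) can be sketched in plain Java as follows. This is only a minimal sketch: the class name ''MovieUrls'' and the method names ''slug'' and ''detailUrl'' are hypothetical, the sample titles are illustrative inputs that may not exist on the local server, and it assumes titles use the plain ASCII apostrophe (') mentioned in the exercise text.

```java
import java.util.Locale;

// Hypothetical helper for building detail-page URLs on the local clone.
class MovieUrls {
    // Base address taken from the exercise text.
    static final String BASE = "http://10.0.0.1/m/";

    // Applies the rule from the exercise: lowercase, remove apostrophes,
    // replace spaces and slashes with underscores.
    static String slug(String title) {
        return title.toLowerCase(Locale.ROOT)
                .replace("'", "")
                .replace(' ', '_')
                .replace('/', '_');
    }

    // Combines year and normalized title into http://10.0.0.1/m/year/title.
    static String detailUrl(int year, String title) {
        return BASE + year + "/" + slug(title);
    }

    public static void main(String[] args) {
        // Sample inputs only; these pages are not guaranteed to be hosted.
        System.out.println(detailUrl(1994, "Forrest Gump"));
        System.out.println(detailUrl(1997, "Face/Off"));
    }
}
```

The resulting string could then be passed to jsoup's ''Jsoup.connect(url).get()''. Since some movies are missing on the local server, that call may throw an exception for a missing page, so it is probably worth wrapping it in a ''try''/''catch'' and deciding there how to record the missing entry.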