Name that Movie
Today we're going to learn how to crawl the web and store the collected information in a database.
The first goal of today's lab is that you learn which elements are contained within websites and how to extract this structured information. The second goal is that you learn how to formulate and use Structured Query Language (SQL) statements to answer high level questions about data. In today's course we will work with SQLite3 engines.
It's absolutely up to you which programming language you want to use for this lab. However, we suggest you use Python due to it's simplicity and because of the readily available crawling and database connection libraries.
Before you start, you may want to read up on the basics of HTML. Additionally, a useful resource that deals with crawling structured content from a website can be found here. This specific guide was written for Python, but similar tools exist for other programming languages as well. If you are new to SQL, you may want to check out this useful collection of online resources.
Exercise 1 - Crawl Academy Awards for Best Actor/Actress
We've prepared a website that shows a table with all actors and actresses who have been nominated for an Academy Award. Familiarize yourself with the page's source code by using the source inspector of your browser and solve the following exercises. If you need help and cannot google a solution, feel free to ask the assistants.
Extract all relevant entries from the academy award nominees table on the linked website.
Design a database schema and create a sqlite database with the collected data. Make sure you give your database the columns:
year ,
event ,
movie,
actor,
role and
won. Be careful to
escape all strings when adding rows to the database.
Exercise 2 - Query the Academy Awards Database
Answer the following questions by querying the database you've created in Exercise 1. If you need help and cannot google a solution, feel free to ask the assistants.
Which actor and which actress have won the first ever Academy Award for Best Actor/Actress (in our dataset)?
Which actor and which actress have won the most Academy Awards?
Which actor and which actress have been nominated for the most Academy Awards?
Which actor and which actress have received the most nominations without winning a single Academy Award?
List the shortest and the longest movie title you've collected.
Exercise 3 - Crawl Rottentomatoes
To keep web traffic low and reduce the risk of being blacklisted, we have cloned some rottentomatoes pages and are hosting them locally. You can access the detail page through a unique url. Combine the year and movie title like this: http://10.0.0.1/m/year/title to access the local clone of the movie detail page. (Transform the movie title to lower case. Remove any apostrophe characters (') and replace spaces and backslashes (/) with underline characters (_)).
Visit any of the local movie sites. Which element contains the
tomatometer score of the movie? Which element contains the audience score?
Access each of the cloned websites, extract the tomatometer and the audience score and insert them into the previously created Academy Awards database as additional columns tomatometer and audience_score . Some movies are missing on our local server. Also, occasionally, you'll see movies that don't have a tomatometer score. Think about how you want to handle such a missing movie page or tomatometer score.
Exercise 4 - Query the IMDB Database
We've collected some of information about movies from various sources and have compiled a database with a number of tables. You can find the sqlite here. Familiarize yourself with the schema and contents and answer the following questions. If you need help and cannot google a solution, feel free to ask the assistants.
The creator of the database was sloppy and accidentally entered some movies twice. How can you find out which? Remove them from the database! Make sure to also remove the dependent foreign key constraints from other tables.
How many of the movies you crawled in the first exercise are already in the IMDB db? Which are missing?
How many of the actors you crawled in the first exercise are already in the IMDB db? Do you notice a problem?
What is the most frequently occurring movie keyword?
Choose 5 keywords and find out how many movies match with each.
Display the top 5 most frequent movie keywords and how often they occur.
List all directors who have produced more than 10 movies yet still retain an average tomatometer score of over 90% over all their movies combined.
List all movies that reference 'The Matrix'?
List all the keywords that describe 'The Matrix'.
Which keywords do the movies 'The Matrix' and 'The Matrix Reloaded' have in common?
List all actors who appear in 'The Matrix' along with the total number of movies they appear in including the tomatometer score.
Which actor or actress who starred in at least 3 movies since 2000 has the overall best or worst average tomatometer?
List all occurring tomatometer values, how many times they're seen in the movie table and calculate the percentage of occurrence with respect to all table rows.
Exercise 5 - Plot some things
In this exercise you will learn how to plot some things. You may use any software to accomplish the following tasks. If you're unsure what to do or need help and cannot google a solution, feel free to ask the assistants.
Use a plot to decide upon the truth of the following statement: “The movie title length is a direct indicator of the tomatometer score of the movie”
Plot Robert De Niro's career trajectory visualizing the year on the x-axis and the average tomatometer score on the y-axis.
Plot the average tomatometer score over all movies by year.
Plot the average tomatometer score per year of the top five actors and the top five actresses. (Rank the actors according to their overall average tomatometer score. Add a legend to the plot with the names of the actors)
Bonus - Crawl Again
Scrape any information from a website of your choosing.