import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from io import StringIO
import re
import json
import time
from plotnine import *
import warnings
warnings.filterwarnings('ignore')

Lab 05 - APIs and Web Scraping
Learning Goals
- Use a real-world API to make queries and process JSON data
- Handle API pagination to fetch large datasets
- Use web scraping to extract tables from Wikipedia
- Use regular expressions to clean scraped data
- Merge datasets from different sources
- Practice your GitHub skills
Download lab .qmd and dataset here
Deliverables
Upload your html and qmd to Quercus by 11:59 pm Wednesday February 4th, 2026.
Lab Description
In this lab, we will practice using APIs and web scraping to obtain and combine movie data from different sources. We will use The Movie Database (TMDB) API to fetch movie information and scrape the Wikipedia page on Academy Award-winning films to get Oscar data, just like in lecture!
Before starting: Sign up for a free TMDB API key at themoviedb.org. You’ll need it for Question 1.
Question 1: Fetching movie data from the TMDB API
The TMDB API allows us to discover movies, get details, and search for films. The base URL is:
https://api.themoviedb.org/3
1a. Set up and discover movies
Replace "YOUR_API_KEY" with your actual TMDB API key. Then use the /discover/movie endpoint to find top-grossing movies.
Refer to the TMDB Discover API docs and fill in the query parameters below:
# TMDB API configuration
BASE_URL = "https://api.themoviedb.org/3"
API_KEY = "YOUR_API_KEY" # Replace with your API key!
# Discover top-grossing movies
url = f"{BASE_URL}/discover/movie"
# YOUR CODE HERE: Fill in the query parameters
# - Sort by revenue in descending order
# - Start with page 1
params = {
    "api_key": API_KEY,
    "sort_by": "FILL IN",
    "page": "FILL IN"
}
response = requests.get(url, params=params, timeout=30)
print(f"Status Code: {response.status_code}")
print(f"Request URL: {response.url}")

Status Code: 401
Request URL: https://api.themoviedb.org/3/discover/movie?api_key=YOUR_API_KEY&sort_by=FILL+IN&page=FILL+IN
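The 401 above is expected while the placeholder key and `"FILL IN"` values are in place. As a reference point, a minimal sketch of the filled-in parameters might look like the following (the `revenue.desc` sort option comes from the TMDB Discover docs; the request itself is left commented out so nothing fires without a real key):

```python
def build_discover_params(api_key, page=1):
    """Query parameters for /discover/movie, top-grossing movies first."""
    return {
        "api_key": api_key,
        "sort_by": "revenue.desc",  # TMDB sort options use field.direction
        "page": page,               # TMDB pages start at 1
    }

params = build_discover_params("YOUR_API_KEY")
# With a real key, this returns status 200:
# response = requests.get(f"{BASE_URL}/discover/movie", params=params, timeout=30)
```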
1b. Examine the JSON response
The API returns a JSON object. Parse it and explore its structure.
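Before working with the live response, it can help to see the shape of the JSON. Here is a mocked response (all numbers made up) with the same fields the real `data` dict will have:

```python
# Mocked /discover/movie response illustrating the fields you'll read:
data = {
    "page": 1,
    "total_results": 1000,   # made-up value
    "total_pages": 50,       # made-up value
    "results": [{"id": 12345, "title": "Example Movie"}],  # 20 per page in practice
}

print(f"Total results: {data.get('total_results')}")
print(f"Total pages: {data.get('total_pages')}")
print(f"Movies on this page: {len(data.get('results', []))}")
```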
data = response.json()
# YOUR CODE HERE:
# 1. Print the total number of results (hint: data.get('total_results'))
# 2. Print the total number of pages
# 3. Print how many movies are on this page (hint: len(data.get('results', [])))1c. Fetch movie details and build a DataFrame
The /discover/movie endpoint returns basic info (id, title, popularity, vote_average), but not budget or revenue. To get those, we need to call the /movie/{id} endpoint for each movie.
First, collect movie IDs from multiple pages of discover results. Then fetch details for each movie. Build a DataFrame with the following columns: movie_id, title, release_date, popularity, revenue, budget, vote_average.
Important: Use time.sleep(0.1) between requests to respect rate limits.
Step 1: Get the movie IDs
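One way Step 1 could look is sketched below. The request loop is guarded so no requests fire with the placeholder key; the page arithmetic alone (100 movies at 20 per page) is enough to check the setup:

```python
import math
import time

BASE_URL = "https://api.themoviedb.org/3"
API_KEY = "YOUR_API_KEY"  # replace with your key

MOVIES_COUNT = 100
MOVIES_PER_PAGE = 20  # the discover endpoint returns 20 movies per page
n_pages = math.ceil(MOVIES_COUNT / MOVIES_PER_PAGE)

all_movie_ids = []
if API_KEY != "YOUR_API_KEY":  # guard: skip network calls without a real key
    import requests
    for page in range(1, n_pages + 1):
        params = {"api_key": API_KEY, "sort_by": "revenue.desc", "page": page}
        resp = requests.get(f"{BASE_URL}/discover/movie", params=params, timeout=30)
        for movie in resp.json().get("results", []):
            all_movie_ids.append(movie["id"])
        time.sleep(0.1)  # respect TMDB rate limits

print(f"Collected {len(all_movie_ids)} movie IDs across {n_pages} pages")
```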
import math
# YOUR CODE HERE:
# The discover endpoint returns 20 movies per page.
# Calculate how many pages you need if you want to fetch 100 movies.
# Then loop through the pages, collecting movie IDs.
# MOVIES_COUNT = 100
# MOVIES_PER_PAGE = 20
# n_pages = math.ceil(MOVIES_COUNT / MOVIES_PER_PAGE)
# all_movie_ids = []
# for page in range(1, n_pages+1):
#     params = {
#         "api_key": API_KEY,
#         "sort_by": "revenue.desc",
#         "page": page
#     }
#     # 1. Make the GET request
#     # 2. Parse the JSON response
#     # 3. Extract the 'id' from each movie in results and append to all_movie_ids
#     # 4. Sleep for 0.1 seconds between requests
#     pass
# print(f"Collected {len(all_movie_ids)} movie IDs")

Step 2: For each movie ID, fetch details
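A sketch of the detail-fetch loop, assuming `all_movie_ids` was filled in Step 1 (here it is initialized empty and the loop is guarded, so the block is safe to run without a key):

```python
import pandas as pd

BASE_URL = "https://api.themoviedb.org/3"
API_KEY = "YOUR_API_KEY"  # replace with your key
all_movie_ids = []        # in the real lab, this comes from Step 1

detailed_movies = []
if API_KEY != "YOUR_API_KEY":  # guard: only fetch with a real key
    import requests
    import time
    for i, mid in enumerate(all_movie_ids):
        detail = requests.get(f"{BASE_URL}/movie/{mid}",
                              params={"api_key": API_KEY}, timeout=30).json()
        detailed_movies.append({
            "movie_id": detail.get("id"),
            "title": detail.get("title"),
            "release_date": detail.get("release_date"),
            "popularity": detail.get("popularity"),
            "revenue": detail.get("revenue"),
            "budget": detail.get("budget"),
            "vote_average": detail.get("vote_average"),
            "genres": [g["name"] for g in detail.get("genres", [])],
        })
        time.sleep(0.25)  # rate-limit courtesy, per the hint above
        if (i + 1) % 20 == 0:
            print(f"Fetched {i + 1} movies")

df_movies = pd.DataFrame(detailed_movies)
print(f"Fetched details for {len(df_movies)} movies")
```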
# YOUR CODE HERE:
# Loop through all_movie_ids and for each movie:
# 1. GET f"{BASE_URL}/movie/{mid}" with api_key param -> parse JSON
# 2. Append a dict with: movie_id, title, release_date, popularity,
# revenue, budget, vote_average, genres
# Hint: use detail.get('key') to safely extract each field
# 3. Sleep 0.25 seconds between requests
# 4. Optionally print progress every 20 movies
# detailed_movies = []
# for i, mid in enumerate(all_movie_ids):
# ...
#
# df_movies = pd.DataFrame(detailed_movies)
# print(f"Fetched details for {len(df_movies)} movies")
# df_movies.head(10)

To save time, read in the saved dataset of 10,000 movies and do a summary so we know what to clean.
# df_movies = pd.read_csv("movies.csv")
# df_movies.describe()

1d. Clean the data
Examine the data with df_movies.describe(). Look for variables with unrealistic values (e.g., budget of 0, extreme popularity values). Clean as needed and extract the year from release_date.
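The cleaning steps can be sketched on a tiny made-up frame standing in for `df_movies` (the popularity cutoff of 1000 below is an arbitrary illustration; choose yours by looking at `describe()`):

```python
import pandas as pd

# Made-up rows standing in for df_movies:
df_movies = pd.DataFrame({
    "title": ["A", "B", "C"],
    "budget": [200_000_000, 0, 150_000_000],  # budget of 0 means missing data
    "popularity": [45.0, 3000.0, 60.0],       # 3000 is an implausible outlier
    "release_date": ["2009-12-18", "1997-11-18", "2019-04-24"],
})

df_movies = df_movies[df_movies["budget"] > 0]         # drop zero budgets
df_movies = df_movies[df_movies["popularity"] < 1000]  # drop extreme popularity (arbitrary cutoff)
df_movies["year"] = pd.to_datetime(df_movies["release_date"]).dt.year

print(f"{len(df_movies)} movies remain")
```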
# YOUR CODE HERE:
# 1. Examine the data with describe() to identify issues
# 2. Remove movies with unrealistic values (e.g., budget == 0, extreme popularity)
# 3. Extract year from release_date
# df_movies['year'] = pd.to_datetime(df_movies['release_date']).dt.year
# 4. Print how many movies remain

Question 2: Scraping Oscar data from Wikipedia
Now let’s scrape the List of Academy Award-winning films from Wikipedia, just like in lecture.
2a. Fetch and parse the Wikipedia page
Look at what tables are on this website.
OSCARS_URL = "https://en.wikipedia.org/wiki/List_of_Academy_Award%E2%80%93winning_films"
# Include headers to identify ourselves
HEADERS = {
    "User-Agent": "jsc370-class-project/1.0 (educational use)",
    "Accept-Language": "en-US,en;q=0.9",
}
# Fetch the page
oscars_request = requests.get(OSCARS_URL, headers=HEADERS, timeout=30)
print(f"Status code: {oscars_request.status_code}")
# Parse all tables
osc_tables = pd.read_html(StringIO(oscars_request.text))
print(f"Found {len(osc_tables)} tables")
# Examine the tables to find the right one
for i, table in enumerate(osc_tables):
    print(f"\nTable {i}: {table.shape[0]} rows x {table.shape[1]} cols")
    print(f"  Columns: {table.columns.tolist()}")

Status code: 200
Found 2 tables
Table 0: 1387 rows x 4 cols
Columns: ['Film', 'Year', 'Awards', 'Nominations']
Table 1: 10 rows x 2 cols
Columns: ['vteAcademy Awards lists', 'vteAcademy Awards lists.1']
2b. Extract the main Oscar winners table
Extract the first table on the website, summarize the data you get.
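Based on the printout above, table 0 is the one with the Film/Year/Awards/Nominations columns. A stand-in sketch of the selection (the mini tables below are made up to mirror that output):

```python
import pandas as pd

# Stand-in for the osc_tables list returned by pd.read_html:
osc_tables = [
    pd.DataFrame({"Film": ["Wings"], "Year": ["1927/28"],
                  "Awards": ["2"], "Nominations": ["2"]}),
    pd.DataFrame({"vteAcademy Awards lists": ["..."],
                  "vteAcademy Awards lists.1": ["..."]}),
]

# Table 0 has the expected columns, so that's the index to use:
osc = osc_tables[0]
print(f"Shape: {osc.shape}")
```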
# YOUR CODE HERE: Select the correct table index based on the output above
# The main table should have columns like: Film, Year, Awards, Nominations
# osc = osc_tables[???]
# print(f"Shape: {osc.shape}")
# osc.head(10)

Check variable types to see if we need to clean.
# YOUR CODE HERE: Check the dtypes of the Oscar table
# Hint: osc.dtypes

2c. Clean the columns using regular expressions
Make Year, Awards, and Nominations numeric. You may need to remove footnotes like [1] and handle strings like 2020/21.
Use regex to extract just the 4-digit Year and integers for Awards and Nominations.
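The four steps can be sketched on a small made-up sample with the messiness described above (footnotes like `[2]`, split years like `2020/21`):

```python
import pandas as pd

# Made-up sample mimicking the scraped table's quirks:
osc = pd.DataFrame({
    "Year": ["1927/28[1]", "2020/21", "1994"],
    "Awards": ["2[2]", "4", "1"],
    "Nominations": ["2", "10[10]", "3"],
})

for col in ["Year", "Awards", "Nominations"]:
    cleaned = osc[col].astype(str).str.replace(r"\[\d+\]", "", regex=True)  # drop footnotes
    osc[col] = pd.to_numeric(cleaned.str.extract(r"(\d+)")[0])              # first number only

print(osc.dtypes)
```

For `Year`, extracting the first run of digits keeps the 4-digit year from strings like `2020/21`.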
# YOUR CODE HERE:
# For each of Year, Awards, Nominations:
# 1. Convert to string
# 2. Remove wiki-style footnotes like [2], [10] using .str.replace()
# 3. Extract the first number using .str.extract()
# 4. Convert to numeric with pd.to_numeric()
# Verify the dtypes after cleaning

2d. Explore the Oscar data
Do some summary statistics on the scraped Oscar data.
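The kinds of summaries asked for, sketched on a few made-up rows in place of the cleaned `osc` table:

```python
import pandas as pd

# Small made-up stand-in for the cleaned Oscar table:
osc = pd.DataFrame({
    "Film": ["Titanic", "Ben-Hur", "Parasite"],
    "Year": [1997, 1959, 2019],
    "Awards": [11, 11, 4],
    "Nominations": [14, 12, 6],
})

print(osc.describe())
print(f"Films: {len(osc)}")
print(f"Year range: {osc['Year'].min()}-{osc['Year'].max()}")
print(f"Most awards: {osc.loc[osc['Awards'].idxmax(), 'Film']}")
print(f"Mean nominations: {osc['Nominations'].mean():.1f}")
```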
# YOUR CODE HERE:
# 1. Do a general summary with describe()
# 2. How many films are in the Oscar dataset?
# 3. What is the range of years?
# 4. What film has the most Awards?
# 5. What is the average number of nominations?

Question 3: Merging TMDB movies with Oscar data
Now let’s combine the TMDB movie data from Question 1 with the scraped Oscar data from Question 2. We want to do a left join that keeps all TMDB movies and adds Oscar award information where available.
3a. Prepare both datasets for merging
Before merging, make sure the title and year columns are clean and compatible across both datasets.
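Typical mismatches and their fixes, sketched on made-up rows (stray whitespace in titles, float years from a datetime conversion):

```python
import pandas as pd

# Made-up rows illustrating mismatches the merge keys can have:
df_movies = pd.DataFrame({"title": ["Titanic ", "Avatar"], "year": [1997.0, 2009.0]})
osc = pd.DataFrame({"Film": [" Titanic"], "Year": [1997]})

df_movies["title"] = df_movies["title"].str.strip()  # stray whitespace breaks joins
df_movies["year"] = df_movies["year"].astype(int)    # match osc's integer Year dtype
osc["Film"] = osc["Film"].astype(str).str.strip()

print(df_movies[["title", "year"]].head())
```

If `release_date` has missing values, the float-to-int cast needs pandas' nullable `"Int64"` dtype instead of `int`.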
# YOUR CODE HERE:
# 1. Make sure df_movies has a clean year column
# 2. Check the columns we'll merge on
# 3. Preview what we're merging (number of rows in each)
# 4. Look at a few titles from each to check formatting

3b. Merge the datasets
Perform a left join to keep all TMDB movies and add the Awards and Nominations columns from the Oscar data. Fill missing awards with 0 (movies that didn’t win any Oscars).
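The join, fill, and indicator steps can be sketched on two made-up stand-in tables:

```python
import pandas as pd

# Made-up stand-ins for the two tables:
df_movies = pd.DataFrame({"title": ["Titanic", "Avatar"], "year": [1997, 2009]})
osc = pd.DataFrame({"Film": ["Titanic"], "Year": [1997],
                    "Awards": [11], "Nominations": [14]})

# Left join keeps every TMDB movie; Oscar columns are NaN where no match exists:
merged = df_movies.merge(osc, how="left",
                         left_on=["title", "year"], right_on=["Film", "Year"])
merged[["Awards", "Nominations"]] = merged[["Awards", "Nominations"]].fillna(0)
merged["won_oscar"] = merged["Awards"] > 0

print(f"{len(merged)} movies, {int(merged['won_oscar'].sum())} Oscar winners")
```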
# YOUR CODE HERE:
# 1. Left join df_movies with osc on title and year
# Hint: df_movies has 'title' and 'year', osc has 'Film' and 'Year'
# 2. Fill missing Awards and Nominations with 0
# 3. Create an indicator for whether the movie won any Oscars
# 4. Print summary of the merge (total movies, Oscar winners, non-winners)

3c. Examine the merged data
# YOUR CODE HERE:
# 1. Show the Oscar-winning movies in the dataset
# 2. Compare average budget and revenue for Oscar winners vs non-winners
# 3. Which Oscar-winning movie had the highest revenue? Highest budget?

Question 4: Visualize the results
4a. Compare revenue for Oscar winners vs non-winners
# YOUR CODE HERE: Create a boxplot with plotnine comparing TMDB revenue
# for Oscar winners vs non-winners

Summary:
4b. Scatter plot of budget vs revenue colored by Oscar status
TMDB provides budget data that Wikipedia doesn’t have, so we can explore the budget-revenue relationship.
# YOUR CODE HERE: Create a scatter plot with plotnine
# - x-axis: budget
# - y-axis: revenue
# - Color by won_oscar
# - Consider adding geom_abline(slope=1, intercept=0) to show the break-even line
# - Add labels and title

Summary:
4c. Return on investment
Calculate the return on investment (revenue / budget) and compare Oscar winners vs non-winners.
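The ROI calculation and group comparison, sketched on the same kind of invented stand-in data (the plot itself is left to you):

```python
import pandas as pd

# Invented numbers standing in for the merged data:
merged = pd.DataFrame({
    "budget": [200e6, 20e6, 150e6],
    "revenue": [2.2e9, 60e6, 100e6],
    "won_oscar": [True, True, False],
})

merged["roi"] = merged["revenue"] / merged["budget"]  # e.g. 11x for the first row
print(merged.groupby("won_oscar")["roi"].median())
```

Medians are safer than means here, since a few blockbuster ROIs can dominate an average.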
# YOUR CODE HERE:
# 1. Calculate ROI: merged['roi'] = merged['revenue'] / merged['budget']
# 2. Create a boxplot or histogram comparing ROI for Oscar winners vs non-winners
# 3. What do you observe?