import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from io import StringIO
import re
import json
import time
from plotnine import *
import warnings
warnings.filterwarnings('ignore')

Lab 05 - APIs and Web Scraping
Learning Goals
- Use a real-world API to make queries and process JSON data
- Handle API pagination to fetch large datasets
- Use web scraping to extract tables from Wikipedia
- Use regular expressions to clean scraped data
- Merge datasets from different sources
- Practice your GitHub skills
Download lab .qmd and dataset here
Deliverables
Upload your html and qmd to Quercus by 11:59 pm Wednesday February 4th, 2026.
Lab Description
In this lab, we will practice using APIs and web scraping to obtain and combine movie data from different sources. We will use The Movie Database (TMDB) API to fetch movie information and scrape the Wikipedia page on Academy Award-winning films to get Oscar data, just like in lecture!
Before starting: Sign up for a free TMDB API key at themoviedb.org. You’ll need it for Question 1.
Question 1: Fetching movie data from the TMDB API
The TMDB API allows us to discover movies, get details, and search for films. The base URL is:
https://api.themoviedb.org/3
1a. Set up and discover movies
Replace "YOUR_API_KEY" with your actual TMDB API key. Then use the /discover/movie endpoint to find top-grossing movies.
Refer to the TMDB Discover API docs and fill in the query parameters below:
# TMDB API configuration
BASE_URL = "https://api.themoviedb.org/3"
API_KEY = "YOUR_API_KEY" # Replace with your API key!
# Discover top-grossing movies
url = f"{BASE_URL}/discover/movie"
# YOUR CODE HERE: Fill in the query parameters
# - Sort by revenue in descending order
# - Start with page 1
params = {
    "api_key": API_KEY,
    "sort_by": "FILL IN",
    "page": "FILL IN"
}
response = requests.get(url, params=params, timeout=30)
print(f"Status Code: {response.status_code}")
print(f"Request URL: {response.url}")

Status Code: 401
Request URL: https://api.themoviedb.org/3/discover/movie?api_key=YOUR_API_KEY&sort_by=FILL+IN&page=FILL+IN
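The 401 above is expected while the placeholder key and `"FILL IN"` values are in place. As a reference point, a minimal sketch of the filled-in parameters might look like the following (the `revenue.desc` sort option comes from the TMDB Discover docs; the request itself is left commented out so nothing fires without a real key):

```python
def build_discover_params(api_key, page=1):
    """Query parameters for /discover/movie, top-grossing movies first."""
    return {
        "api_key": api_key,
        "sort_by": "revenue.desc",  # TMDB sort options use field.direction
        "page": page,               # TMDB pages start at 1
    }

params = build_discover_params("YOUR_API_KEY")
# With a real key, this returns status 200:
# response = requests.get(f"{BASE_URL}/discover/movie", params=params, timeout=30)
```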
1b. Examine the JSON response
The API returns a JSON object. Parse it and explore its structure.
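Before working with the live response, it can help to see the shape of the JSON. Here is a mocked response (all numbers made up) with the same fields the real `data` dict will have:

```python
# Mocked /discover/movie response illustrating the fields you'll read:
data = {
    "page": 1,
    "total_results": 1000,   # made-up value
    "total_pages": 50,       # made-up value
    "results": [{"id": 12345, "title": "Example Movie"}],  # 20 per page in practice
}

print(f"Total results: {data.get('total_results')}")
print(f"Total pages: {data.get('total_pages')}")
print(f"Movies on this page: {len(data.get('results', []))}")
```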
data = response.json()
# YOUR CODE HERE:
# 1. Print the total number of results (hint: data.get('total_results'))
# 2. Print the total number of pages
# 3. Print how many movies are on this page (hint: len(data.get('results', [])))1c. Fetch movie details and build a DataFrame
The /discover/movie endpoint returns basic info (id, title, popularity, vote_average), but not budget or revenue. To get those, we need to call the /movie/{id} endpoint for each movie.
First, collect movie IDs from multiple pages of discover results. Then fetch details for each movie. Build a DataFrame with the following columns: movie_id, title, release_date, popularity, revenue, budget, vote_average.
Important: Use time.sleep(0.1) between requests to respect rate limits.
Step 1: Get the movie IDs
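One way Step 1 could look is sketched below. The request loop is guarded so no requests fire with the placeholder key; the page arithmetic alone (100 movies at 20 per page) is enough to check the setup:

```python
import math
import time

BASE_URL = "https://api.themoviedb.org/3"
API_KEY = "YOUR_API_KEY"  # replace with your key

MOVIES_COUNT = 100
MOVIES_PER_PAGE = 20  # the discover endpoint returns 20 movies per page
n_pages = math.ceil(MOVIES_COUNT / MOVIES_PER_PAGE)

all_movie_ids = []
if API_KEY != "YOUR_API_KEY":  # guard: skip network calls without a real key
    import requests
    for page in range(1, n_pages + 1):
        params = {"api_key": API_KEY, "sort_by": "revenue.desc", "page": page}
        resp = requests.get(f"{BASE_URL}/discover/movie", params=params, timeout=30)
        for movie in resp.json().get("results", []):
            all_movie_ids.append(movie["id"])
        time.sleep(0.1)  # respect TMDB rate limits

print(f"Collected {len(all_movie_ids)} movie IDs across {n_pages} pages")
```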
import math
# YOUR CODE HERE:
# The discover endpoint returns 20 movies per page.
# Calculate how many pages you need if you want to fetch 100 movies.
# Then loop through the pages, collecting movie IDs.
# MOVIES_COUNT = 100
# MOVIES_PER_PAGE = 20
# n_pages = math.ceil(MOVIES_COUNT / MOVIES_PER_PAGE)
# all_movie_ids = []
# for page in range(1, n_pages+1):
#     params = {
#         "api_key": API_KEY,
#         "sort_by": "revenue.desc",
#         "page": page
#     }
#     # 1. Make the GET request
#     # 2. Parse the JSON response
#     # 3. Extract the 'id' from each movie in results and append to all_movie_ids
#     # 4. Sleep for 0.1 seconds between requests
#     pass
# print(f"Collected {len(all_movie_ids)} movie IDs")

Step 2: For each movie ID, fetch details
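A sketch of the detail-fetch loop, assuming `all_movie_ids` was filled in Step 1 (here it is initialized empty and the loop is guarded, so the block is safe to run without a key):

```python
import pandas as pd

BASE_URL = "https://api.themoviedb.org/3"
API_KEY = "YOUR_API_KEY"  # replace with your key
all_movie_ids = []        # in the real lab, this comes from Step 1

detailed_movies = []
if API_KEY != "YOUR_API_KEY":  # guard: only fetch with a real key
    import requests
    import time
    for i, mid in enumerate(all_movie_ids):
        detail = requests.get(f"{BASE_URL}/movie/{mid}",
                              params={"api_key": API_KEY}, timeout=30).json()
        detailed_movies.append({
            "movie_id": detail.get("id"),
            "title": detail.get("title"),
            "release_date": detail.get("release_date"),
            "popularity": detail.get("popularity"),
            "revenue": detail.get("revenue"),
            "budget": detail.get("budget"),
            "vote_average": detail.get("vote_average"),
            "genres": [g["name"] for g in detail.get("genres", [])],
        })
        time.sleep(0.25)  # rate-limit courtesy, per the hint above
        if (i + 1) % 20 == 0:
            print(f"Fetched {i + 1} movies")

df_movies = pd.DataFrame(detailed_movies)
print(f"Fetched details for {len(df_movies)} movies")
```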
# YOUR CODE HERE:
# Loop through all_movie_ids and for each movie:
# 1. GET f"{BASE_URL}/movie/{mid}" with api_key param -> parse JSON
# 2. Append a dict with: movie_id, title, release_date, popularity,
# revenue, budget, vote_average, genres
# Hint: use detail.get('key') to safely extract each field
# 3. Sleep 0.25 seconds between requests
# 4. Optionally print progress every 20 movies
# detailed_movies = []
# for i, mid in enumerate(all_movie_ids):
# ...
#
# df_movies = pd.DataFrame(detailed_movies)
# print(f"Fetched details for {len(df_movies)} movies")
# df_movies.head(10)

To save time, read in the saved dataset of 10,000 movies and do a summary so we know what to clean.
# df_movies = pd.read_csv("movies.csv")
# df_movies.describe()

1d. Clean the data
Examine the data with df_movies.describe(). Look for variables with unrealistic values (e.g., budget of 0, extreme popularity values). Clean as needed and extract the year from release_date.
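The cleaning steps can be sketched on a tiny made-up frame standing in for `df_movies` (the popularity cutoff of 1000 below is an arbitrary illustration; choose yours by looking at `describe()`):

```python
import pandas as pd

# Made-up rows standing in for df_movies:
df_movies = pd.DataFrame({
    "title": ["A", "B", "C"],
    "budget": [200_000_000, 0, 150_000_000],  # budget of 0 means missing data
    "popularity": [45.0, 3000.0, 60.0],       # 3000 is an implausible outlier
    "release_date": ["2009-12-18", "1997-11-18", "2019-04-24"],
})

df_movies = df_movies[df_movies["budget"] > 0]         # drop zero budgets
df_movies = df_movies[df_movies["popularity"] < 1000]  # drop extreme popularity (arbitrary cutoff)
df_movies["year"] = pd.to_datetime(df_movies["release_date"]).dt.year

print(f"{len(df_movies)} movies remain")
```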
# YOUR CODE HERE:
# 1. Examine the data with describe() to identify issues
# 2. Remove movies with unrealistic values (e.g., budget == 0, extreme popularity)
# 3. Extract year from release_date
# df_movies['year'] = pd.to_datetime(df_movies['release_date']).dt.year
# 4. Print how many movies remain

Question 2: Scraping Oscar data from Wikipedia
Now let’s scrape the List of Academy Award-winning films from Wikipedia, just like in lecture.
2a. Fetch and parse the Wikipedia page
Look at what tables are on this website.
OSCARS_URL = "https://en.wikipedia.org/wiki/List_of_Academy_Award%E2%80%93winning_films"
# Include headers to identify ourselves
HEADERS = {
    "User-Agent": "jsc370-class-project/1.0 (educational use)",
    "Accept-Language": "en-US,en;q=0.9",
}
# Fetch the page
oscars_request = requests.get(OSCARS_URL, headers=HEADERS, timeout=30)
print(f"Status code: {oscars_request.status_code}")
# Parse all tables
osc_tables = pd.read_html(StringIO(oscars_request.text))
print(f"Found {len(osc_tables)} tables")
# Examine the tables to find the right one
for i, table in enumerate(osc_tables):
    print(f"\nTable {i}: {table.shape[0]} rows x {table.shape[1]} cols")
    print(f"  Columns: {table.columns.tolist()}")

Status code: 200
Found 2 tables
Table 0: 1387 rows x 4 cols
Columns: ['Film', 'Year', 'Awards', 'Nominations']
Table 1: 10 rows x 2 cols
Columns: ['vteAcademy Awards lists', 'vteAcademy Awards lists.1']
2b. Extract the main Oscar winners table
Extract the first table on the website, summarize the data you get.
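Based on the printout above, table 0 is the one with the Film/Year/Awards/Nominations columns. A stand-in sketch of the selection (the mini tables below are made up to mirror that output):

```python
import pandas as pd

# Stand-in for the osc_tables list returned by pd.read_html:
osc_tables = [
    pd.DataFrame({"Film": ["Wings"], "Year": ["1927/28"],
                  "Awards": ["2"], "Nominations": ["2"]}),
    pd.DataFrame({"vteAcademy Awards lists": ["..."],
                  "vteAcademy Awards lists.1": ["..."]}),
]

# Table 0 has the expected columns, so that's the index to use:
osc = osc_tables[0]
print(f"Shape: {osc.shape}")
```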
# YOUR CODE HERE: Select the correct table index based on the output above
# The main table should have columns like: Film, Year, Awards, Nominations
# osc = osc_tables[???]
# print(f"Shape: {osc.shape}")
# osc.head(10)

Check variable types to see if we need to clean.
# YOUR CODE HERE: Check the dtypes of the Oscar table
# Hint: osc.dtypes

2c. Clean the columns using regular expressions
Make Year, Awards, and Nominations numeric. You may need to remove footnotes like [1] and handle strings like 2020/21.
Use regex to extract just the 4-digit Year and integers for Awards and Nominations.
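The four steps can be sketched on a small made-up sample with the messiness described above (footnotes like `[2]`, split years like `2020/21`):

```python
import pandas as pd

# Made-up sample mimicking the scraped table's quirks:
osc = pd.DataFrame({
    "Year": ["1927/28[1]", "2020/21", "1994"],
    "Awards": ["2[2]", "4", "1"],
    "Nominations": ["2", "10[10]", "3"],
})

for col in ["Year", "Awards", "Nominations"]:
    cleaned = osc[col].astype(str).str.replace(r"\[\d+\]", "", regex=True)  # drop footnotes
    osc[col] = pd.to_numeric(cleaned.str.extract(r"(\d+)")[0])              # first number only

print(osc.dtypes)
```

For `Year`, extracting the first run of digits keeps the 4-digit year from strings like `2020/21`.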
# YOUR CODE HERE:
# For each of Year, Awards, Nominations:
# 1. Convert to string
# 2. Remove wiki-style footnotes like [2], [10] using .str.replace()
# 3. Extract the first number using .str.extract()
# 4. Convert to numeric with pd.to_numeric()
# Verify the dtypes after cleaning

2d. Explore the Oscar data
Do some summary statistics on the scraped Oscar data.
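The kinds of summaries asked for, sketched on a few made-up rows in place of the cleaned `osc` table:

```python
import pandas as pd

# Small made-up stand-in for the cleaned Oscar table:
osc = pd.DataFrame({
    "Film": ["Titanic", "Ben-Hur", "Parasite"],
    "Year": [1997, 1959, 2019],
    "Awards": [11, 11, 4],
    "Nominations": [14, 12, 6],
})

print(osc.describe())
print(f"Films: {len(osc)}")
print(f"Year range: {osc['Year'].min()}-{osc['Year'].max()}")
print(f"Most awards: {osc.loc[osc['Awards'].idxmax(), 'Film']}")
print(f"Mean nominations: {osc['Nominations'].mean():.1f}")
```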
# YOUR CODE HERE:
# 1. Do a general summary with describe()
# 2. How many films are in the Oscar dataset?
# 3. What is the range of years?
# 4. What film has the most Awards?
# 5. What is the average number of nominations?

Question 3: Merging TMDB movies with Oscar data
Now let’s combine the TMDB movie data from Question 1 with the scraped Oscar data from Question 2. We want to do a left join that keeps all TMDB movies and adds Oscar award information where available.
3a. Prepare both datasets for merging
Before merging, make sure the title and year columns are clean and compatible across both datasets.
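Typical mismatches and their fixes, sketched on made-up rows (stray whitespace in titles, float years from a datetime conversion):

```python
import pandas as pd

# Made-up rows illustrating mismatches the merge keys can have:
df_movies = pd.DataFrame({"title": ["Titanic ", "Avatar"], "year": [1997.0, 2009.0]})
osc = pd.DataFrame({"Film": [" Titanic"], "Year": [1997]})

df_movies["title"] = df_movies["title"].str.strip()  # stray whitespace breaks joins
df_movies["year"] = df_movies["year"].astype(int)    # match osc's integer Year dtype
osc["Film"] = osc["Film"].astype(str).str.strip()

print(df_movies[["title", "year"]].head())
```

If `release_date` has missing values, the float-to-int cast needs pandas' nullable `"Int64"` dtype instead of `int`.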
# YOUR CODE HERE:
# 1. Make sure df_movies has a clean year column
# 2. Check the columns we'll merge on
# 3. Preview what we're merging (number of rows in each)
# 4. Look at a few titles from each to check formatting

3b. Merge the datasets
Perform a left join to keep all TMDB movies and add the Awards and Nominations columns from the Oscar data. Fill missing awards with 0 (movies that didn’t win any Oscars).
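The join, fill, and indicator steps can be sketched on two made-up stand-in tables:

```python
import pandas as pd

# Made-up stand-ins for the two tables:
df_movies = pd.DataFrame({"title": ["Titanic", "Avatar"], "year": [1997, 2009]})
osc = pd.DataFrame({"Film": ["Titanic"], "Year": [1997],
                    "Awards": [11], "Nominations": [14]})

# Left join keeps every TMDB movie; Oscar columns are NaN where no match exists:
merged = df_movies.merge(osc, how="left",
                         left_on=["title", "year"], right_on=["Film", "Year"])
merged[["Awards", "Nominations"]] = merged[["Awards", "Nominations"]].fillna(0)
merged["won_oscar"] = merged["Awards"] > 0

print(f"{len(merged)} movies, {int(merged['won_oscar'].sum())} Oscar winners")
```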
# YOUR CODE HERE:
# 1. Left join df_movies with osc on title and year
# Hint: df_movies has 'title' and 'year', osc has 'Film' and 'Year'
# 2. Fill missing Awards and Nominations with 0
# 3. Create an indicator for whether the movie won any Oscars
# 4. Print summary of the merge (total movies, Oscar winners, non-winners)

3c. Examine the merged data
# YOUR CODE HERE:
# 1. Show the Oscar-winning movies in the dataset
# 2. Compare average budget and revenue for Oscar winners vs non-winners
# 3. Which Oscar-winning movie had the highest revenue? Highest budget?

Question 4: Visualize the results
4a. Compare revenue for Oscar winners vs non-winners
# YOUR CODE HERE: Create a boxplot with plotnine comparing TMDB revenue
# for Oscar winners vs non-winners

Summary:
4b. Scatter plot of budget vs revenue colored by Oscar status
TMDB provides budget data that Wikipedia doesn’t have, so we can explore the budget-revenue relationship.
# YOUR CODE HERE: Create a scatter plot with plotnine
# - x-axis: budget
# - y-axis: revenue
# - Color by won_oscar
# - Consider adding geom_abline(slope=1, intercept=0) to show the break-even line
# - Add labels and title

Summary:
4c. Return on investment
Calculate the return on investment (revenue / budget) and compare Oscar winners vs non-winners.
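The ROI calculation and group comparison, sketched on the same kind of invented stand-in data (the plot itself is left to you):

```python
import pandas as pd

# Invented numbers standing in for the merged data:
merged = pd.DataFrame({
    "budget": [200e6, 20e6, 150e6],
    "revenue": [2.2e9, 60e6, 100e6],
    "won_oscar": [True, True, False],
})

merged["roi"] = merged["revenue"] / merged["budget"]  # e.g. 11x for the first row
print(merged.groupby("won_oscar")["roi"].median())
```

Medians are safer than means here, since a few blockbuster ROIs can dominate an average.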
# YOUR CODE HERE:
# 1. Calculate ROI: merged['roi'] = merged['revenue'] / merged['budget']
# 2. Create a boxplot or histogram comparing ROI for Oscar winners vs non-winners
# 3. What do you observe?