JSC 370: Data Science II

Week 5: Scraping and APIs

What is a Web API?

A Web API is an application programming interface for either a web server or a web browser. – Wikipedia

Examples:

How do APIs work?

HTTP Methods:

  • GET: Request data from a server
  • POST: Send data to a server
  • PUT: Update existing data
  • DELETE: Remove data

We’ll use the requests library in Python to interact with APIs.

import requests
response = requests.get("https://api.example.com/data")

Additional setup for APIs

In addition to requests we also need pandas (for all data science stuff!) and json

import requests
import pandas as pd
import json

Structure of a URL

URL Structure

Source: HTTP: The Protocol Every Web Developer Must Know

Structure of an API URL

https://api.themoviedb.org/3/discover/movie?api_key=abc123&sort_by=revenue.desc&page=1
\_____/\__________________/\_____________/\__________________________________________/
   |            |                 |                            |
Protocol    Base URL         Endpoint               Query Parameters
                              (Path)                  (key=value pairs)

Components:

Part Example Purpose
Protocol https:// Secure connection
Base URL api.themoviedb.org/3 API server + version
Endpoint /discover/movie Which resource to access
Query Params ?api_key=...&sort_by=... Filter/customize results

What is JSON?

JSON (JavaScript Object Notation) is the standard format for API responses.

{
  "name": "Avatar",
  "year": 2009,
  "budget": 237000000,
  "genres": ["Action", "Adventure", "Sci-Fi"],
  "awards": {
    "oscars": 3,
    "nominations": 9
  }
}

Key features:

  • Objects: { } contain key-value pairs
  • Arrays: [ ] contain ordered lists
  • Values: strings, numbers, booleans, null, objects, or arrays
  • Human-readable and easy to parse in Python with .json()

Web APIs with curl

Under-the-hood, the requests library sends HTTP requests similar to curl:

curl -X GET "https://api.github.com/users/octocat" \
     -H "Accept: application/json"

Response (JSON):

{
  "login": "octocat",
  "id": 583231,
  "type": "User",
  "name": "The Octocat",
  "company": "@github",
  "location": "San Francisco",
  "public_repos": 8,
  "followers": 16000
}

The -X GET specifies the HTTP method, -H adds headers.

HTTP Status Codes

Remember these HTTP codes:

  • 1xx: Information message
  • 2xx: Success (200 = OK)
  • 3xx: Redirection (301 = Moved Permanently)
  • 4xx: Client error (404 = Not Found)
  • 5xx: Server error (500 = Internal Server Error)

API Keys and Tokens

Sometimes APIs require authentication:

  • API Key/Token: Passed in header or URL
  • OAuth: More complex authentication flow
  • Basic Auth: Username/password

Example 1: The Movie Database API to discover popular movies and their details.

Get your free API key at: themoviedb.org

Example 2: NOAA Climate Data API (data we used in earlier weeks)

Get your token at: https://www.ncdc.noaa.gov/cdo-web/token

Web API Example 1: The Movie Database (TMDB)

Using TMDB

# TMDB API configuration
BASE_URL = "https://api.themoviedb.org/3"
API_KEY = "demo"  # Replace with your API key!

TMDB API: Discovering Movies

# Discover top-grossing movies from 2020-2024
url = f"{BASE_URL}/discover/movie"

params = {
    "api_key": API_KEY,
    "sort_by": "revenue.desc",
    "release_date.gte": "2020-01-01",
    "release_date.lte": "2024-12-31",
    "page": 1
}

response = requests.get(url, params=params, timeout=30)
print(f"Status Code: {response.status_code}")
print(f"URL: {response.url}")
Status Code: 200
URL: https://api.themoviedb.org/3/discover/movie?api_key=f6600b8ecd0876798ce46888f63faf47&sort_by=revenue.desc&release_date.gte=2020-01-01&release_date.lte=2024-12-31&page=1

TMDB API: Understanding Query Parameters

Each parameter customizes what movies we get back:

params = {
    "api_key": API_KEY,           # Authentication (required)
    "sort_by": "revenue.desc",    # Sort by revenue, descending
    "release_date.gte": "2020-01-01",  # Released on or after
    "release_date.lte": "2024-12-31",  # Released on or before
    "page": 1                     # Which page of results
}
Parameter Value Effect
sort_by revenue.desc Highest-grossing first
release_date.gte 2020-01-01 Only movies from 2020+
release_date.lte 2024-12-31 Only movies up to 2024
page 1 First 20 results

How Do We Know What Parameters Are Available?

Read the API documentation! Every API publishes docs explaining:

  • Available endpoints (URLs)
  • Required vs optional parameters
  • Authentication method
  • Response format and fields

TMDB Discover Movies API Docs

How Do We Know What Parameters Are Available?

TMDB API: Understanding the Response

response.json parses the HTTP response body as JSON and return it as normal Python objects (usually a dict or list)

data = response.json()

# The response contains pagination info and results
print(f"Total results: {data.get('total_results')}")
print(f"Total pages: {data.get('total_pages')}")
print(f"Movies on this page: {len(data.get('results', []))}")
Total results: 240444
Total pages: 12023
Movies on this page: 20

TMDB API: Extracting Movie Data

# Extract movie information from results this turns into a list
movies = []
for movie in data.get('results', []):
    movies.append({
        'movie_id': movie.get('id'),
        'title': movie.get('title'),
        'release_date': movie.get('release_date'),
        'popularity': movie.get('popularity'),
        'vote_average': movie.get('vote_average')
    })

# Convert to DataFrame
df_movies = pd.DataFrame(movies)
print(f"Found {len(df_movies)} movies")
df_movies.head(10)
Found 20 movies
movie_id title release_date popularity vote_average
0 19995 Avatar 2009-12-16 40.8368 7.600
1 299534 Avengers: Endgame 2019-04-24 16.6766 8.236
2 76600 Avatar: The Way of Water 2022-12-14 36.5326 7.600
3 597 Titanic 1997-12-18 32.0289 7.900
4 140607 Star Wars: The Force Awakens 2015-12-15 9.6842 7.252
5 299536 Avengers: Infinity War 2018-04-25 27.4076 8.234
6 634649 Spider-Man: No Way Home 2021-12-15 24.5079 7.933
7 1022789 Inside Out 2 2024-06-11 20.2485 7.546
8 420818 The Lion King 2019-07-12 8.9089 7.097
9 24428 The Avengers 2012-04-25 58.9549 7.931

TMDB API: Getting Movie Details

Each movie has additional details available that we can extract.

Here we extract one movie and see several attributes:

# Get details for a specific movie (e.g., movie_id = 550 is Fight Club)
movie_id = 550
detail_url = f"{BASE_URL}/movie/{movie_id}"

detail_response = requests.get(detail_url, params={"api_key": API_KEY}, timeout=30)
movie_details = detail_response.json()

print(f"Title: {movie_details.get('title')}")
print(f"Budget: ${movie_details.get('budget'):,}")
print(f"Revenue: ${movie_details.get('revenue'):,}")
print(f"Runtime: {movie_details.get('runtime')} minutes")
print(f"Genres: {[g['name'] for g in movie_details.get('genres', [])]}")
Title: Fight Club
Budget: $63,000,000
Revenue: $100,853,753
Runtime: 139 minutes
Genres: ['Drama', 'Thriller']

TMDB API: Fetching Multiple Pages

The discover endpoint returns 20 movies per page. To get more data, we loop through pages:

import time

# Fetch first 3 pages of results (60 movies)
all_movies = []

for page in range(1, 4):
    params = {
        "api_key": API_KEY,
        "sort_by": "revenue.desc",
        "release_date.gte": "2020-01-01",
        "page": page
    }

    resp = requests.get(f"{BASE_URL}/discover/movie", params=params, timeout=30)
    data = resp.json()

    for movie in data.get('results', []):
        all_movies.append({
            'id': movie.get('id'),
            'title': movie.get('title'),
            'popularity': movie.get('popularity')
        })

    print(f"Page {page}: fetched {len(data.get('results', []))} movies")
    time.sleep(0.25)  # Be polite: wait 250ms between requests

print(f"\nTotal movies collected: {len(all_movies)}")
Page 1: fetched 20 movies
Page 2: fetched 20 movies
Page 3: fetched 20 movies

Total movies collected: 60

Web API Example 2: HHS Health Recommendations

The Health.gov API provides demographic-specific health recommendations. This API does not require a key or token

url = "https://health.gov/myhealthfinder/api/v3/myhealthfinder.json"
params = {
    "lang": "en",
    "age": "32",
    "sex": "male",
    "tobaccoUse": 0
}
headers = {"accept": "application/json"}

response = requests.get(url, params=params, headers=headers, timeout=60)
print(f"Status Code: {response.status_code}")
Status Code: 200

HHS: Extracting Health Recommendations

data = response.json()

# Extract recommendation titles
titles = []
resources = data.get('Result', {}).get('Resources', {})

for category in resources.values():
    if isinstance(category, dict) and 'Resource' in category:
        for resource in category['Resource']:
            titles.append(resource.get('Title', 'N/A'))

print("Health Recommendations:")
for title in titles[:10]:
    print(f"  - {title}")
Health Recommendations:
  - Quit Smoking
  - Hepatitis C Screening: Questions for the Doctor
  - Protect Yourself from Seasonal Flu
  - Talk with Your Doctor About Depression
  - Get Your Blood Pressure Checked
  - Get Tested for HIV
  - Get Vaccines to Protect Your Health (Adults Ages 19 to 49 Years)
  - Drink Alcohol Only in Moderation
  - Talk with Your Doctor About Drug Misuse and Substance Use Disorder
  - Aim for a Healthy Weight

HHS: params

HHS

Why Use a Dictionary for Parameters?

Instead of this:

url = "https://api.example.com/data?age=32&sex=male&lang=en"
response = requests.get(url)

Do this:

url = "https://api.example.com/data"
params = {"age": 32, "sex": "male", "lang": "en"}
response = requests.get(url, params=params)

Benefits:

  • Automatic URL encoding (handles special characters)
  • Easier to read and modify
  • No manual string concatenation errors

Common Parameter Patterns in APIs

Parameter Type Example Purpose
Filtering ?status=active Only return matching items
Pagination ?page=2&limit=50 Control result batches
Sorting ?sort=date&order=desc Order results
Fields ?fields=name,email Select specific data
Search ?q=python Text search
Format ?format=json Response format

Timeout and Connection Options

Sometimes APIs are slow. Use the timeout parameter:

# Set connection and read timeout (in seconds)
response = requests.get(
    url,
    params=params,
    timeout=(10, 60)  # (connect timeout, read timeout)
)

# Or single timeout for both
response = requests.get(url, timeout=60)

Rate Limiting

Many APIs limit how many requests you can make. Use time.sleep() to be polite:

import time

for movie_id in movie_ids:
    response = requests.get(f"{BASE_URL}/movie/{movie_id}", params=params)
    data = response.json()
    # Process data...
    time.sleep(0.5)  # Wait 500ms between requests

Why rate limit?

  • Avoid getting blocked (HTTP 429: Too Many Requests)
  • Be a good API citizen
  • Most APIs specify limits in their documentation

POST Requests: Sending Data

So far we’ve used GET to retrieve data. POST sends data to a server:

# GET: retrieve data (parameters in URL)
response = requests.get(url, params={"query": "python"})

# POST: send data (data in request body)
response = requests.post(url, json={"name": "John", "email": "john@example.com"})

Common POST use cases:

  • Creating new records (users, posts, comments)
  • Submitting forms
  • Sending data for processing (e.g., ML model predictions)

Note: Most data retrieval APIs use GET; POST is mainly for writing data.

Error Handling

def safe_api_call(url, params=None, timeout=30):
    """Make an API call with proper error handling."""
    try:
        response = requests.get(url, params=params, timeout=timeout)
        response.raise_for_status()  # Raises exception for 4xx/5xx
        return response.json()
    except requests.exceptions.Timeout:
        print("Request timed out")
    except requests.exceptions.HTTPError as e:
        print(f"HTTP error: {e}")
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
    return None

# Example usage
data = safe_api_call("https://api.github.com/users/octocat")
if data:
    print(f"User: {data.get('login')}")
User: octocat

Error Handling: Why Each Part Matters

Component What it does Why it matters
try: Wraps risky operations APIs can fail for many reasons outside your control
timeout=30 Limits wait time Prevents your script from hanging indefinitely
raise_for_status() Converts HTTP 4xx/5xx to exceptions Without this, error responses look “successful”
Timeout exception Catches slow/unresponsive servers Network issues, server overload
HTTPError exception Catches bad responses (404, 500, etc.) Invalid URLs, rate limiting (429), server errors
RequestException Catches everything else DNS failures, connection refused, SSL errors
return None Signals failure gracefully Caller can check if data: instead of crashing

Without error handling, a single failed API call would crash your entire script—especially problematic when looping through hundreds of requests.

API Best Practices

  1. Read the documentation - Every API is different
  2. Respect rate limits - Use time.sleep() between requests
  3. Handle errors gracefully - Check status codes
  4. Use timeouts - Don’t hang indefinitely
  5. Store tokens securely - Never commit API keys to git!
# Use environment variables for tokens
import os
API_KEY = os.environ.get('TMDB_API_KEY')

Summary: APIs

  • Use the requests library for HTTP calls
  • Pass parameters as dictionaries
  • Pass tokens in headers
  • Handle timeouts and errors
  • Parse JSON responses with .json()
import requests

response = requests.get(url, params=params, headers=headers, timeout=60)
data = response.json()

Fundamentals of Web Scraping

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites – Wikipedia

How in Python?

  • requests: Fetch web pages
  • pandas.read_html(): Extract tables directly
  • BeautifulSoup: Parse and navigate HTML/XML documents
  • selenium: For dynamic websites with JavaScript

Scraping Data from a Webpage

Webpages contain data but are written in HTML, CSS, and JavaScript.

Inspecting HTML with Browser DevTools

To scrape a webpage, you need to understand its HTML structure:

  1. Right-click on the element you want → Inspect
  2. The Elements panel shows the HTML structure
  3. Hover over elements to highlight them on the page
  4. Look for patterns: tag names, classes, IDs

Common HTML elements to look for:

Element Tag Example
Tables <table> Data tables
Links <a href="..."> Navigation, references
Paragraphs <p> Text content
Divs <div class="..."> Content containers
Lists <ul>, <ol>, <li> Bulleted/numbered items

HTML Tables

HTML tables use these tags:

  • <table> - the container
  • <tr> - table row
  • <th> - header cell
  • <td> - data cell
<table>
  <tr><th>Film</th><th>Year</th><th>Awards</th></tr>
  <tr><td>Titanic</td><td>1997</td><td>11</td></tr>
  <tr><td>Avatar</td><td>2009</td><td>3</td></tr>
</table>

Setup for Scraping

import requests
import pandas as pd
from io import StringIO
import re
from bs4 import BeautifulSoup

Making Requests with Headers

Always include a User-Agent header to identify yourself:

# URLs for our movie data
BOXOFFICE_URL = "https://en.wikipedia.org/wiki/List_of_highest-grossing_films"
OSCARS_URL = "https://en.wikipedia.org/wiki/List_of_Academy_Award%E2%80%93winning_films"

# Headers to identify ourselves
HEADERS = {
    "User-Agent": "jsc370-class-project/1.0 (educational use)",
    "Accept-Language": "en-US,en;q=0.9",
}

Why Use Headers?

1. Servers may block requests without a User-Agent

Without headers, requests sends a default like python-requests/2.28.0. Many websites block or rate-limit requests that look like bots.

2. Identifying yourself is good etiquette

If your script causes problems, site admins can contact you instead of blocking your IP.

3. Accept-Language controls content language

Wikipedia serves content in different languages based on this header.

Header Purpose Example
User-Agent Identifies the client "MyApp/1.0 (contact@email.com)"
Accept-Language Preferred language "en-US,en;q=0.9"
Accept Expected response format "application/json"
Authorization API tokens/credentials "Bearer abc123"

Fetching the Data

Use requests.get to grab the website data from the url

# Request the box office page
boxoffice_request = requests.get(BOXOFFICE_URL, headers=HEADERS, timeout=30)

if boxoffice_request.ok:
    print("Box office response OK:", boxoffice_request.status_code)
else:
    print("Request failed:", boxoffice_request.status_code)

# Request the Oscars page
oscars_request = requests.get(OSCARS_URL, headers=HEADERS, timeout=30)
print("Oscars response:", oscars_request.status_code)
Box office response OK: 200
Oscars response: 200

Parsing Tables

We want to turn HTML <table> tags into pandas DataFrames.

  • pd.read_html() scans HTML and returns a list of DataFrames—one per table found.
  • StringIO() wraps the HTML string to look like a file-like object.
# Parse all tables from each page
box_tables = pd.read_html(StringIO(boxoffice_request.text))
osc_tables = pd.read_html(StringIO(oscars_request.text))

print(f"Found {len(box_tables)} tables on box office page")
print(f"Found {len(osc_tables)} tables on Oscars page")

# Get the main tables (first table on each page)
box = box_tables[0]
osc = osc_tables[0]
Found 91 tables on box office page
Found 2 tables on Oscars page

If the page has multiple tables, you might need to choose by inspecting box_tables[i].columns or by matching a column name.

Inspecting the Data

Print out the data in box and osc

print("Box Office Data:")
print(box.head(), '\n')

print("Oscar Winners Data:")
print(osc.head())
Box Office Data:
   Rank Peak                     Title   Worldwide gross  Year          Ref
0     1    1                    Avatar    $2,923,710,708  2009   [# 1][# 2]
1     2    1         Avengers: Endgame    $2,797,501,328  2019   [# 3][# 4]
2     3    3  Avatar: The Way of Water    $2,334,484,620  2022   [# 5][# 6]
3     4    1                   Titanic   T$2,257,906,828  1997   [# 7][# 8]
4     5    5                  Ne Zha 2  NZ$2,215,690,000  2025  [# 9][# 10] 

Oscar Winners Data:
             Film  Year Awards Nominations
0           Anora  2024      5           6
1   The Brutalist  2024      3          10
2    Emilia Pérez  2024      2          13
3          Wicked  2024      2          10
4  Dune: Part Two  2024      2           5

Merging DataFrames

Combine data using merge():

osc["Year"] = pd.to_numeric(osc["Year"].astype(str).str.extract(r"(\d{4})")[0], errors="coerce")
osc["Year"] = osc["Year"].astype("Int64")

# Left join: keep all box office films, add Oscar data where available
merged = box.merge(
    osc,
    how='left',
    left_on=['Title', 'Year'],
    right_on=['Film', 'Year']
)

This requires a few steps due to messy scraped data. We will come back to this later

Scraping Beyond Tables: BeautifulSoup

While pd.read_html() is great for tables, BeautifulSoup can extract any HTML element:

# Parse the box office page with BeautifulSoup
soup = BeautifulSoup(boxoffice_request.content, 'html.parser')

# Find the first paragraph
first_paragraph = soup.find('p')
print("First paragraph:", first_paragraph.get_text()[:100], "...")

# Find all section headings
headings = soup.find_all('h2')
print(f"\nFound {len(headings)} section headings")

# Find the table of contents
toc = soup.find(id='toc')
print(f"Table of contents found: {toc is not None}")
First paragraph: 
 ...

Found 10 section headings
Table of contents found: False

Key BeautifulSoup methods:

  • soup.find('tag') - first matching element
  • soup.find_all('tag') - all matching elements
  • element.get_text() - extract text content
  • element['attribute'] - get attribute value (e.g., href)

Regular Expressions in Python

Why regex? Scraped data is messy. Years appear as “2020/21”, currencies as “$2,923,710,708”, and text contains footnotes like “[1]”. Regex lets you extract and clean patterns programmatically.

The re module provides regex support:

import re

text = "Contact us at support@example.com or sales@company.org"

# Find all email addresses
pattern = r'[\w\.-]+@[\w\.-]+'
emails = re.findall(pattern, text)
print(f"Found emails: {emails}")

# Replace pattern
new_text = re.sub(pattern, "[EMAIL]", text)
print(f"Redacted: {new_text}")
Found emails: ['support@example.com', 'sales@company.org']
Redacted: Contact us at [EMAIL] or [EMAIL]

Breaking Down the Email Pattern

r'[\w\.-]+@[\w\.-]+'
Part Meaning Matches
[\w\.-] Character class: word chars, dots, hyphens s, u, p, ., -
+ One or more of the previous support, example.com
@ Literal @ symbol @
[\w\.-]+ Same pattern after @ example.com

Key re functions:

Function Purpose Example
re.findall(pattern, text) Find all matches Returns list of matches
re.search(pattern, text) Find first match Returns match object or None
re.sub(pattern, repl, text) Replace matches Returns modified string

Common Regex Patterns

Pattern Meaning
\d Digit (0-9)
\w Word character (a-z, A-Z, 0-9, _)
\s Whitespace
. Any character
* Zero or more
+ One or more
? Zero or one
[] Character class
^ Start of string
$ End of string

Cleaning Data with Regular Expressions

Some values need cleaning (e.g., “2020/21” should be “2020”):

# Extract just the first 4-digit year
year_example = "2020/21"
clean_year = re.search(r'\d{4}', year_example).group()
print(f"'{year_example}' -> '{clean_year}'")

# Clean currency: "$2,923,710,708" -> 2923710708
gross_example = "$2,923,710,708"
clean_gross = re.sub(r'[^\d]', '', gross_example)
print(f"'{gross_example}' -> {int(clean_gross)}")
'2020/21' -> '2020'
'$2,923,710,708' -> 2923710708

Understanding the Cleaning Patterns

Pattern 1: Extract year r'\d{4}'

Part Meaning
\d Any digit (0-9)
{4} Exactly 4 of them

Matches: "2020" from "2020/21" — ignores the /21 part

Pattern 2: Remove non-digits r'[^\d]'

Part Meaning
[^...] NOT these characters (negation)
\d Digits
Together Match anything that is NOT a digit

re.sub(r'[^\d]', '', text) replaces all non-digits with nothing, leaving only numbers.

"$2,923,710,708""2923710708"

Cleaning the Year Column

# Extract 4-digit year and convert to int
osc['Year_clean'] = (osc['Year']
    .astype(str)
    .str.extract(r'(\d{4})', expand=False)
    .astype('int', errors='ignore'))

print(osc[['Year', 'Year_clean']].head(10))
   Year  Year_clean
0  2024        2024
1  2024        2024
2  2024        2024
3  2024        2024
4  2024        2024
5  2024        2024
6  2024        2024
7  2024        2024
8  2024        2024
9  2024        2024

Merging Box Office and Oscar Data

# Clean the year column first
osc['Year'] = pd.to_numeric(
    osc['Year'].astype(str).str.extract(r'(\d{4})', expand=False),
    errors='coerce'
)

# Merge the dataframes
merged = box.merge(
    osc,
    how='left',
    left_on=['Title', 'Year'],
    right_on=['Film', 'Year']
)

print(f"Merged shape: {merged.shape}")
merged[['Title', 'Year', 'Worldwide gross', 'Awards']].head(10)
Merged shape: (50, 10)
Title Year Worldwide gross Awards
0 Avatar 2009 $2,923,710,708 3
1 Avengers: Endgame 2019 $2,797,501,328 NaN
2 Avatar: The Way of Water 2022 $2,334,484,620 1
3 Titanic 1997 T$2,257,906,828 11
4 Ne Zha 2 2025 NZ$2,215,690,000 NaN
5 Star Wars: The Force Awakens 2015 $2,068,223,624 NaN
6 Avengers: Infinity War 2018 $2,048,359,754 NaN
7 Spider-Man: No Way Home 2021 SM$1,922,598,800 NaN
8 Zootopia 2 † 2025 $1,777,638,637 NaN
9 Inside Out 2 2024 $1,698,863,816 NaN

Cleaning Worldwide Gross

# Extract numbers from gross column using regex
# Pattern matches: 1,234,567 or just 1234567
gross_clean = (merged['Worldwide gross']
    .astype(str)
    .str.extract(r'(\d{1,3}(?:,\d{3})+|\d{4,})', expand=False)
    .str.replace(',', '', regex=False)
    .astype('Int64'))

merged['gross_clean'] = gross_clean
merged[['Title', 'Worldwide gross', 'gross_clean']].head()
Title Worldwide gross gross_clean
0 Avatar $2,923,710,708 2923710708
1 Avengers: Endgame $2,797,501,328 2797501328
2 Avatar: The Way of Water $2,334,484,620 2334484620
3 Titanic T$2,257,906,828 2257906828
4 Ne Zha 2 NZ$2,215,690,000 2215690000

Handling Missing Values

# Fill missing Awards with 0 (no Oscar wins)
merged['Awards'] = merged['Awards'].fillna(0).astype('int')

# Create indicator for Oscar winners
merged['won_oscar'] = merged['Awards'] >= 1

print(merged[['Title', 'Awards', 'won_oscar']].head(10))
                          Title  Awards  won_oscar
0                        Avatar       3       True
1             Avengers: Endgame       0      False
2      Avatar: The Way of Water       1       True
3                       Titanic      11       True
4                      Ne Zha 2       0      False
5  Star Wars: The Force Awakens       0      False
6        Avengers: Infinity War       0      False
7       Spider-Man: No Way Home       0      False
8                  Zootopia 2 †       0      False
9                  Inside Out 2       0      False

Comparing Oscar Winners vs Non-Winners

# Group by Oscar status and compute statistics
summary = (merged
    .groupby('won_oscar')['gross_clean']
    .agg(num_movies='count', avg_gross='mean', median_gross='median'))

print(summary)
           num_movies          avg_gross  median_gross
won_oscar                                             
False              37  1409896224.810811  1308476166.0
True               13  1510293283.384615  1290000000.0

Visualizing the Results

import matplotlib.pyplot as plt

no_wins = merged[merged['won_oscar'] == False]['gross_clean'].dropna()
winners = merged[merged['won_oscar'] == True]['gross_clean'].dropna()

plt.boxplot([no_wins, winners], tick_labels=["No Oscar", "Oscar Winner"])
plt.ylabel("Worldwide Gross (USD)")
plt.title("Box Office Revenue: Oscar Winners vs Non-Winners")
plt.show()

Summary: Web Scraping

  • Use BeautifulSoup to parse HTML
  • Use pandas.read_html() for tables
  • Clean data with regular expressions
  • Be respectful - check robots.txt
from bs4 import BeautifulSoup
import requests
import pandas as pd

response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
tables = pd.read_html(str(soup))

Resources