Lab 04 - Data Visualization and GAMs

import pandas as pd
import numpy as np
from plotnine import *
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
from pygam import LinearGAM, s
import statsmodels.api as sm
from folium.plugins import MarkerCluster

Learning Goals

  • Read in and prepare the meteorological dataset
  • Use pd.merge() to join two datasets
  • Handle missing values and impute data
  • Create several graphs with different geoms in plotnine
  • Create a facet graph
  • Customize the graphs
  • Fit smooth regression models using pygam and compare to a linear regression model

Lab Description

We will work with the meteorological data from last week’s lab.

The objective of the lab is to examine the association between weather variables in the US, practice data visualization, and fit smooth regression models.

1. Read in the data

Read the compressed file directly from the URL with pandas:

url = "https://raw.githubusercontent.com/JSC370/JSC370-2026/main/data/met_all_2025.gz"
met = pd.read_csv(url, compression="gzip")

2. Prepare the data: some wrangling

  • From last week: remove temperatures less than -20C and change 999.9 to NaN.
  • Generate a date variable using pd.to_datetime().
  • Using date filtering, keep the observations of the first week of July 2025.
  • Compute the mean by station of the variables temp, rh, wind_sp, vis_dist, dew_point, lat, lon, and elev.
  • Create a region variable for NW, SW, NE, SE based on lon = -98.00 and lat = 39.71 degrees.
  • Create a categorical variable for elevation (low: < 252m, high: >= 252m)

# Recode the 999.9 missing-value code as NaN, then keep temps above -20 C.
# NaN > -20 evaluates to False, so rows with missing temp are dropped too.
met.loc[met['temp'] == 999.9, 'temp'] = np.nan
met = met[met['temp'] > -20].copy()

# Create date variable
met['date'] = pd.to_datetime(
    met[['year', 'month', 'day', 'hour']])

# Create region variable using np.select
met['region'] = (
    np.select(
        [
            (met['lon'] < -98) & (met['lat'] >= 39.71),
            (met['lon'] >= -98) & (met['lat'] >= 39.71),
            (met['lon'] < -98) & (met['lat'] < 39.71),
        ],
        ['NW', 'NE', 'SW'],
        default='SE'
    )
)

# Create elevation category (two levels, so np.where is enough)
met['elev_high_low'] = np.where(met['elev'] >= 252, 'high', 'low')

display(met.head())
   USAFID   WBAN  year  month  day  hour  min     lat       lon  elev  ...  temp  temp_qc  dew_point  dew_point_qc  atm_press  atm_press_qc         rh                 date  region  elev_high_low
0  690150  93121  2025      6    1     0   26  34.296  -116.162   625  ...  37.2        1        4.4             1        NaN             9  12.952344  2025-06-01 00:00:00      SW           high
1  690150  93121  2025      6    1     0   54  34.296  -116.162   625  ...  36.7        1        4.4             1     1008.4             1  13.324572  2025-06-01 00:00:00      SW           high
2  690150  93121  2025      6    1     1   54  34.296  -116.162   625  ...  34.4        1        4.4             1     1008.4             1  15.194316  2025-06-01 01:00:00      SW           high
3  690150  93121  2025      6    1     2   54  34.296  -116.162   625  ...  32.2        1        5.6             1     1008.9             1  18.773303  2025-06-01 02:00:00      SW           high
4  690150  93121  2025      6    1     3   54  34.296  -116.162   625  ...  31.1        1        5.6             1     1009.1             1  20.015403  2025-06-01 03:00:00      SW           high

5 rows × 33 columns
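
The date filter and the per-station means from the list in this section are not shown in the code above. A minimal sketch of those two steps on toy data might look like the following. The toy frame, its two made-up stations, and the shortened column list are assumptions for illustration only; on the real data the same pattern would be applied to `met` with the full list of variables.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `met` (hypothetical values): two stations with hourly
# readings from 28 June to 10 July 2025.
rng = np.random.default_rng(0)
dates = pd.date_range("2025-06-28", "2025-07-10 23:00", freq="h")
toy = pd.concat(
    [
        pd.DataFrame({
            "USAFID": usafid,
            "date": dates,
            "temp": rng.normal(25, 3, len(dates)),
            "lat": lat,
            "lon": lon,
            "elev": elev,
        })
        for usafid, lat, lon, elev in [(1, 45.0, -110.0, 900), (2, 30.0, -90.0, 10)]
    ],
    ignore_index=True,
)

# Keep 1 July 00:00 up to, but not including, 8 July 00:00.
in_week = (toy["date"] >= "2025-07-01") & (toy["date"] < "2025-07-08")
week1 = toy.loc[in_week]

# One row per station; mean() skips NaN values by default.
station_means = (
    week1.groupby("USAFID", as_index=False)[["temp", "lat", "lon", "elev"]].mean()
)
print(station_means)
```

Using a half-open interval (`>=` start, `<` end) avoids having to spell out the last hour of the final day.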

3. Use geom_violin to examine dew_point for low and high elevations by region

Use geom_violin and subset the data to the first two weeks in July.

  • Subset to the first two weeks in July
  • Use facets
  • Summarize below

Summary:

4. Use geom_bar to create barplots of the proportion of weather stations by elevation category colored by region

  • Use the subset data from #3, the first two weeks of July
  • Create nice labels on axes and add a title
  • Try a second plot with counts and dodge positioning
  • Summarize below

Summary:

5. Use stat_summary to examine mean dew point by region with standard deviation error bars

  • Use stat_summary with appropriate functions for mean and standard deviation
  • Add error bars using another layer of stat_summary with geom = "errorbar"
  • Use coord_flip
  • Add labels and a title
  • Summarize below

Summary:

6. Smooth Regression with GAMs

Let’s practice fitting regression models with smooth functions of the predictors. We use sm.OLS from statsmodels for the linear model and LinearGAM from the pygam package for the smooth model.

  • Use the subset data and remove rows with NaN before fitting.
  • Fit both a linear model with sm.OLS and a spline model (use LinearGAM() with an s() smooth term for each of wind_sp and temp).
  • For the spline model try n_splines = 20
  • Summarize and plot the results from the models.
  • Start with the linear model fit with sm.OLS

Summary:

  • Report adjusted R2
  • Are the beta coefficients for wind speed and temperature significant?

Summary:

  • Report the pseudo R2. How does it compare to the linear model's R2?
  • What are the effective degrees of freedom (EDoF) for wind speed and temp?
  • Are the smooths for wind speed and temperature significant?

Summary:

  • Visual inspection of the fitted curves
  • Does the smooth term capture meaningful non-linearity?