import pandas as pd
import numpy as np
from plotnine import *
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
from pygam import LinearGAM, s
import statsmodels.api as sm
from folium.plugins import MarkerCluster
Lab 04 - Data Visualization and GAMs
Learning Goals
- Read in and prepare the meteorological dataset
- Use pd.merge() to join two datasets
- Handle missing values and impute data
- Create several graphs with different geoms in plotnine
- Create a facet graph
- Customize the graphs
- Fit smooth regression models using pygam and compare to a linear regression model
Lab Description
We will work with the meteorological data from last week’s lab.
The objective of the lab is to examine the association between weather variables in the US, practice data visualization, and fit smooth regression models.
1. Read in the data
First download and then read in with pandas:
url = "https://raw.githubusercontent.com/JSC370/JSC370-2026/main/data/met_all_2025.gz"
met = pd.read_csv(url, compression="gzip")
2. Prepare the data: some wrangling
- From last week: remove temperatures less than -20C and change 999.9 to NaN.
- Generate a date variable using pd.to_datetime().
- Using date filtering, keep the observations of the first week of July 2025.
- Compute the mean by station of the variables temp, rh, wind_sp, vis_dist, dew_point, lat, lon, and elev.
- Create a region variable for NW, SW, NE, SE based on lon = -98.00 and lat = 39.71 degrees.
- Create a categorical variable for elevation (low: < 252m, high: >= 252m)
# Replace 999.9 with NaN and filter temps > -20
# (the comparison is False for NaN, so rows with missing temp are dropped too)
met.loc[met['temp'] == 999.9, 'temp'] = np.nan
met = met[met['temp'] > -20].copy()
# Create date variable from the year, month, day, and hour columns
met['date'] = pd.to_datetime(
    met[['year', 'month', 'day', 'hour']])
# Create region variable using np.select (conditions are checked in order)
met['region'] = (
    np.select(
        [
            (met['lon'] < -98) & (met['lat'] >= 39.71),
            (met['lon'] >= -98) & (met['lat'] >= 39.71),
            (met['lon'] < -98) & (met['lat'] < 39.71),
        ],
        ['NW', 'NE', 'SW'],
        default='SE'
    )
)
# Create elevation category (high: >= 252m, otherwise low)
met['elev_high_low'] = np.select(
    [met['elev'] >= 252],
    ['high'],
    default='low'
)
display(met.head())
| | USAFID | WBAN | year | month | day | hour | min | lat | lon | elev | ... | temp | temp_qc | dew_point | dew_point_qc | atm_press | atm_press_qc | rh | date | region | elev_high_low |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 690150 | 93121 | 2025 | 6 | 1 | 0 | 26 | 34.296 | -116.162 | 625 | ... | 37.2 | 1 | 4.4 | 1 | NaN | 9 | 12.952344 | 2025-06-01 00:00:00 | SW | high |
| 1 | 690150 | 93121 | 2025 | 6 | 1 | 0 | 54 | 34.296 | -116.162 | 625 | ... | 36.7 | 1 | 4.4 | 1 | 1008.4 | 1 | 13.324572 | 2025-06-01 00:00:00 | SW | high |
| 2 | 690150 | 93121 | 2025 | 6 | 1 | 1 | 54 | 34.296 | -116.162 | 625 | ... | 34.4 | 1 | 4.4 | 1 | 1008.4 | 1 | 15.194316 | 2025-06-01 01:00:00 | SW | high |
| 3 | 690150 | 93121 | 2025 | 6 | 1 | 2 | 54 | 34.296 | -116.162 | 625 | ... | 32.2 | 1 | 5.6 | 1 | 1008.9 | 1 | 18.773303 | 2025-06-01 02:00:00 | SW | high |
| 4 | 690150 | 93121 | 2025 | 6 | 1 | 3 | 54 | 34.296 | -116.162 | 625 | ... | 31.1 | 1 | 5.6 | 1 | 1009.1 | 1 | 20.015403 | 2025-06-01 03:00:00 | SW | high |
5 rows × 33 columns
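The wrangling list above also asks for a date filter and per-station means, which the code shown does not cover. The following is a minimal sketch of those two steps on a small made-up table; `met_toy` and its values are hypothetical, and only a few of the listed variables are included to keep it short.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly records for two stations (values are made up)
met_toy = pd.DataFrame({
    "USAFID": [1, 1, 1, 2, 2, 2],
    "date": pd.to_datetime([
        "2025-06-30 23:00", "2025-07-01 00:00", "2025-07-07 23:00",
        "2025-07-03 12:00", "2025-07-08 00:00", "2025-07-05 06:00",
    ]),
    "temp": [20.0, 22.0, 26.0, 30.0, 99.0, np.nan],
    "lat": [40.0, 40.0, 40.0, 35.0, 35.0, 35.0],
    "lon": [-100.0, -100.0, -100.0, -90.0, -90.0, -90.0],
})

# Keep July 1 through July 7 inclusive: half-open interval [July 1, July 8)
start = pd.Timestamp("2025-07-01")
end = pd.Timestamp("2025-07-08")
week1 = met_toy[(met_toy["date"] >= start) & (met_toy["date"] < end)]

# Mean by station; mean() skips NaN by default
station_means = (
    week1.groupby("USAFID", as_index=False)[["temp", "lat", "lon"]].mean()
)
print(station_means)
```

On the real data the same pattern would apply to `met`, with the full list of variables passed to the column selection. The half-open interval avoids having to spell out the last hour of July 7.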
3. Use geom_violin to examine dew_point for low and high elevations by region
Use geom_violin and subset the data to the first two weeks in July.
- Subset to the first two weeks in July
- Use facets
- Summarize below
Summary:
4. Use geom_bar to create barplots of the proportion of weather stations by elevation category colored by region
- Use the subset data from #3, the first two weeks of July
- Create nice labels on axes and add a title
- Try a second plot with counts and dodge positioning
- Summarize below
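Before drawing the bars, it can help to tabulate the numbers they encode. The sketch below uses a hypothetical eight-station table: the count table corresponds to what a dodged bar chart of counts would show, and the row-normalized table corresponds to a proportion chart with bars stacked to 1 within each elevation category.

```python
import pandas as pd

# Hypothetical station-level table: one row per station (made-up values)
stations = pd.DataFrame({
    "USAFID": range(1, 9),
    "elev_high_low": ["low", "low", "low", "low",
                      "high", "high", "high", "high"],
    "region": ["NE", "NE", "SE", "SW", "NW", "NW", "SW", "NE"],
})

# Counts of stations per elevation category and region
counts = pd.crosstab(stations["elev_high_low"], stations["region"])

# Share of each region within an elevation category (each row sums to 1)
props = pd.crosstab(stations["elev_high_low"], stations["region"],
                    normalize="index")

print(counts)
print(props)
```

Comparing the two tables against the finished plots is a quick way to confirm that the bars represent stations rather than hourly observations.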
Summary:
5. Use stat_summary to examine mean dew point by region with standard deviation error bars
- Use stat_summary with appropriate functions for mean and standard deviation
- Add error bars using another layer of stat_summary with geom = "errorbar"
- Use coord_flip
- Add labels and a title
- Summarize below
Summary:
6. Smooth Regression with GAMs
Let’s practice fitting regression models with smooth functions of the predictors. We use OLS from statsmodels for the linear model and LinearGAM from the pygam package for the smooth model.
- Use the subsetted data. First remove NaN before fitting
- Fit both a linear model with sm.OLS and a spline model (use LinearGAM() with s() for a smooth term on wind_sp and temp).
- For the spline model try n_splines = 20
- Summarize and plot the results from the models.
- Now fit linear model with sm.OLS
Summary:
- Report adjusted R2
- Are the beta coefficients for wind speed and temperature significant?
Summary:
- Report pseudo R2, how does it compare to the linear model R2?
- What are the EDoF for wind speed and temp?
- Are the smooths for wind speed and temperature significant?
Summary:
- Visual inspection of the fitted curves
- Does the smooth term capture meaningful non-linearity?