JSC 370: Data Science II

Week 8: Decision Trees, Random Forests, Boosting and Gradient Boosting

Outline

  • Part 1: Motivation for Classification and Regression Trees
  • Part 2: Decision Trees
  • Part 3: Ensemble methods including Bagging, Random Forests, Boosting, Gradient Boosting, XGBoost

Geometry of Data for Classification

  • The decision boundary is defined where the probability of being in class 1 and class 0 are equal, i.e.

\[P(Y=1) = P(Y=0) \rightarrow P(Y=1) = 0.5\]

  • In logistic regression this is equivalent to the log-odds being 0: \(x\beta = 0\)

Geometry of Data for Classification

  • Here we are classifying vegetation and non-vegetation
  • The decision boundary is \[−0.8 x_1+x_2=0 \rightarrow x_2=0.8x_1\]
  • This translates to latitude \(=0.8\times\) longitude

Geometry of Data for Classification

  • Logistic regression for classification works best when the classes are well separated in the feature space
  • Linear boundaries are easy to interpret, but not straightforward in non-linear cases

Geometry of Data for Classification

  • LHS: Multiple linear boundaries that form squares will perform better

  • RHS: Circular boundaries will perform better

Geometry of Data for Regression

  • In regression, the goal is to predict a continuous outcome rather than a class label

  • Instead of finding decision boundaries that separate classes, we partition the feature space into regions where we predict the mean response

  • Linear regression fits a global model: \(\hat y = x\beta\), which works well when the relationship is linear

  • But what if the relationship is non-linear or involves interactions?
    • We could add polynomial terms or interaction terms, but this requires knowing the form in advance
    • GAM models were a step in this direction
    • Tree-based methods automatically discover non-linear relationships and interactions by recursively partitioning the feature space

Regression Trees

  • A regression tree splits the feature space into \(M\) distinct, non-overlapping regions \(R_1, R_2, \dots, R_M\)
  • For each region, we predict the mean of the training responses in that region: \[ \hat y_{R_m} = \frac{1}{|R_m|} \sum_{i \in R_m} y_i \]
  • To build the tree, we minimize the residual sum of squares (RSS): \[\text{RSS} = \sum_{m=1}^{M} \sum_{i \in R_m} (y_i - \hat{y}_{R_m})^2\]
  • At each step, we choose the predictor \(j\) and split point \(s\) that minimize: \[\sum_{i: x_i \in R_1(j,s)} (y_i - \hat y_{R_1})^2 + \sum_{i: x_i \in R_2(j,s)} (y_i - \hat y_{R_2})^2\]

Decision Trees

  • Simple flow charts can be formulated as mathematical models for both classification and regression.
  • Properties:
    • Interpretable by humans.
    • Sufficiently complex decision boundaries.
    • Locally linear decision boundaries.

Decision Tree: Classification

  • Involve stratifying or segmenting the space into simple regions.

Decision Tree: Splitting

Formally, a decision tree model is one in which the final outcome of the model is based on a series of comparisons of the values of predictors against threshold values. Each comparison and branching represents splitting a region in the feature space on a single feature. Typically, at each iteration, we split once along one dimension (one predictor).

Decision Tree Terminology

  • Root node: the top of the tree — contains all observations before any split
  • Internal node: where a split occurs — applies a rule like “is \(x_j \leq t\)?” and sends observations left or right
  • Split: the act of dividing a node into two child nodes based on a feature and threshold
  • Leaf node (terminal node): where splitting has stopped — holds the final prediction
    • Classification: the majority class in that leaf
    • Regression: the mean response in that leaf
  • Depth: how many splits deep a node is from the root

Every path from root to leaf represents a series of if-then rules — this is what makes decision trees interpretable
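
These root-to-leaf rule chains can be printed directly. A minimal sketch using scikit-learn's `export_text` on toy, made-up data (the variable names below are ours):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: two features, binary class (hypothetical values)
X_toy = [[1, 5], [2, 4], [3, 1], [4, 2], [5, 3], [6, 1]]
y_toy = [0, 0, 1, 1, 1, 1]

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# Each root-to-leaf path prints as a chain of if-then comparisons
print(export_text(clf, feature_names=["x1", "x2"]))
```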

Decision Tree: Regression

  • Predict grade from study time

Decision Tree: Regression

  • The tree splits study time into \(M\) distinct, non-overlapping regions \(R_1, R_2, \dots, R_M\)

Learning the Tree Model

  1. Start with an empty decision tree.
  2. Choose the ‘optimal’ predictor and threshold for splitting.
  3. Recurse on each new node until a stopping condition is met.

To do this, we need to define both the splitting criterion and the stopping condition

Greedy Algorithms

  • Always makes the choice that seems best at the moment.
  • Ensures local optimality at each step — with no guarantee that the sequence of choices is globally optimal.
  • Never reverses a decision once made.

Example: Making change for $0.63

  • Available coins: quarters (25¢), dimes (10¢), nickels (5¢), pennies (1¢)
  • Greedy approach: always pick the largest coin that fits
    • 25¢ → 25¢ → 10¢ → 1¢ → 1¢ → 1¢ = 6 coins
  • In decision trees: at each node, pick the single split (feature + threshold) that gives the best improvement — without considering whether a different split now might lead to a better tree overall
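
The coin-change heuristic above fits in a few lines — a toy sketch, not from any library:

```python
def greedy_change(amount_cents, coins=(25, 10, 5, 1)):
    """At each step, take the largest coin that fits -- never reconsider."""
    used = []
    for coin in coins:
        while amount_cents >= coin:
            amount_cents -= coin
            used.append(coin)
    return used

print(greedy_change(63))  # [25, 25, 10, 1, 1, 1] -- six coins, as above
```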

Optimality of Splitting

  • The greedy algorithm needs a metric to decide the “best” split at each node
  • No single ‘correct’ way to define an optimal split, but two common approaches:
  • Classification: minimize impurity — how mixed are the classes in each region?
    • Gini Index (most common), Entropy / Information Gain
  • Regression: minimize RSS — how far are observations from the region mean?
  • Common sense guidelines:
    • Feature space should grow progressively more pure (classification) or more homogeneous (regression) with splits
    • Fitness metric of a split should be differentiable
    • Avoid empty regions with no training points

Gini Index

  • The Gini Index is a metric used to measure the impurity or homogeneity of a dataset at a node.
  • It helps in determining the best feature to split on when building the tree.

Gini Index

  • Suppose we have \(J\) predictors, \(N\) training points and \(K\) classes.
  • Suppose we select the \(j\)-th predictor and split a region containing \(N\) training points along the threshold \(t_j \in \mathbb{R}\).
  • We can assess the quality of this split by measuring the purity of each newly created region, \(R_1, R_2\). This metric is called the Gini Index: \[Gini(R_i) = 1 - \sum_{k=1}^{K} p(k \mid R_i)^2\] where \(p(k \mid R_i)\) is the fraction of training points in \(R_i\) that belong to class \(k\)


Understanding Gini Index

  • If all samples at a node belong to the same class, Gini = 0 (pure node).
  • If samples are evenly distributed among classes, Gini is maximized.
  • The goal of splitting in decision trees (like CART) is to minimize the Gini Index, leading to purer nodes.

Gini Index

We can try to find the predictor \(j\) and the threshold \(t_j\) that minimize the average Gini Index over the two regions, weighted by the population of the regions (\(N_i\) is the number of training points in region \(R_i\)): \[\min_{j,\, t_j} \; \frac{N_1}{N} Gini(R_1) + \frac{N_2}{N} Gini(R_2)\]
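
A from-scratch sketch of the Gini Index and its size-weighted average over the two child regions (function names are ours, not from a library):

```python
import numpy as np

def gini(labels):
    """Gini Index of one region: 1 - sum_k p(k|R)^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_gini(left_labels, right_labels):
    """Average Gini over the two regions, weighted by region size."""
    n1, n2 = len(left_labels), len(right_labels)
    return (n1 * gini(left_labels) + n2 * gini(right_labels)) / (n1 + n2)

print(gini([1, 1, 1, 1]))                    # pure node: 0.0
print(gini([0, 0, 1, 1]))                    # evenly mixed: 0.5
print(weighted_gini([0, 0, 0], [0, 1, 1]))   # one pure child, one mixed
```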

RSS for Regression Trees

  • For regression, we use Residual Sum of Squares (RSS) instead of Gini
  • At each split, choose the predictor \(j\) and threshold \(s\) that minimize the weighted RSS across the two new regions:

\[\text{RSS}(j, s) = \sum_{i: x_i \in R_1(j,s)} (y_i - \hat y_{R_1})^2 + \sum_{i: x_i \in R_2(j,s)} (y_i - \hat y_{R_2})^2\]

where \(\hat y_{R_m}\) is the mean response in region \(R_m\)

  • Intuition: a good split creates regions where the observations are close to their region mean — i.e., the variation within each region is small
  • Like Gini, the greedy algorithm tries every feature and every possible split point, and picks the one with the lowest RSS
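
A minimal sketch of this exhaustive search for a single predictor, on made-up study-time/grade values:

```python
import numpy as np

def best_split(x, y):
    """Try every candidate threshold on one predictor; return the
    threshold with the lowest total RSS across the two regions."""
    best_s, best_rss = None, np.inf
    for s in np.unique(x)[:-1]:        # candidate thresholds
        left, right = y[x <= s], y[x > s]
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if rss < best_rss:
            best_s, best_rss = s, rss
    return best_s, best_rss

# Hypothetical data: grades jump once study time exceeds 3 hours
x_hours = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y_grades = np.array([60, 62, 61, 85, 88, 86], dtype=float)
print(best_split(x_hours, y_grades))  # chooses the threshold at 3
```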

Splitting Criteria: Summary

                 Classification                Regression
  Goal           Maximize purity               Minimize variance
  Metric         Gini Index                    RSS
  Prediction     Majority class in region      Mean response in region
  Greedy choice  Split that reduces Gini most  Split that reduces RSS most

From Splitting to Stopping

  • We now know how to evaluate a split: Gini (classification) or RSS (regression)
  • The greedy algorithm keeps splitting — but when should it stop?
  • If we never stop, the tree grows until every leaf contains a single observation
    • Perfect training accuracy, but massive overfitting
  • We need a stopping condition to decide when a split is no longer worth making

Gain: Measuring Improvement from a Split

  • Gain measures how much a split improves the metric — it is the difference between the impurity (or RSS) of the parent node and the weighted average of the children:

\[\text{Gain}(R) = m(R) - \frac{N_1}{N} m(R_1) - \frac{N_2}{N} m(R_2)\]

  • where \(m\) is the splitting metric (Gini, entropy, or RSS), \(R\) is the parent region, \(R_1, R_2\) are the child regions, and \(N_1, N_2\) are their sizes
  • High gain: the split meaningfully separates the data — worth doing
  • Low gain: the split barely improves things — may not be worth the added complexity
  • Zero gain: no improvement — the split does nothing useful
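
The gain formula can be computed directly, here with Gini as the metric \(m\) (toy labels; helper names are ours):

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gain(parent, left, right):
    """Gain(R) = m(R) - (N1/N) m(R1) - (N2/N) m(R2)."""
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

parent = [0, 0, 0, 1, 1, 1]
print(gain(parent, [0, 0, 0], [1, 1, 1]))  # perfect split: gain = 0.5
print(gain(parent, [0, 1], [0, 0, 1, 1]))  # children mixed like the parent: gain = 0.0
```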

Stopping Conditions

We can stop splitting when:

  • The gain falls below a threshold — the split doesn’t improve enough to justify
  • A node reaches a minimum number of observations (e.g., min_samples_leaf)
  • The tree reaches a maximum depth
  • A node is already pure (Gini = 0) or has zero RSS

Problem: What is the major issue with pre-specifying a stopping condition?

  • You may stop too early (miss useful splits deeper in the tree) or too late (overfit)

Solutions:

  • Try several thresholds and cross-validate to find the best one
  • Or: don’t stop at all — grow the full tree, then prune it back

Pruning

  • Instead of trying to find the right stopping condition up front, grow a large tree first, then cut it back
  • A fully grown tree overfits — it memorizes the training data, including noise

Pruning: How It Works

  • Cost-complexity pruning: add a penalty for tree size

\[\text{Cost}(T) = \text{RSS}(T) + \alpha |T|\]

  • \(|T|\) = number of leaf nodes, \(\alpha\) = complexity parameter
  • Small \(\alpha\): keep more leaves (complex tree)
  • Large \(\alpha\): penalize leaves heavily (simpler tree)

  • For each \(\alpha\), find the subtree that minimizes Cost(\(T\))
  • Use cross-validation to choose the best \(\alpha\)
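
scikit-learn implements this as minimal cost-complexity pruning via `ccp_alpha`. A sketch on synthetic data (all names below are ours):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_step = rng.uniform(0, 10, size=(200, 1))
y_step = np.where(X_step[:, 0] < 5, 2.0, 8.0) + rng.normal(0, 1, 200)

# Grow the full tree, then recover the sequence of alphas that prune it back
full_tree = DecisionTreeRegressor(random_state=0).fit(X_step, y_step)
alphas = full_tree.cost_complexity_pruning_path(X_step, y_step).ccp_alphas

# Cross-validate to pick the alpha that generalizes best
scores = [
    cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                    X_step, y_step, cv=5).mean()
    for a in alphas
]
best_alpha = alphas[int(np.argmax(scores))]
pruned = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X_step, y_step)
print(f"leaves: {full_tree.get_n_leaves()} -> {pruned.get_n_leaves()}")
```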

Pruning: Before and After

  • The pruned tree is simpler, more interpretable, and generalizes better to new data
  • We trade a small increase in training error for a large decrease in test error

Summary: Decision trees

Decision trees partition training data into homogeneous nodes / subgroups with similar response values.

Pros

  • Decision trees are very easy to explain to non-statisticians.
  • Easy to visualize and thus easy to interpret without assuming a parametric form

Cons

  • High variance, i.e. split a dataset in half and grow a tree on each half, and the two resulting trees can look very different
  • Relatedly, they generalize poorly, resulting in higher test-set error rates

But there are several ways we can overcome this via ensemble models

Bagging

Bootstrap aggregation (aka bagging) is a general approach for overcoming high variance

  • Bootstrap: sample the training data with replacement

  • Aggregation: Combine the results from many trees together, each constructed with a different bootstrapped sample of the data

Bagging Algorithm

Start with a specified number of trees \(B\):

  • For each tree \(b\) in \(1, \dots, B\):

    • Construct a bootstrap sample from the training data
    • Grow a deep, unpruned, complicated (aka really overfit!) tree

To generate a prediction for a new point:

  • Regression: take the average across the \(B\) trees
  • Classification: take the majority vote across the \(B\) trees

    • assuming each tree predicts a single class (could use probabilities instead…)

Improves prediction accuracy via wisdom of the crowds - but at the expense of interpretability

  • Easy to read one tree, but how do you read \(B = 500\)?

But we can still use the measures of variable importance and partial dependence to summarize our models
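
The bagging algorithm above can be sketched by hand with individual scikit-learn trees (synthetic data; in practice you would reach for `BaggingRegressor` or a random forest):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X_bag = rng.uniform(0, 10, size=(100, 2))
y_bag = np.sin(X_bag[:, 0]) + 0.5 * X_bag[:, 1] + rng.normal(0, 0.3, 100)

B = 50
bagged_trees = []
for b in range(B):
    # Bootstrap: sample N rows with replacement
    idx = rng.integers(0, len(X_bag), size=len(X_bag))
    # Grow a deep, unpruned tree on the bootstrap sample
    bagged_trees.append(DecisionTreeRegressor(random_state=b).fit(X_bag[idx], y_bag[idx]))

# Aggregate: average the B predictions for a new point
x_new = np.array([[5.0, 3.0]])
pred = np.mean([t.predict(x_new)[0] for t in bagged_trees])
print(f"bagged prediction: {pred:.3f}")
```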

Random Forest Algorithm

Random forests are an extension of bagging

  • For each tree \(b\) in \(1, \dots, B\):

    • Construct a bootstrap sample from the training data
    • Grow a deep, unpruned, complicated (aka really overfit!) tree, but with a twist:
    • At each split, limit the variables considered to a random subset \(m_{try}\) of the original \(p\) variables

Predictions are made the same way as bagging:

  • Regression: take the average across the \(B\) trees

  • Classification: take the majority vote across the \(B\) trees

Split-variable randomization adds more randomness to make each tree more independent of each other

Introduce \(m_{try}\) as a tuning parameter: typically use \(p / 3\) (regression) or \(\sqrt{p}\) (classification)

  • \(m_{try} = p\) is bagging

Example data: MLB 2021 batting statistics

The MLB 2021 batting statistics leaderboard from Fangraphs

We aim to predict WAR (Wins Above Replacement), an advanced metric that estimates the total number of wins a player contributes to their team compared to a “replacement-level” player. A replacement-level player is a theoretical player who is readily available, typically a Triple-A call-up or a minimum-salary free agent, and represents the baseline of a “0.0 WAR” player

import pandas as pd
import numpy as np

mlb_data = pd.read_csv("http://www.stat.cmu.edu/cmsac/sure/2021/materials/data/fg_batting_2021.csv")
mlb_data.columns = mlb_data.columns.str.lower().str.replace(" ", "_")

# fix strings with % in BB% and K% to make numeric
for col in ["bb%", "k%"]:
    if col in mlb_data.columns:
        mlb_data[col] = mlb_data[col].astype(str).str.replace("%", "").str.strip()
        mlb_data[col] = pd.to_numeric(mlb_data[col], errors="coerce")

model_mlb_data = mlb_data.drop(columns=["name", "team", "playerid"], errors="ignore")
model_mlb_data.head()
g pa hr r rbi sb bb% k% iso babip avg obp slg woba xwoba wrc+ bsr off def war
0 82 354 27 66 69 2 14.4 17.2 0.336 0.346 0.336 0.438 0.671 0.462 0.439 194 0.2 40.9 -7.5 4.6
1 68 288 27 66 58 18 12.5 28.1 0.395 0.333 0.302 0.385 0.698 0.443 0.420 185 5.4 35.7 -3.2 4.2
2 79 347 16 61 52 0 13.5 17.0 0.231 0.324 0.298 0.398 0.529 0.397 0.377 157 -2.7 21.6 5.7 4.0
3 82 372 21 63 54 10 8.9 23.9 0.256 0.329 0.286 0.349 0.542 0.379 0.328 139 1.0 18.7 5.4 3.7
4 78 342 23 67 51 16 13.2 24.3 0.313 0.306 0.278 0.386 0.592 0.409 0.428 159 2.7 27.6 -2.2 3.7

MLB 2021 Batting Statistics: Variables

  Column  Description                   Column  Description
  g       Games played                  babip   Batting avg on balls in play
  pa      Plate appearances             avg     Batting average
  hr      Home runs                     obp     On-base percentage
  r       Runs scored                   slg     Slugging percentage
  rbi     Runs batted in                woba    Weighted on-base average
  sb      Stolen bases                  xwoba   Expected wOBA (Statcast)
  bb%     Walk rate (%)                 wrc+    Weighted runs created plus
  k%      Strikeout rate (%)            bsr     Base running runs above avg
  iso     Isolated power (SLG − AVG)    off     Offensive runs above avg
                                        def     Defensive runs above avg

Target: war — Wins Above Replacement. Note, off, def, and bsr are direct components of WAR (WAR is approx Off + Def + BsR + replacement adjustment).

Example Random Forest

scikit-learn’s RandomForestRegressor is a popular implementation

from sklearn.ensemble import RandomForestRegressor

model_mlb_data = model_mlb_data.dropna()
X = model_mlb_data.drop(columns=["war"])
y = model_mlb_data["war"]

init_mlb_rf = RandomForestRegressor(n_estimators=50, random_state=42)
init_mlb_rf.fit(X, y)
print(f"R² (training): {init_mlb_rf.score(X, y):.4f}")
R² (training): 0.9876

Out-of-bag (OOB) estimate

  • Each bootstrap sample draws \(N\) observations with replacement from the original \(N\)
  • Some observations will be selected multiple times, others not at all
  • On average, about \(63\%\) of observations end up in any given bootstrap sample
  • The remaining \(\approx 37\%\) are called out-of-bag (OOB) observations for that tree
  • For each observation \(i\), roughly \(B/e \approx 0.37B\) trees were built without seeing it
  • We can predict observation \(i\) using only those trees — giving a built-in test set estimate without needing cross-validation

OOB: Why 63%?

  • The probability that observation \(i\) is not selected in a single draw is \(\left(1 - \frac{1}{N}\right)\)
  • After \(N\) draws with replacement: \(P(\text{not in sample}) = \left(1 - \frac{1}{N}\right)^N \approx e^{-1} \approx 0.368\)
  • So \(P(\text{in sample}) \approx 1 - 0.368 = 0.632\), i.e. about \(63\%\)
  • This means each tree has a free validation set of ~37% of the data
  • The OOB error is computed by aggregating predictions for each observation using only the trees that did not include it in training
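
A quick numeric check of this limit:

```python
import numpy as np

# P(observation i never drawn in one bootstrap sample) = (1 - 1/N)^N
for N in [10, 100, 1000, 10000]:
    print(N, round((1 - 1 / N) ** N, 4))

print("e^-1 =", round(np.exp(-1), 4))  # the N -> infinity limit, ~0.368
```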

OOB in the MLB example

# Refit with oob_score=True to get OOB R²
oob_rf = RandomForestRegressor(n_estimators=50, oob_score=True, random_state=42)
oob_rf.fit(X, y)
print(f"R² (training):  {oob_rf.score(X, y):.4f}")
print(f"R² (OOB):       {oob_rf.oob_score_:.4f}")
R² (training):  0.9876
R² (OOB):       0.9144
  • The training R² is high because the model has seen this data
  • The OOB R² is a more honest estimate of performance on unseen data.

Tuning Hyperparameters

  • A model’s hyperparameters are settings chosen before training — they control how the model learns, not what it learns
  • Default values often work reasonably well, but tuning can significantly improve performance
  • Under-tuned model: may underfit (too simple) or overfit (too complex)
  • Well-tuned model: finds the sweet spot between bias and variance
  • Tuning is done via cross-validation: try different hyperparameter values, evaluate each on held-out folds, and pick the combination that generalizes best
  • This is especially important for ensemble methods where multiple hyperparameters interact with each other

Random Forest Hyperparameters

  Parameter             scikit-learn       What it controls
  Number of trees       n_estimators       More trees = more stable predictions, but slower
  Features per split    max_features       Most important: controls \(m_{try}\), the randomness at each split
  Max tree depth        max_depth          How deep each tree can grow (limits complexity)
  Min samples to split  min_samples_split  A node must have at least this many observations to be split
  Min samples in leaf   min_samples_leaf   Each leaf must contain at least this many observations
  Bootstrap             bootstrap          Whether to use bootstrap sampling (True) or full dataset (False)
  Max leaf nodes        max_leaf_nodes     Cap on total number of leaves per tree
  • max_features is the most important — it controls the bias-variance tradeoff
    • Small max_features: trees are more different (less correlated), but individually weaker
    • Large max_features: trees are stronger individually, but more similar to each other
    • Rule of thumb: \(p/3\) for regression, \(\sqrt{p}\) for classification

Tuning Random Forests

  • Important: max_features (equivalent to \(m_{try}\))
  • Marginal: tree complexity, splitting rule, sampling scheme
from sklearn.model_selection import GridSearchCV

p = X.shape[1]
param_grid = {
    "max_features": list(range(2, p + 1, 2)),
}
rf = RandomForestRegressor(
    n_estimators=500, random_state=42
)
cv_rf = GridSearchCV(
    rf, param_grid, cv=5,
    scoring="neg_root_mean_squared_error"
)
cv_rf.fit(X, y)
print(f"Best max_features: "
      f"{cv_rf.best_params_['max_features']}")
print(f"Best CV RMSE: "
      f"{-cv_rf.best_score_:.4f}")
Best max_features: 18
Best CV RMSE: 0.6467
import matplotlib.pyplot as plt
results = pd.DataFrame(cv_rf.cv_results_)
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(
    results["param_max_features"].astype(int),
    -results["mean_test_score"],
    marker="o",
)
ax.set_xlabel("max_features (mtry)")
ax.set_ylabel("CV RMSE")
ax.set_title("RF Tuning")
plt.tight_layout()
plt.show()

Variable Importance

  • After fitting a random forest, we want to know: which features matter most?
  • Two common approaches:
  1. Impurity-based (Gain): total reduction in the splitting criterion (e.g. RSS for regression, Gini for classification) each time a feature is used to split, averaged over all trees
  2. Permutation-based: randomly shuffle one feature’s values and measure how much the model’s accuracy drops — bigger drop = more important
  • Impurity-based importance is fast (computed during training) but can be biased toward high-cardinality features
  • Permutation importance is more reliable but slower (requires re-prediction)

Variable Importance: MLB Example

from sklearn.inspection import permutation_importance

# Impurity-based (gain)
gain_imp = pd.Series(
    oob_rf.feature_importances_, index=X.columns
).sort_values(ascending=True)

# Permutation-based
perm = permutation_importance(
    oob_rf, X, y, n_repeats=10, random_state=42
)
perm_imp = pd.Series(
    perm.importances_mean, index=X.columns
).sort_values(ascending=True)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
gain_imp.plot.barh(ax=axes[0])
axes[0].set_xlabel("Mean decrease in RSS (Gain)")
axes[0].set_title("Impurity-based")
perm_imp.plot.barh(ax=axes[1])
axes[1].set_xlabel("Mean decrease in R²")
axes[1].set_title("Permutation-based")
plt.tight_layout()
plt.show()

Boosting

Build ensemble models sequentially

  • start with a weak learner, e.g. small decision tree with few splits
  • each model in the sequence slightly improves upon the predictions of the previous models by focusing on the observations with the largest errors / residuals

Boosted trees algorithm

Write the prediction at step \(t\) of the search as \(\hat y_i^{(t)}\), start with \(\hat y_i^{(0)} = 0\)

  • Fit the first decision tree \(f_1\) to the data: \(\hat y_i^{(1)} = f_1(x_i) = \hat y_i^{(0)} + f_1(x_i)\)
  • Fit the next tree \(f_2\) to the residuals of the previous: \(y_i - \hat y_i^{(1)}\)

  • Add this to the prediction: \(\hat y_i^{(2)} = \hat y_i^{(1)} + f_2(x_i) = f_1(x_i) + f_2(x_i)\)

  • Fit the next tree \(f_3\) to the residuals of the previous: \(y_i - \hat y_i^{(2)}\)

  • Add this to the prediction: \(\hat y_i^{(3)} = \hat{y}_i^{(2)} + f_3(x_i) = f_1(x_i) + f_2(x_i) + f_3(x_i)\)

Continue until some stopping criteria to reach final model as a sum of trees:

\[\hat{y}_i = f(x_i) = \sum_{b=1}^{B} f_b(x_i)\]
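
The fit-to-residuals loop can be sketched by hand, using shallow scikit-learn trees as the weak learners (synthetic data; names are ours):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_sin = rng.uniform(0, 10, size=(200, 1))
y_sin = np.sin(X_sin[:, 0]) + rng.normal(0, 0.2, 200)

B = 20
pred = np.zeros(len(y_sin))                 # y_hat^(0) = 0
mse_per_round = []
for b in range(B):
    resid = y_sin - pred                    # residuals of the current ensemble
    stump = DecisionTreeRegressor(max_depth=2, random_state=b).fit(X_sin, resid)
    pred += stump.predict(X_sin)            # y_hat^(t) = y_hat^(t-1) + f_t(x)
    mse_per_round.append(np.mean((y_sin - pred) ** 2))

# training error shrinks as trees are added to the sum
print(f"MSE after 1 tree: {mse_per_round[0]:.4f}; after {B}: {mse_per_round[-1]:.4f}")
```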

Visual example of boosting in action

Gradient boosted trees

The regression boosting algorithm can be generalized to other loss functions via gradient descent, leading to gradient boosted trees, aka gradient boosting machines (GBMs)

Update the model parameters in the direction of the loss function’s descending gradient

Tune the learning rate in gradient descent

We need to control how much we update by in each step - the learning rate
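
A toy illustration, minimizing a simple quadratic rather than a real model loss, showing how the learning rate trades off convergence speed against stability:

```python
def gradient_descent(lr, steps=100, w=0.0):
    """Minimize f(w) = (w - 3)^2, whose gradient is 2(w - 3)."""
    for _ in range(steps):
        w -= lr * 2 * (w - 3)   # step opposite the gradient, scaled by lr
        if abs(w) > 1e6:        # step size too large: the iterates blow up
            return float("inf")
    return w

print(gradient_descent(lr=0.1))    # converges to the minimum at w = 3
print(gradient_descent(lr=0.001))  # too small: barely moves in 100 steps
print(gradient_descent(lr=1.1))    # too large: overshoots and diverges
```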

Stochastic gradient descent can help with complex loss functions

  • Batch gradient descent computes the gradient using all \(N\) observations — expensive, and can get stuck in local minima

  • Stochastic GD randomly samples a subset of data each iteration

  • The gradient estimate is noisier, which actually helps:

    • Escape local minima and saddle points
    • Each update is cheaper to compute
    • Adds a regularization effect — noisy updates prevent overfitting

eXtreme gradient boosting with XGBoost

GBM/XGB Hyperparameters

  Parameter          XGBoost                   What it controls
  Number of trees    n_estimators              Total boosting rounds; more trees = more expressive but risk overfitting
  Learning rate      learning_rate (\(\eta\))  Shrinkage per step; smaller = slower learning, needs more trees
  Max depth          max_depth                 Depth of each tree; controls interaction order (depth \(d\) captures \(d\)-way interactions)
  Min child weight   min_child_weight          Minimum sum of instance weights in a leaf; acts like min_samples_leaf
  Subsample ratio    subsample                 Fraction of rows sampled per tree (stochastic gradient descent)
  Column subsample   colsample_bytree          Fraction of features sampled per tree (similar to RF’s max_features)
  L2 regularization  reg_lambda (\(\lambda\))  Ridge penalty on leaf weights; prevents large predictions
  L1 regularization  reg_alpha (\(\alpha\))    Lasso penalty on leaf weights; encourages sparsity
  Min split loss     gamma (\(\gamma\))        Minimum loss reduction required to make a split; acts as pruning
  • n_estimators and learning_rate must be tuned together — lower learning rate needs more trees
  • Rule of thumb: set learning_rate small (0.01–0.1), then find the right n_estimators via early stopping
  • More work to tune than random forests, but GBMs offer more flexibility for different objective functions
  • In XGBoost, stochastic gradient descent is controlled by:
    • subsample: fraction of rows sampled per tree (e.g., 0.8 = 80%)
    • colsample_bytree: fraction of features sampled per tree
  • Both default to 1.0 (full data) — setting them below 1.0 enables stochastic updates

XGBoost example

from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings("ignore")

xgb_param_grid = {
    "n_estimators": list(range(20, 201, 20)),
    "learning_rate": [0.025, 0.05, 0.1, 0.3],
    "max_depth": [1, 2, 3, 4],
}

xgb = XGBRegressor(
    objective="reg:squarederror",
    gamma=0,
    colsample_bytree=1,
    min_child_weight=1,
    subsample=1,
    random_state=1937,
    verbosity=0,
)

xgb_cv = GridSearchCV(
    xgb, xgb_param_grid, cv=5,
    scoring="neg_root_mean_squared_error",
    n_jobs=-1,
)
xgb_cv.fit(X, y)
print("Best parameters:", xgb_cv.best_params_)
print(f"Best CV RMSE: {-xgb_cv.best_score_:.4f}")
print(f"Training R²:  {xgb_cv.best_estimator_.score(X, y):.4f}")
Best parameters: {'learning_rate': 0.3, 'max_depth': 1, 'n_estimators': 200}
Best CV RMSE: 0.4621
Training R²:  0.9951

XGBoost Variable Importance

xgb_fit_final = xgb_cv.best_estimator_

importances_xgb = pd.Series(xgb_fit_final.feature_importances_, index=X.columns)
importances_xgb = importances_xgb.sort_values(ascending=True)

fig, ax = plt.subplots(figsize=(8, 6))
importances_xgb.plot.barh(ax=ax)
ax.set_xlabel("Importance")
ax.set_title("XGBoost Variable Importance")
plt.tight_layout()
plt.show()

Partial Dependence Plots (PDPs)

  • Variable importance tells us which features matter, but not how they affect predictions
  • Partial dependence plots show the marginal effect of a feature on the predicted outcome
  • How it works: for a feature \(x_j\), evaluate the model at each value of \(x_j\) while averaging over all other features: \[\hat{f}_j(x_j) = \frac{1}{N} \sum_{i=1}^{N} \hat{f}(x_j, \, x_{i,-j})\]
  • Flat line → feature has little effect on predictions
  • Steep slope → predictions are sensitive to that feature
  • Non-linear shape → model learned a relationship that linear regression would miss
  • PDPs work with any model (random forests, GBMs, etc.), not just XGBoost
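
A from-scratch sketch of this averaging on synthetic data, before using scikit-learn's built-in tools (the helper below is ours, not a library function):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(300, 3))
y_syn = X_syn[:, 0] ** 2 + X_syn[:, 1] + rng.normal(0, 1, 300)

rf_syn = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_syn, y_syn)

def partial_dependence_curve(model, X, j, grid):
    """f_hat_j(v) = average of f_hat(v, x_{i,-j}) over all rows i."""
    curve = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, j] = v                     # pin feature j to v for every row
        curve.append(model.predict(X_mod).mean())
    return np.array(curve)

grid = np.linspace(0, 10, 5)
print(partial_dependence_curve(rf_syn, X_syn, j=0, grid=grid))  # rises roughly like v^2
```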

Partial Dependence: MLB Example

This is the partial dependence plot for the off variable

from sklearn.inspection import PartialDependenceDisplay

fig, ax = plt.subplots(figsize=(8, 5))
PartialDependenceDisplay.from_estimator(
    xgb_fit_final, X, features=["off"], ax=ax
)
ax.set_title("Partial Dependence: off")
plt.tight_layout()
plt.show()

Training and Testing: The Big Picture

A proper ML workflow separates data into distinct roles:

  1. Split the data into a training set (~70–80%) and a test set (~20–30%) before doing anything
  2. Tune hyperparameters using only the training set (via cross-validation or OOB)
  3. Refit the final model on the full training set with the best hyperparameters
  4. Evaluate on the held-out test set — this is your honest estimate of real-world performance
  • The test set must be completely untouched during training and tuning — otherwise your performance estimate is biased
  • If you tune on the test set, you are effectively fitting to the test data and your reported metrics will be overly optimistic
  • Next week: more on cross-validation strategies, train/validation/test splits, and how to avoid data leakage
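
The four steps can be sketched end-to-end; `make_regression` below is a synthetic stand-in for a real dataset:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

# 1. Split first; the test set stays untouched from here on
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2. Tune hyperparameters by cross-validation on the training set only
cv = GridSearchCV(
    RandomForestRegressor(n_estimators=100, random_state=0),
    {"max_features": [3, 5, 10]},
    cv=5,
)
cv.fit(X_train, y_train)

# 3. GridSearchCV refits the best model on the full training set (refit=True)
# 4. Evaluate once on the held-out test set
print(f"best max_features: {cv.best_params_['max_features']}")
print(f"test R²: {cv.score(X_test, y_test):.3f}")
```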