Homework 4 - Machine Learning and Parallel Computing
Due Date
March 29, 2026 by 11:59pm.
Learning Objectives
- Apply feature engineering to prepare a real-world dataset for modeling.
- Tune hyperparameters correctly using cross-validation on the training set.
- Compare ML models on a regression task.
- Parallelize embarrassingly parallel workloads using `joblib`.
Deliverables
Please answer all questions and interpret your findings. Upload both the .qmd and rendered .html to Quercus.
Part 1: Machine Learning with the Ames Housing Dataset
The Ames Housing Dataset contains 81 features describing residential properties sold in Ames, Iowa between 2006 and 2010. Our goal is to predict `SalePrice`. Please use `fetch_openml` to get the data from OpenML rather than Kaggle (https://www.openml.org/search?type=data&status=active&id=43926): use `fetch_openml(name="house_prices", version=1, as_frame=True, parser="auto")`.
Question 1: Data Preparation and Feature Engineering (14 points)
1a. (3 points) Explore the target variable and the features.
- Plot the distribution of `SalePrice` and `log(SalePrice)` side by side. Report the skewness of each and indicate whether logging might be preferred for modeling.
- Print a summary table of the categorical columns showing: column name, number of unique values, and first few values. Use this to identify which columns are truly nominal (no order, e.g. `Neighborhood`) vs ordinal quality ratings (e.g. `ExterQual` with values `Ex`/`Gd`/`TA`/`Fa`/`Po`). Write a brief interpretation.
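As a sanity check on the skewness comparison, here is a minimal sketch on synthetic lognormal data (a stand-in for `SalePrice`; the seed and distribution parameters are arbitrary, not from the assignment):

```python
import numpy as np
import pandas as pd

# Lognormal stand-in for a right-skewed price variable: a log transform
# should pull the skewness toward zero.
rng = np.random.default_rng(370)
prices = pd.Series(np.exp(rng.normal(12.0, 0.4, size=2000)))

raw_skew = prices.skew()           # strongly positive for lognormal data
log_skew = np.log(prices).skew()   # near zero after the transform
print(f"raw: {raw_skew:.2f}, log: {log_skew:.2f}")
```

On the real data, the same `.skew()` comparison motivates using log `SalePrice` as the target.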
1b. (8 points) Handle missing values, engineer features, encode ordinal columns, set the target, and dummy encode the remaining nominal categoricals. Complete all steps:
- Fill numeric missing values with `0`; fill categorical missing values with `'None'`.
- Create the derived features below, then drop the listed component columns:
| New feature | Formula | Columns to drop |
|---|---|---|
| `TotalSF` | `TotalBsmtSF + 1stFlrSF + 2ndFlrSF` | `TotalBsmtSF`, `1stFlrSF`, `2ndFlrSF` |
| `HouseAge` | `YrSold - YearBuilt` | `YearBuilt` |
| `RemodAge` | `YrSold - YearRemodAdd` | `YearRemodAdd` |
Note: keep `YrSold` (it captures market conditions). `PoolArea` and `GarageArea` already encode existence and size, so no binary flag is needed.
- Ordinal encode the quality/condition columns by mapping their string levels to integers. Many features use the scale `Ex > Gd > TA > Fa > Po`, and encoding them as numbers (5–1) preserves the ordering that dummy-variable encoding would discard. Use the mappings below:

```python
qual_map = {'None': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
qual_cols = ['ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond',
             'HeatingQC', 'KitchenQual', 'FireplaceQu',
             'GarageQual', 'GarageCond', 'PoolQC']

other_ordinal = {
    'BsmtExposure': {'None': 0, 'No': 1, 'Mn': 2, 'Av': 3, 'Gd': 4},
    'BsmtFinType1': {'None': 0, 'Unf': 1, 'LwQ': 2, 'Rec': 3, 'BLQ': 4, 'ALQ': 5, 'GLQ': 6},
    'BsmtFinType2': {'None': 0, 'Unf': 1, 'LwQ': 2, 'Rec': 3, 'BLQ': 4, 'ALQ': 5, 'GLQ': 6},
    'GarageFinish': {'None': 0, 'Unf': 1, 'RFn': 2, 'Fin': 3},
    'Fence': {'None': 0, 'MnWw': 1, 'GdWo': 2, 'MnPrv': 3, 'GdPrv': 4},
    'Functional': {'Sal': 1, 'Sev': 2, 'Maj2': 3, 'Maj1': 4,
                   'Mod': 5, 'Min2': 6, 'Min1': 7, 'Typ': 8},
    'LotShape': {'IR3': 1, 'IR2': 2, 'IR1': 3, 'Reg': 4},
    'LandSlope': {'Sev': 1, 'Mod': 2, 'Gtl': 3},
    'PavedDrive': {'N': 0, 'P': 1, 'Y': 2},
}
```

- Use log `SalePrice` as your target and drop `SalePrice`. Create dummy variables from the remaining `object`-dtype columns (the nominal categoricals) using `pd.get_dummies(...)`. Print the final feature count and summarize the engineered features (no need to summarize all of the dummy variables).
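A minimal sketch of the two encoding steps on a toy frame; the two columns are stand-ins for one ordinal and one nominal Ames feature:

```python
import pandas as pd

# Toy frame: ExterQual is ordinal, Neighborhood is nominal.
df = pd.DataFrame({
    'ExterQual': ['Gd', 'TA', 'Ex', 'None'],
    'Neighborhood': ['NAmes', 'OldTown', 'NAmes', 'Edwards'],
})

qual_map = {'None': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
df['ExterQual'] = df['ExterQual'].map(qual_map)     # ordinal: order preserved as integers
df = pd.get_dummies(df, columns=['Neighborhood'])   # nominal: one dummy per level
print(sorted(df.columns))
```

The same pattern, looped over `qual_cols` and `other_ordinal` and then one `pd.get_dummies` call on whatever `object` columns remain, completes step 1b.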
1c. (3 points) Split into 70% train and 30% test (set `random_state=370`). Print the shapes of `X_train`, `X_test`, `y_train`, `y_test`. Report the mean and standard deviation of `y_train` and `y_test`.
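The split itself is one call; this sketch uses placeholder arrays in place of the engineered feature matrix and log-price target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows, 2 features. Substitute your engineered X and y.
X = np.arange(200).reshape(100, 2)
y = np.arange(100, dtype=float)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=370)
print(X_train.shape, X_test.shape, y_train.mean(), y_test.mean())
```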
Question 2: Baseline — Ridge Regression (10 points)
2 (10 points) Write a brief description of what ridge regression is, then fit a `RidgeCV` model using `alphas=np.logspace(-3, 3, 50)` and `cv=5`. Explain what you are tuning and report the chosen `alpha_`. Then estimate the 5-fold CV RMSE using the chosen alpha. On the test set, report R\(^2\) and RMSE (log scale), then convert the log-scale predictions back to dollars and report test R\(^2\) and RMSE in dollars. Interpret the model performance metrics.
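A sketch of the `RidgeCV` workflow on synthetic data (the dataset and sizes here are placeholders; with the real data the target is log `SalePrice`, so `np.exp()` converts predictions back to dollars):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the Ames features/target.
X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=370)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=370)

# RidgeCV tunes the L2 penalty strength alpha over the supplied grid via 5-fold CV.
ridge = RidgeCV(alphas=np.logspace(-3, 3, 50), cv=5).fit(X_tr, y_tr)
pred = ridge.predict(X_te)
rmse = np.sqrt(mean_squared_error(y_te, pred))
print(ridge.alpha_, r2_score(y_te, pred), rmse)
```

On the real problem, dollar-scale metrics come from comparing `np.exp(pred)` against `np.exp(y_te)`.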
Question 3: Random Forest (12 points)
3a. (10 points) Tune a `RandomForestRegressor` by trying two values of `n_estimators`, three values of `max_features`, two values of `max_depth`, and two values of `min_samples_leaf` (24 combinations). Write a function `fit_rf_cv(n_est, max_feat, max_d, min_leaf)` that fits the model and returns the CV RMSE, then run the grid search both serially and in parallel with `Parallel(n_jobs=4)`. Keep `random_state=370` and `n_jobs=1` inside the model. Report the serial time, parallel time, and speedup. Report the best combination in a table. Then fit the best random forest on the full training set and report test R\(^2\) and RMSE (log scale and dollars). Interpret and discuss.
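The outer-parallel pattern can be sketched as follows, on synthetic data and with a smaller demo grid than the 24 combinations the question asks for (the grid values here are illustrative, not prescribed):

```python
from itertools import product

from joblib import Parallel, delayed
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (X_train, y_train).
X, y = make_regression(n_samples=120, n_features=8, noise=10.0, random_state=370)

def fit_rf_cv(n_est, max_feat, max_d, min_leaf):
    # One grid point: mean 5-fold CV RMSE for these hyperparameters.
    # n_jobs=1 inside the model so parallelism lives only at the grid level.
    rf = RandomForestRegressor(n_estimators=n_est, max_features=max_feat,
                               max_depth=max_d, min_samples_leaf=min_leaf,
                               random_state=370, n_jobs=1)
    scores = cross_val_score(rf, X, y, cv=5,
                             scoring='neg_root_mean_squared_error', n_jobs=1)
    return (n_est, max_feat, max_d, min_leaf), -scores.mean()

grid = list(product([25, 50], ['sqrt', 0.5], [5, None], [1, 2]))  # demo grid
results = Parallel(n_jobs=4)(delayed(fit_rf_cv)(*g) for g in grid)
best_params, best_rmse = min(results, key=lambda r: r[1])
print(best_params, best_rmse)
```

Running the same comprehension without `Parallel` (a plain list comprehension over `grid`) gives the serial baseline for the timing comparison.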
3b. (2 points) Plot the top 15 features by importance. Do the most important features make intuitive sense for predicting house prices?
Question 4: XGBoost with Parallel Hyperparameter Search (12 points)
4a. (8 points) Define a hyperparameter grid with at least 3 values for each of `max_depth`, `learning_rate`, `n_estimators`, and `subsample` (minimum 81 combinations). Print the total number of combinations. Write a function `fit_xgb_cv(params)` that builds an `XGBRegressor` with the given params (plus `n_jobs=1`, `random_state=370`, `verbosity=0`), runs 5-fold CV with `scoring='neg_root_mean_squared_error'` and `n_jobs=1` on `(X_train, y_train)`, and returns `(params, mean_cv_rmse)`. Run the grid search both serially and in parallel with `Parallel(n_jobs=4)`. Report the serial time and parallel speedup. Find the best hyperparameter combination (lowest mean CV RMSE), refit on the full training set, and report test R\(^2\) and RMSE in log scale and dollars. Interpret and discuss.
4b. (2 points) Now try inner-level parallelism: write `fit_xgb_cv_inner(params)` that uses `n_jobs=4` on the `XGBRegressor` itself (so each model uses 4 threads internally) and `n_jobs=1` in `cross_val_score`, then run the grid search serially. Report the wall-clock time and compare it to the serial (`n_jobs=1`) and outer-parallel (`Parallel(n_jobs=4)`) approaches from 4a. Which strategy is fastest and why?
4c. (2 points) Plot the top 15 features. How do they compare to the random forest variable importance?
Question 5: Model Comparison (6 points)
5a. (6 points) Assemble a summary table with columns: Model, Best CV RMSE (log), Test RMSE (log), Test RMSE ($), Test R\(^2\). Include Ridge, Random Forest, and XGBoost. On a single figure with three side-by-side panels, plot predicted vs. actual `SalePrice` (in dollars) for Ridge, Random Forest, and XGBoost. Add a 1-to-1 reference line to each panel. Interpret and discuss your findings.
Part 2: Parallel Functions
Question 6: Monte Carlo pi Estimation (12 points)
6a. (4 points) Write `estimate_pi(N, seed=None)` that:
- Draws `N` uniform random points in [0,1] × [0,1]
- Counts how many satisfy x² + y² ≤ 1 (inside the quarter circle)
- Returns 4 × (count inside) / N
Run it with N = 1,000, 100,000, and 1,000,000. For each, report the estimate, the absolute error from np.pi, and the serial time. How does accuracy scale with N? How does run time scale?
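One way to sketch `estimate_pi` (Monte Carlo error shrinks like 1/√N, while run time grows roughly linearly in N):

```python
import numpy as np

def estimate_pi(N, seed=None):
    # Monte Carlo estimate: the fraction of uniform points in the unit square
    # that land inside the quarter circle approximates pi/4.
    rng = np.random.default_rng(seed)
    x = rng.uniform(size=N)
    y = rng.uniform(size=N)
    inside = np.count_nonzero(x**2 + y**2 <= 1)
    return 4 * inside / N

for N in (1_000, 100_000, 1_000_000):
    est = estimate_pi(N, seed=0)
    print(N, est, abs(est - np.pi))
```

Wrapping each call in a timer (e.g. `time.perf_counter()`) gives the serial timings the question asks for.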
6b. (4 points) Run estimate_pi(N=100_000) serially with seeds 0–1999 (2,000 runs). Record the serial time. Report the mean and standard deviation of the estimates.
6c. (4 points) Repeat 6b in parallel with `Parallel(n_jobs=4, prefer="threads")`. Report the speedup. Verify the mean estimate is within 0.001 of the serial result. Why is `prefer="threads"` a good choice here? Is the speedup closer to 4× than it was for the XGBoost grid search? Why?
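The threaded version can be sketched as below; it uses 200 seeds to keep the demo quick, whereas the question asks for seeds 0–1999:

```python
import numpy as np
from joblib import Parallel, delayed

def estimate_pi(N, seed=None):
    rng = np.random.default_rng(seed)
    pts = rng.uniform(size=(N, 2))
    return 4 * np.count_nonzero((pts ** 2).sum(axis=1) <= 1) / N

# Threads work well here: NumPy releases the GIL inside its vectorized
# kernels, and the threaded backend avoids pickling work to child processes.
estimates = Parallel(n_jobs=4, prefer="threads")(
    delayed(estimate_pi)(100_000, seed=s) for s in range(200))
print(np.mean(estimates), np.std(estimates))
```

Because each seed fully determines its estimate, the parallel mean matches the serial mean exactly for the same seed set.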
Question 7: Bootstrap Confidence Interval for Test RMSE (12 points)
Bootstrapping gives a confidence interval for any model metric without making distributional assumptions. Each resample is independent, making it naturally parallel.
7a. (4 points) Using the best XGBoost model fitted in Question 4, write a function `boot_rmse(idx)` that:
- Selects rows `idx` from `X_test` and `y_test` (resample with replacement)
- Predicts with the best XGBoost model
- Returns the RMSE on the resample (log scale)
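A sketch of 7a–7b with synthetic stand-ins for `y_test` and the fitted model's predictions (with the real model you would precompute `y_pred = best_xgb.predict(X_test)` once, so each resample only re-evaluates the metric):

```python
import numpy as np

# Synthetic log-scale targets and predictions standing in for the real model.
rng = np.random.default_rng(370)
y_test = rng.normal(12.0, 0.4, size=400)
y_pred = y_test + rng.normal(0.0, 0.15, size=400)

def boot_rmse(idx):
    # RMSE (log scale) on one bootstrap resample of the test rows.
    return float(np.sqrt(np.mean((y_test[idx] - y_pred[idx]) ** 2)))

# Pre-generate the 1,000 index arrays with a fixed seed, as 7b requires.
idx_arrays = [rng.integers(0, len(y_test), size=len(y_test))
              for _ in range(1000)]
rmses = [boot_rmse(idx) for idx in idx_arrays]
lo, hi = np.percentile(rmses, [2.5, 97.5])
print(lo, hi)
```

Pre-generating the index arrays is what makes the serial and parallel runs in 7b/7c directly comparable.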
7b. (4 points) Generate 1,000 bootstrap index arrays using a fixed random seed, run boot_rmse serially, and compute the 95% percentile bootstrap CI for test RMSE. Time the computation.
7c. (4 points) Run the same 1,000 bootstrap resamples in parallel with `Parallel(n_jobs=4, prefer="threads")`. Compare the CI to 7b and report the speedup. Why is `prefer="threads"` important here, and what happens if you use the default process backend instead? Do the serial and parallel CIs agree? Should they, given that both use the same pre-generated index arrays?