House Price Prediction: the Kaggle classic as an ML workflow laboratory

End-to-end regression on the Ames, Iowa Kaggle dataset. 79 features, 11 models evaluated, ensemble of linear + tree methods. Built as a systematic study of the full ML workflow, not a leaderboard chase.

Date: March 2025
Status: Completed
Version: 1.0

Overview

The Kaggle "House Prices: Advanced Regression Techniques" competition is the classic bridge from toy ML tutorials to real-world modeling. 79 explanatory variables, 1,460 training rows describing houses in Ames, Iowa, and a sale price to predict. My final submission didn't reach the top of the leaderboard. That wasn't the point.

The point was building the systematic ML workflow I'd keep using after the competition ended: preprocessing pipelines that handle messy missing data, feature engineering that actually helps instead of inflating dimensionality, model evaluation that survives cross-validation, and ensembles that work because the individual models disagree meaningfully. The Kaggle score became a check, not a target.

The 79 features, honestly

Toy datasets have five clean columns and no missing values. This one has 79 columns spanning integer counts, categorical ratings, ordinal scales, timestamps, and free-form strings. Missing data is pervasive and semantically loaded: PoolQC is NA for most houses because they simply have no pool (structural absence), while a missing LotFrontage just wasn't recorded (genuinely missing).

The preprocessing pipeline encoded the distinction explicitly: structural absence became a category like "None", true randomness got mean imputation plus a missing-indicator flag so the model could detect patterns in the absence itself. That indicator flag trick alone meaningfully improved RMSE on linear models.
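
A minimal sketch of that split. The column groupings here are illustrative, not the project's full lists, and in the real pipeline the imputation means would be fit on the training split only:

    import pandas as pd

    # Assumed, illustrative groups: NA in these columns means "the feature doesn't exist"
    STRUCTURAL_NA = ["PoolQC", "FireplaceQu", "BsmtQual", "GarageType", "Fence"]
    # NA in these columns is genuinely unrecorded information
    RANDOM_NA = ["LotFrontage", "MasVnrArea"]

    def handle_missing(df: pd.DataFrame) -> pd.DataFrame:
        df = df.copy()
        # Structural absence becomes an explicit category
        df[STRUCTURAL_NA] = df[STRUCTURAL_NA].fillna("None")
        for col in RANDOM_NA:
            # Indicator flag first, so the model can learn from the missingness itself
            df[col + "_was_missing"] = df[col].isna().astype(int)
            # Then plain mean imputation
            df[col] = df[col].fillna(df[col].mean())
        return df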

Feature engineering

The real work. Six techniques, each fixing a specific failure mode of naive preprocessing:

At a glance: 79 raw features · log + Yeo-Johnson skew transforms · target-based categorical encoding · Lasso feature selection.

Temporal transformations. Raw YearBuilt is not what you want. A house built in 1960 and sold in 1970 is "10 years old at sale." One built in 2000 and sold in 2010 is also 10 years old. Converting to relative age at sale unified them.
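
In pandas this is a two-liner (YrSold, YearBuilt, and YearRemodAdd are the dataset's own columns):

    # Relative age at the moment of sale instead of raw years
    df["HouseAge"] = df["YrSold"] - df["YearBuilt"]
    df["YearsSinceRemodel"] = df["YrSold"] - df["YearRemodAdd"]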

Quality ordinal mappings. ExterQual values (Po, Fa, TA, Gd, Ex) carry implicit order. Encoding them as 1 to 5 preserves the rank; one-hot encoding throws it away.
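
Roughly like this; the list of quality columns is illustrative, since several Ames features share the same five-level scale:

    QUALITY_SCALE = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
    for col in ["ExterQual", "ExterCond", "KitchenQual", "HeatingQC"]:
        df[col] = df[col].map(QUALITY_SCALE)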

Skew transformations. Many numeric features (lot area, living area, sale price) are right-skewed. Log transformation for strictly positive features, Yeo-Johnson for features with zeros. Linear models that assume normality went from unusable to competitive after this one step.
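
A sketch of that branching logic; the 0.75 skewness threshold is an assumption rather than the project's exact cutoff, and numeric_cols is an assumed list of numeric feature names:

    import numpy as np
    from sklearn.preprocessing import PowerTransformer

    # numeric_cols: list of numeric feature names, assumed defined upstream
    skew = df[numeric_cols].skew().abs()
    for col in skew[skew > 0.75].index:
        if (df[col] > 0).all():
            # Strictly positive: plain log transform
            df[col] = np.log(df[col])
        else:
            # Zeros present: Yeo-Johnson handles them
            df[col] = PowerTransformer(method="yeo-johnson").fit_transform(df[[col]]).ravel()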

Target-based encoding. High-cardinality categoricals like Neighborhood have too many classes for one-hot. I encoded each category by its mean sale price in the training set (carefully, using cross-validation folds to avoid target leakage). Neighborhood became one numeric column carrying the price signal.
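
A sketch of the out-of-fold version. The function name and fold count are mine; the point is that each row's encoding is computed on folds that exclude it:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import KFold

    def target_encode(train: pd.DataFrame, col: str, target: str, n_splits: int = 5) -> pd.Series:
        # Each row gets the category mean computed on the *other* folds,
        # so its own SalePrice never leaks into its encoding.
        encoded = pd.Series(np.nan, index=train.index)
        for fit_idx, enc_idx in KFold(n_splits, shuffle=True, random_state=42).split(train):
            fold_means = train.iloc[fit_idx].groupby(col)[target].mean()
            encoded.iloc[enc_idx] = train.iloc[enc_idx][col].map(fold_means).values
        # Categories unseen in a fold fall back to the global mean
        return encoded.fillna(train[target].mean())

    train["Neighborhood_enc"] = target_encode(train, "Neighborhood", "SalePrice")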

Rare-category grouping. Categories appearing in less than 1% of rows got bucketed as "Rare". Prevents the model from memorizing noise in categories it's seen three times.
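
Something like the following, with the threshold matching the 1% rule above; categorical_cols is an assumed list of categorical feature names:

    def group_rare(series: pd.Series, min_frac: float = 0.01) -> pd.Series:
        # Bucket any category seen in under 1% of rows into a single "Rare" label
        freqs = series.value_counts(normalize=True)
        rare = freqs[freqs < min_frac].index
        return series.where(~series.isin(rare), "Rare")

    for col in categorical_cols:  # assumed list of categorical feature names
        df[col] = group_rare(df[col])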

Lasso-based feature selection. After engineering, I had more columns than before. SelectFromModel(Lasso(alpha=0.001)) dropped the ones carrying no predictive signal, reducing dimensionality without hand-picking.
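
Roughly as below. X_train and y_train are assumed names for the engineered design matrix and target; standardizing first is my addition, since Lasso penalizes coefficients on the features' raw scales:

    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import Lasso
    from sklearn.preprocessing import StandardScaler

    # X_train: engineered feature DataFrame, y_train: target (assumed defined upstream)
    X_scaled = StandardScaler().fit_transform(X_train)
    selector = SelectFromModel(Lasso(alpha=0.001, max_iter=10000)).fit(X_scaled, y_train)
    kept_features = X_train.columns[selector.get_support()]
    X_selected = selector.transform(X_scaled)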

Model bakeoff

Eleven algorithms, same preprocessing pipeline, same 5-fold CV, same three metrics (RMSE, MSE, R²):

  • Linear family. Linear Regression, Ridge, Lasso, ElasticNet.
  • Tree family. Random Forest, Gradient Boosting, XGBoost, LightGBM, CatBoost.
  • Others. SVR, K-Nearest Neighbors.
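
A sketch of that evaluation loop. The model zoo below is abbreviated, the hyperparameters are placeholders rather than the tuned values, and XGBoost, LightGBM, and CatBoost come from their own packages:

    from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
    from sklearn.linear_model import Lasso, LinearRegression, Ridge
    from sklearn.model_selection import KFold, cross_validate

    # X, y: preprocessed feature matrix and target (assumed defined upstream)
    models = {
        "linear": LinearRegression(),
        "ridge": Ridge(alpha=10),
        "lasso": Lasso(alpha=0.001, max_iter=10000),
        "rf": RandomForestRegressor(n_estimators=500, random_state=42),
        "gbm": GradientBoostingRegressor(random_state=42),
        # ... plus XGBRegressor, LGBMRegressor, CatBoostRegressor, SVR, KNeighborsRegressor
    }

    cv = KFold(n_splits=5, shuffle=True, random_state=42)
    scoring = {"rmse": "neg_root_mean_squared_error", "mse": "neg_mean_squared_error", "r2": "r2"}

    for name, model in models.items():
        scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
        print(f"{name:>6}  RMSE={-scores['test_rmse'].mean():.4f}  R2={scores['test_r2'].mean():.3f}")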

The surprise: well-preprocessed linear models landed within striking distance of the tree-based methods. Lasso with the full feature engineering pipeline was competitive with out-of-the-box XGBoost. The skew transforms and target encoding were doing most of the heavy lifting.

The tree-based methods still won. They handle interactions automatically without needing to hand-craft them. But by a smaller margin than I'd expected.

Ensemble

Simple averaging of the top performers: XGBoost, LightGBM, Gradient Boosting, and the tuned Lasso. np.mean(predictions, axis=1) across their test-set outputs.
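
Concretely, with the already-fitted base models (the variable names here are assumptions):

    import numpy as np

    # Fitted base models; names are illustrative
    base_models = [xgb_model, lgbm_model, gbm_model, lasso_model]

    # One column of test-set predictions per model, averaged row-wise
    predictions = np.column_stack([m.predict(X_test) for m in base_models])
    ensemble_pred = np.mean(predictions, axis=1)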

What made this ensemble work wasn't the averaging trick; it was that the base models made different kinds of errors. Linear models underfit complex interactions but overfit less on tail categories. Tree methods capture interactions but overfit on rare splits. Averaging pulls the combined error toward the mean of their biases rather than toward zero: the uncorrelated mistakes largely cancel, while the shared bias remains.

What I learned

  • Feature engineering > model choice. The jump from raw features to log-transformed + target-encoded + Lasso-selected features was bigger than the jump from Ridge to XGBoost.
  • Missing data is semantic, not statistical. "Why is this NA?" beats "which imputation technique?"
  • Linear methods deserve a second look. With good preprocessing they're interpretable, fast, and competitive. Reaching for XGBoost first is a habit worth questioning.
  • Ensemble diversity is everything. Four similar models don't help. Two very different models usually do.
  • The workflow matters more than the score. A reproducible pipeline (preprocess → evaluate → tune → ensemble) is the transferable asset. The Kaggle score is a side effect.

Full write-up with the preprocessing pipeline and per-model CV tables is on Medium. Source on GitHub.
