The brief
Dinas PUPR Kabupaten Bangli (the Public Works office for Bangli Regency in Bali) commissions infrastructure projects across four districts. Some of those projects finish on schedule. Many don't. The question my S2 (master's) thesis aimed to answer was a planning question disguised as a prediction question: given the contract attributes, the contractor profile, and the project characteristics at kickoff, can we tell which projects are likely to be late?
A working answer would let the office triage upcoming projects before they break ground, allocate oversight to the higher-risk ones, and frame conversations with contractors using a quantitative signal rather than gut feel.
The data
The available source data was synthetic, generated to match the distribution of real projects across the four Bangli districts from 2021 to 2024. 484 projects total, with 16 raw fields per project covering contract value, contract duration, contractor identity, project type, district, year, primary obstacle category, and outcome status.
The class distribution was 78/22 (78% on-time, 22% delayed). That imbalance mattered for everything that came after.
Pipeline
End-to-end Python, structured as five stages.
1. Preprocessing
Standard cleanup: type coercion, missing value handling, outlier checks against the contract value distribution. Indonesian field names normalized to consistent snake_case for downstream code clarity.
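A minimal sketch of the cleanup, with illustrative column names (the dataset's actual schema isn't reproduced here):

```python
import pandas as pd

df = pd.read_csv("proyek_pupr.csv")  # hypothetical filename

# Normalize Indonesian headers to consistent snake_case.
df.columns = (
    df.columns.str.strip()
              .str.lower()
              .str.replace(r"[\s/]+", "_", regex=True)
)

# Type coercion: contract value and planned duration must be numeric.
for col in ["nilai_kontrak", "durasi_rencana"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Missing values: rows without an outcome label can't be used for training.
df = df.dropna(subset=["status_proyek"])

# Outlier check: flag contract values outside the 1st-99th percentile for review.
lo, hi = df["nilai_kontrak"].quantile([0.01, 0.99])
flagged = df[(df["nilai_kontrak"] < lo) | (df["nilai_kontrak"] > hi)]
```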
2. Feature engineering
Two derived features that materially helped the model:
- Nilai_Kontrak_per_Hari (contract value per day): the contract value divided by the planned duration. This proxies project intensity: a high-value, short-duration project is structurally riskier than a same-value project with a longer planned timeline.
- Log_Nilai_Kontrak (log contract value): a log transform of contract value to compress the long right tail. Contract values in the dataset spanned three orders of magnitude, which would have let the largest projects dominate split-finding in the trees.
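In pandas, both derived features are one-liners. A sketch, continuing from the cleanup step's assumed column names:

```python
import numpy as np

# Contract value per planned day: a proxy for project intensity.
df["nilai_kontrak_per_hari"] = df["nilai_kontrak"] / df["durasi_rencana"]

# Log transform compresses the long right tail of contract values.
# log1p is used here for safety near zero; the thesis may have used plain log.
df["log_nilai_kontrak"] = np.log1p(df["nilai_kontrak"])
```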
3. Encoding and split
Eight categorical features were label-encoded. The split was a stratified 80/20 train/test split on the outcome label, so both sides preserved the 78/22 class distribution. Without stratification, a small test set risked having too few delayed cases for an honest evaluation.
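Sketched below, continuing from the cleaned DataFrame; the feature list and the label value are assumptions, not the thesis's exact schema:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

categorical_cols = [
    "kecamatan", "jenis_proyek", "kontraktor", "kendala_utama",
    # ...the remaining four categorical fields in the real pipeline
]
for col in categorical_cols:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))

X = df.drop(columns=["status_proyek"])
y = (df["status_proyek"] == "Terlambat").astype(int)  # 1 = delayed (assumed label value)

# Stratify on y so train and test both keep the 78/22 class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```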
4. Hyperparameter tuning
RandomizedSearchCV over 100 iterations with 5-fold StratifiedKFold cross-validation, optimizing F1 rather than accuracy.
The choice of F1 was deliberate. The first model I tried, with default hyperparameters, hit 78% accuracy and looked like a good result on paper. It wasn't. The model was predicting "on-time" for almost every project, which on a 78/22 dataset trivially gets you 78% accuracy. F1 collapsed below 0.4. The model had learned the prior, not the signal.
class_weight='balanced' plus F1-driven tuning fixed it. The final model carried the imbalance through training and got rewarded for actually catching delays.
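The setup, sketched with an illustrative search space (the thesis's actual distributions may differ):

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

param_distributions = {
    "n_estimators": randint(100, 800),
    "max_depth": randint(3, 30),
    "min_samples_split": randint(2, 20),
    "min_samples_leaf": randint(1, 10),
    "max_features": ["sqrt", "log2"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_distributions,
    n_iter=100,
    scoring="f1",  # reward catching delays, not matching the prior
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    n_jobs=-1,
    random_state=42,
)
search.fit(X_train, y_train)
model = search.best_estimator_
```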
5. Final model
Random Forest, tuned. 73% accuracy, 0.83 F1 score, evaluated on the held-out test set. The accuracy number is lower than the naive baseline because the tuned model is actively predicting the minority class instead of suppressing it. That tradeoff is the entire point.
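Evaluation on the held-out set is standard scikit-learn; a short sketch:

```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

y_pred = model.predict(X_test)
print(f"accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(f"F1:       {f1_score(y_test, y_pred):.2f}")
print(classification_report(y_test, y_pred, target_names=["on-time", "delayed"]))
```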
What predicted delays
Feature importance from the final model, ranked:
- Kendala Utama (primary obstacle category). The strongest predictor by a clear margin. Projects flagged with land acquisition or coordination obstacles at kickoff were structurally likely to slip.
- Nilai Kontrak (contract value). Larger contracts tended to be more delay-prone, partly mediated by the next feature.
- Durasi Rencana (planned duration). Both extremes correlated with delay risk: very short timelines for the work scope, and very long timelines that increased exposure to upstream changes.
Six publication-ready visualizations went into the thesis: confusion matrix, cross-validation score distribution, feature importance bar chart, correlation heatmap, ROC curve, and precision-recall curve. The feature importance chart was the one that landed best with the office because it translated directly into a triage rubric.
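The ranking itself comes straight off the trained forest; a minimal sketch:

```python
import pandas as pd

importances = pd.Series(model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(3))
# Expected ordering: kendala_utama, nilai_kontrak, durasi_rencana
```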
The iteration story
The final binary classifier wasn't the first model I tried. The path through three approaches matters because it shows how client feedback reshapes the framing:
- Regressor (predict delay duration in days). R² of 0.085 on the test set. The model could not learn duration as a continuous target on this dataset; the distribution of delay days was too long-tailed and zero-inflated. Reported back, recommended a different framing.
- Four-class multiclass classifier (no delay / minor / moderate / major). 28% accuracy. Indistinguishable from random on a four-class problem. The office's natural thresholds (what counts as "minor" versus "moderate") didn't match anything the data could separate.
- Binary classifier (delayed yes/no). 73% accuracy, 0.83 F1. The framing the data could support, and the framing the office could act on.
Communicating this iteration honestly was load-bearing. The temptation in a thesis context is to present only the working model. The office got more value from understanding why two earlier attempts didn't work, because that informed where to invest in better data collection going forward.
Serving the model
A FastAPI backend wrapped the trained model with two endpoints.
# POST /predict
# Request body: integer IDs for categorical inputs, plain floats/ints for numeric
{
"districtId": 2,
"projectTypeId": 5,
"contractorId": 14,
"primaryObstacleId": 3,
"contractValue": 1_250_000_000,
"plannedDurationDays": 180,
"year": 2024
}
# Response envelope
{
"status": "success",
"message": "Prediction generated.",
"data": {
"prediction": 1, # 0 = on-time, 1 = delayed
"probabilities": { "on_time": 0.31, "delayed": 0.69 },
"topFeatures": [
{ "name": "Kendala Utama", "importance": 0.34 },
{ "name": "Nilai Kontrak", "importance": 0.21 },
{ "name": "Durasi Rencana", "importance": 0.17 }
]
},
"errors": null
}
# GET /predict/metadata
# Returns dropdown options for the frontend, indexed by stable integer IDs.
{
"districts": [{ "id": 1, "name": "Bangli" }, ... ],
"projectTypes": [{ "id": 1, "name": "Jalan" }, ... ],
"contractors": [{ "id": 1, "name": "..." }, ... ],
"primaryObstacles": [{ "id": 1, "name": "Pembebasan Lahan" }, ... ]
}
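A minimal sketch of the serving layer (Pydantic v2; file paths and the feature mapping are illustrative, and the topFeatures block is omitted for brevity):

```python
import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # the tuned Random Forest, hypothetical path

class PredictRequest(BaseModel):
    districtId: int
    projectTypeId: int
    contractorId: int
    primaryObstacleId: int
    contractValue: float
    plannedDurationDays: int
    year: int

@app.post("/predict")
def predict(req: PredictRequest):
    # The real pipeline maps request fields onto the training feature layout
    # and recomputes the derived features (value-per-day, log-value).
    row = pd.DataFrame([req.model_dump()])
    proba = model.predict_proba(row)[0]
    return {
        "status": "success",
        "message": "Prediction generated.",
        "data": {
            "prediction": int(proba[1] >= 0.5),  # 0 = on-time, 1 = delayed
            "probabilities": {
                "on_time": round(float(proba[0]), 2),
                "delayed": round(float(proba[1]), 2),
            },
        },
        "errors": None,
    }
```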
Two contract decisions worth surfacing:
- Integer IDs for categorical inputs. The natural alternative was string keys ("district": "Bangli"). I pushed for IDs because the frontend developer was new to the project's terminology and the dataset had real case-sensitivity quirks ("Bangli" versus "bangli" versus null) that would have surfaced as production bugs. IDs sidestep all of that.
- Envelope response format. Every response carries status, message, data, and errors. Slightly more typing on the consumer side, but much easier debugging when something goes wrong, because the failure shape is consistent across endpoints.
The whole service runs in Docker (Uvicorn + FastAPI behind a simple gateway) and is documented in a README that covers setup, the request/response examples above, and a model card describing performance, training data, and known limitations.
Lessons
- Imbalanced classification needs F1 from the start. Default accuracy will reward you for predicting the prior. Switching to F1, plus class_weight='balanced', plus stratified CV, makes the model actually solve the problem instead of pretending the problem doesn't exist.
- Iteration is the deliverable, not just the final model. The office got value from understanding why the regressor and the four-class classifier failed, because that mapped to "what kinds of data would we need to support those framings." A clean final number with no story is less useful than a useful number with the path that led to it.
- Synthetic data is a starting condition, not an excuse. The dataset was synthetic by necessity (real records weren't yet centralized). The model card documents this clearly and frames the F1 as a methodology demonstration. The next phase of the project, after the thesis, is collecting real outcomes against the model's predictions to refine on actual data.
- API contracts deserve real design time. The 30 minutes spent locking down envelope shape, integer IDs, and field naming saved hours of back-and-forth with the frontend developer. Picking these conventions early and writing them down is one of the highest-leverage things a backend developer does on a small team.
