
MSIB Program Sentiment Analysis: 2,000 tweets on Indonesia's internship program

Sentiment pipeline for the Indonesian MSIB program across 2021-2024. Sastrawi + NLTK for Indonesian-specific preprocessing, fine-tuned RoBERTa for classification, Power BI dashboard for program administrators.

Date: June 2024
Status: Completed
Version: v1.0

Overview

MSIB is the Indonesian government's nationwide internship program, placing hundreds of thousands of students in private-sector positions each cycle. Program leadership had a question that traditional feedback channels couldn't answer: is the program getting better or worse over time, and why? Post-program surveys captured a curated slice of voices. The honest, unfiltered version lived on Twitter.

I scraped four years of public tweets mentioning MSIB and "magang" (2021 to 2024), ran them through a sentiment pipeline tuned for Indonesian, and built a Power BI dashboard that program administrators could actually navigate. The goal was a tool that answered "what are participants really saying?" without requiring a data analyst to interpret it.

Why this is harder in Indonesian

Most sentiment analysis tutorials assume English input. Indonesian breaks those assumptions in three specific ways:

  • Morphology is affixation-heavy. The root word ajar (learn) spawns belajar, mengajar, pembelajaran, pelajari, diajari, and more. Without proper stemming, each form is treated as a separate token and semantic links are lost.
  • Social Indonesian is informal and mixed. Tweets contain slang (gue, lu, anjir), abbreviations (yg, dgn, blm), and English code-switching (internship, anxiety, network). English-trained models miss all of it.
  • Pretrained Indonesian models are rare. You can't just grab a stock English checkpoint like distilbert-base from Hugging Face and expect it to work on Bahasa Indonesia.

The pipeline had to address all three, not just pick a model.

Pipeline

Four stages, each handling one failure mode of generic NLP tooling on Indonesian:

  • 2,000+ tweets collected
  • 4 years (2021-2024)
  • 3 sentiment classes
  • SMSA fine-tune corpus

1. Collection. Twitter/X API with queries targeting MSIB and "magang" keywords, separated into year-bucketed CSVs so temporal analysis stayed clean. magang2021.csv, magang2022.csv, and so on. Raw archives preserved for reproducibility.
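The year-bucketing step is plain bookkeeping once the raw tweets are in hand. A minimal sketch with the standard library (the `created_at`/`text` field names are illustrative, not the exact Twitter/X API schema):

```python
import csv
from collections import defaultdict
from datetime import datetime

def bucket_by_year(rows, out_dir="."):
    """Split raw tweet rows into magang<year>.csv files.

    `rows` is an iterable of dicts with at least 'created_at' (ISO 8601)
    and 'text' keys -- field names here are illustrative, not the exact
    schema returned by the Twitter/X API.
    """
    buckets = defaultdict(list)
    for row in rows:
        year = datetime.fromisoformat(row["created_at"]).year
        buckets[year].append(row)

    for year, bucket in buckets.items():
        path = f"{out_dir}/magang{year}.csv"
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=bucket[0].keys())
            writer.writeheader()  # one CSV per year keeps temporal slices clean
            writer.writerows(bucket)
    return {year: len(b) for year, b in buckets.items()}
```

Writing one file per year (rather than one big CSV with a year column) keeps each temporal slice independently reloadable and the raw archives easy to diff.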

2. Preprocessing. Custom normalization pipeline: lowercasing, URL + mention + hashtag stripping, punctuation cleanup. NLTK supplied the Indonesian stopword list (after adding social-media-specific filler I curated myself). Sastrawi handled stemming, which is non-negotiable for Indonesian where affixes carry most of the morphological weight.
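The normalization stage looks roughly like this sketch. It is stdlib-only for brevity: the stopword set here is a tiny illustrative subset (the real pipeline merges NLTK's Indonesian list with the curated social-media extension), and the Sastrawi stemming call is shown as a comment where it slots in:

```python
import re

# Tiny illustrative stopword set; the real pipeline merges NLTK's
# Indonesian list with a hand-curated social-media extension
# (yg, dgn, blm, gue, lu, ...).
STOPWORDS = {"yang", "dan", "di", "ke", "yg", "dgn", "blm", "gue", "lu"}

def normalize(tweet: str) -> list:
    """Lowercase, strip URLs/mentions/hashtags/punctuation, drop stopwords."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)   # strip URLs
    text = re.sub(r"[@#]\w+", " ", text)        # strip mentions + hashtags
    text = re.sub(r"[^a-z\s]", " ", text)       # punctuation + digit cleanup
    tokens = [t for t in text.split() if t not in STOPWORDS]
    # In the full pipeline each surviving token then goes through Sastrawi,
    # so belajar/mengajar/pembelajaran all collapse toward the root ajar:
    #   stemmer = StemmerFactory().create_stemmer()
    #   tokens = [stemmer.stem(t) for t in tokens]
    return tokens
```

Order matters: URLs must be stripped before the punctuation pass, or `https://t.co/x` shreds into junk tokens like `https`, `t`, `co`.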

3. Classification. The ayameRushia/roberta-base-indonesian-1.5G-sentiment-analysis-smsa model from Hugging Face. A RoBERTa variant specifically fine-tuned on the SMSA (Sentiment in Social Media for Analysis) corpus. Using a generic English model and translating wouldn't have worked: Indonesian sentiment nuance gets flattened in translation. The model outputs label + confidence, and I kept confidence scores for downstream filtering.
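The classifier call itself is a standard Hugging Face `pipeline(...)` invocation; the part worth showing is the confidence bookkeeping around it. A minimal sketch, where `classify` stands in for the model call and the 0.7 threshold is illustrative rather than the project's tuned value:

```python
def label_tweets(tweets, classify, min_confidence=0.7):
    """Attach label + confidence to each tweet, flagging low-confidence rows.

    `classify` is any callable returning {"label": ..., "score": ...} per
    text. In the project this wraps:
      transformers.pipeline("sentiment-analysis",
        model="ayameRushia/roberta-base-indonesian-1.5G-sentiment-analysis-smsa")
    The 0.7 threshold is illustrative, not the project's exact value.
    """
    results = []
    for text in tweets:
        out = classify(text)
        results.append({
            "text": text,
            "label": out["label"],
            "confidence": out["score"],
            "keep": out["score"] >= min_confidence,  # downstream filter flag
        })
    return results
```

Keeping the raw score (instead of thresholding destructively) means the dashboard can expose the filter as a slider rather than baking one cutoff into the data.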

4. Visualization. Power BI for the dashboard because the target audience (program administrators, not analysts) uses Microsoft's stack daily. Donut chart for sentiment distribution, line chart for 2021 to 2024 trends, word clouds for positive vs negative themes, and a filterable tweet explorer for individual drill-down.

What the data said

Positive tweets clustered around keywords like learning experience, skill development, networking. Negative tweets clustered around application process, selection difficulty, workload. The specificity matters: "workload" is actionable for program design in a way that "I hated it" isn't.
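The per-sentiment keyword clusters behind the word clouds reduce to frequency counts over the preprocessed tokens. A sketch (the `tokens`/`label` field names are illustrative):

```python
from collections import Counter

def top_terms(labeled_tweets, label, k=10):
    """Most frequent tokens among tweets carrying a given sentiment label --
    the counts feeding the dashboard's per-sentiment word clouds.

    Expects rows like {"tokens": [...], "label": "positive"}; field names
    are illustrative, not the project's exact schema.
    """
    counts = Counter()
    for row in labeled_tweets:
        if row["label"] == label:
            counts.update(row["tokens"])
    return counts.most_common(k)
```

Counting stemmed tokens (not raw text) is what makes the clusters readable: all the affixed forms of a complaint collapse into one bar instead of a dozen.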

The temporal trend was genuinely useful. Positive sentiment rose from 2021 to 2024, consistent with narrative improvements to the program (more company variety, better onboarding, clearer evaluation rubrics). The dashboard made this trend visible without requiring administrators to run queries.

Dashboard design choices

Three Power BI design choices worth pulling out:

  • Summary page first. Program administrators open a dashboard with one question in mind. The landing page answered it in three charts. Drill-down exists but isn't required.
  • Tooltip contextual help. Each visualization has a tooltip explaining what it shows and how to read it. Lowered the learning curve for non-technical users significantly.
  • DAX over raw filters. Performance on 2,000+ rows with multiple cross-filters degraded fast with naive relationships. Proper DAX measures for sentiment counts and period-over-period deltas made interactions instant.

What I learned

  • Language-specific pipelines aren't optional. Sastrawi + NLTK + Indonesian RoBERTa is a non-substitutable stack for Indonesian sentiment work. Generic English tools produce confident nonsense.
  • Stopwords are curated, not inherited. The NLTK Indonesian list was a starting point. Social media needed extensions (emoji handlers, abbreviation normalization, slang dictionaries) to get clean tokens.
  • Temporal is where insight lives. A single-snapshot distribution is barely interesting. The 2021→2024 delta is where administrators found decisions.
  • Design dashboards for the audience you have. Power BI because that's what the program office used. Interactive but opinionated: the summary page does the work before the user even filters.
  • Model choice beats feature engineering for classification. Fine-tuned Indonesian RoBERTa outperformed hand-crafted features on TF-IDF + classical classifiers by a wide margin. The right pretrained model is usually the best 30 minutes you'll spend.

Full write-up with the preprocessing function and dashboard screenshots is on Medium. Source on GitHub.
