Overview
Udayana University has eight faculties, fifty-plus study programs, and an FAQ page that nobody on staff had touched in years. Students trying to confirm whether a specific program accepts gap-year applicants would either click through seven policy PDFs or call the admin office during the four-hour window it was actually staffed.
Generic chatbots weren't an option. A model that hallucinates a deadline sends a student to register for the wrong semester. Wrong answers here have real consequences: missed applications, wrong course loads, forfeited scholarships.
The system needed to answer from actual university documents, in Indonesian (the language of the questions), and admit when it didn't know. That meant RAG grounded in official data, not a fine-tuned model making confident guesses.
Architecture
Flask backend, LangChain orchestration, OpenAI embeddings stored in ChromaDB, GPT-3.5 Turbo for generation. Five stages, each small enough to swap when I found a better approach.
Document ingestion was the first real problem. University policy PDFs interleave tables, bullet lists, and dense prose, and a naive splitter chops sentences mid-thought. RecursiveCharacterTextSplitter with a priority list of separators (paragraph breaks, then line breaks, then sentence boundaries, then spaces) plus a 200-character overlap preserved enough context across chunk boundaries for retrieval to stitch multi-part answers back together.
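The core idea is small enough to sketch in plain Python. This is not LangChain's actual implementation, just the mechanism: cut near the size limit at the highest-priority separator available, then step back for the overlap. The separator priority and 200-character overlap are the only parts taken from the setup above; the rest is a minimal sketch.

```python
def chunk_text(text, chunk_size=1000, overlap=200,
               separators=("\n\n", "\n", ". ", " ")):
    """Greedy chunker: cut near chunk_size, preferring the highest-priority
    separator found inside the window; carry `overlap` chars into the next
    chunk so answers split across a boundary stay recoverable."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Prefer a clean break: paragraph > line > sentence > word.
            for sep in separators:
                cut = text.rfind(sep, start, end)
                if cut > start:
                    end = cut + len(sep)
                    break
        chunks.append(text[start:end])
        if end >= len(text):
            break
        # Step back by `overlap`, but always make forward progress.
        start = max(end - overlap, start + 1)
    return chunks
```

The `start + 1` guard matters: without it, an overlap close to the chunk size can stall the loop on pathological input.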
OpenAI's text-embedding-ada-002 handled the encoding into 1536-dimensional semantic vectors (meaning-based, not keyword-based). ChromaDB persisted them on disk with a directory structure, so knowledge-base updates didn't require reindexing the world. Top-5 similarity search feeds the retriever, and GPT-3.5 Turbo generates the final response from the retrieved context plus conversation history.
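Under the hood, top-5 similarity search is just cosine similarity between the query vector and every stored chunk vector. A toy sketch with stub vectors — in the real system, text-embedding-ada-002 produces the 1536-dim vectors and ChromaDB does this ranking internally:

```python
import numpy as np

def top_k_indices(query_vec, doc_vecs, k=5):
    """Return indices of the k chunks most similar to the query (cosine)."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                       # cosine similarity per chunk
    return np.argsort(sims)[::-1][:k]  # highest similarity first
```

With normalized vectors, cosine similarity reduces to a dot product, which is why vector stores can serve this at interactive latency even without a GPU.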
Grounding over generation
The prompt template is the hinge that makes this pipeline trustworthy instead of creative. Three explicit directives:
- Answer only from the retrieved context. If the context doesn't contain the answer, say so. No synthesis from training data.
- Respond in Indonesian. Students ask in Indonesian. The model defaults to English unless steered explicitly.
- Format for readability. Short paragraphs, bullet lists for multi-step processes. Default LLM output is a wall of text.
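The three directives condense into one template string. The wording below is illustrative, not the production prompt, but the shape is the point: the grounding rule, the language rule, and the formatting rule all live in the same place, and the retrieved chunks are injected as the only permitted context.

```python
# Illustrative template, not the production prompt.
PROMPT_TEMPLATE = """Anda adalah asisten informasi Universitas Udayana.
Jawab HANYA berdasarkan konteks di bawah ini. Jika jawabannya tidak ada
dalam konteks, katakan bahwa Anda tidak mengetahuinya.
Jawab dalam Bahasa Indonesia, dengan paragraf pendek dan daftar berpoin
untuk proses yang bertahap.

Konteks:
{context}

Pertanyaan: {question}
Jawaban:"""

def build_prompt(context_chunks, question):
    # Retrieved chunks are joined into one context block before formatting.
    return PROMPT_TEMPLATE.format(
        context="\n---\n".join(context_chunks), question=question
    )
```

Keeping the template as a single reviewable string also means staff can audit exactly what the model is told, which matters for the trust argument below.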
The "say when you don't know" part was counterintuitive at first. It means the bot loses answers instead of faking them. But it's exactly why staff were willing to route students here for routine questions. A bot that bluffs is worse than no bot at all.
Why RAG over fine-tuning
Fine-tuning would have meant thousands of Q&A pairs, weeks of training iterations, and retraining every time the university amended a policy. RAG needed the raw documents and a chunking strategy. Updates became "edit the source file, rebuild the vector index". No model surgery.
RAG also gives source transparency. When the chatbot answers, the retrieved chunks are attributable. Staff can verify the response was grounded in the current student handbook, not a 2019 version the model happened to memorize during pretraining.
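The transparency piece is mechanical: return the retrieved chunks' source metadata alongside the generated answer, so every response carries its own audit trail. A sketch of the seam (the function names are mine, not the project's; retrieval and generation are injected as callables so the grounding contract is visible):

```python
def answer_with_sources(question, retrieve, generate):
    """retrieve(question) -> list of {"text": ..., "source": ...} dicts.
    generate(question, chunks) -> answer string grounded in those chunks.
    The response always names the documents the answer came from."""
    chunks = retrieve(question)
    answer = generate(question, chunks)
    return {
        "answer": answer,
        "sources": sorted({c["source"] for c in chunks}),  # de-duplicated
    }
```

Staff verifying an answer check `sources` first; if the current student handbook isn't in the list, the answer is suspect regardless of how fluent it reads.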
Numbers
Answer quality was checked against a curated question set spanning the official student handbook, registration guides, and scholarship pages. First-month results at deployment: 5,000+ interactions, 85% user satisfaction, and a 40% reduction in routine admin inquiries. Resource footprint mattered too, because the deployment target was a modest campus server, not a GPU cluster: the whole system runs as a single Python process with ChromaDB persisted alongside it.
What I learned
- RAG beats fine-tuning for institutional knowledge. When data changes and accountability matters, retrieval is the right abstraction.
- Grounding is a product feature. "I don't know" is a useful answer. Forcing the model to admit gaps is what made staff trust the system.
- Prompt language matters. If users ask in Indonesian, the model must be told explicitly to answer in Indonesian. Defaults bias toward English.
- Chunking is where quality dies. Tuning chunk size and overlap deserves as much attention as choosing the LLM.
- Ship the source. Attributable retrievals make staff adoption easier. They can verify before trusting.
Full write-up with architecture notes and the Indonesian-prompt-engineering details is on Medium. Source on GitHub.
