The problem
Most book recommendation systems rely on collaborative filtering ("users who bought X also bought Y") or simple keyword matching. Both approaches miss the nuance of what makes a reader genuinely enjoy a book. Someone searching for "a story about overcoming loss" needs semantic understanding, not a title keyword match. And two books in the same genre can feel completely different depending on the emotional tone.
I wanted to build a system that understands what a book is about and how it feels, then uses both signals to surface recommendations that actually resonate.
How it works
The system stacks three AI models into a single recommendation pipeline:
1. Semantic search with OpenAI embeddings + ChromaDB
Every book description in the dataset gets transformed into a 1,536-dimensional vector using OpenAI's text-embedding-ada-002 model. These vectors live in a ChromaDB instance that persists locally, so the embedding API only gets called once during the initial indexing. After that, all similarity searches run against the local vector store with zero API cost.
When a user types a natural language query like "books about finding purpose after retirement," the system embeds that query and retrieves the top 50 semantically similar books from ChromaDB via LangChain's similarity search interface.
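A minimal sketch of this retrieval step, assuming LangChain's current package layout (`langchain_openai`, `langchain_chroma`) and a store persisted under `data/chroma` (the path is my assumption):

```python
def semantic_search(query: str, k: int = 50):
    """Embed the query and pull the k nearest book descriptions from Chroma."""
    # Imports kept inside the function so the module loads without these
    # packages installed; an OPENAI_API_KEY is needed at call time.
    from langchain_openai import OpenAIEmbeddings
    from langchain_chroma import Chroma

    embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
    store = Chroma(persist_directory="data/chroma", embedding_function=embeddings)
    return store.similarity_search(query, k=k)
```

Note that only the query string gets embedded here; the 5,200 book vectors are already on disk.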
2. Emotion-based filtering with DistilRoBERTa
During the data preparation phase, every book description passes through j-hartmann/emotion-english-distilroberta-base, a fine-tuned transformer that classifies text across seven emotions; the pipeline keeps five of them as filter dimensions: joy, sadness, fear, anger, and surprise. Each dimension gets a score between 0.0 and 1.0.
These emotion scores are pre-computed and stored alongside the book metadata. At query time, if a user selects a tone filter (Happy, Sad, Suspenseful, etc.), the results from the semantic search get re-sorted by the matching emotion score. A "suspenseful" filter sorts by fear score descending. A "happy" filter sorts by joy score descending.
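The re-sort itself is just a descending sort over the stored scores. A toy version (tone names and book fields are illustrative, not the project's actual schema):

```python
# Map each UI tone to the pre-computed emotion column it sorts by.
TONE_TO_EMOTION = {
    "Happy": "joy",
    "Sad": "sadness",
    "Suspenseful": "fear",
    "Angry": "anger",
    "Surprising": "surprise",
}

def apply_tone_filter(books: list[dict], tone: str, final_top_k: int = 16) -> list[dict]:
    """Re-rank semantic-search results by the emotion matching the tone."""
    emotion = TONE_TO_EMOTION[tone]
    ranked = sorted(books, key=lambda b: b[emotion], reverse=True)
    return ranked[:final_top_k]

books = [
    {"title": "Quiet Shore", "joy": 0.9, "fear": 0.1},
    {"title": "The Cellar", "joy": 0.2, "fear": 0.8},
]
print(apply_tone_filter(books, "Suspenseful")[0]["title"])  # The Cellar
```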
3. Zero-shot category classification with BART-MNLI
The original Kaggle dataset had inconsistent category labels. Instead of manually re-labeling 7,000 entries, I ran every book through facebook/bart-large-mnli for zero-shot classification into four clean categories: Fiction, Nonfiction, Children's Fiction, and Children's Nonfiction. This gave the frontend a reliable filter axis without any manual annotation work.
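With Hugging Face's `transformers`, that labeling step looks roughly like this (a sketch; in the notebook the pipeline would be constructed once and run in a batch loop, not per call):

```python
def classify_category(description: str) -> str:
    """Zero-shot label a book description into one of four fixed categories."""
    # Import inside the function; the ~1.6 GB model downloads on first use.
    from transformers import pipeline

    classifier = pipeline("zero-shot-classification",
                          model="facebook/bart-large-mnli")
    labels = ["Fiction", "Nonfiction",
              "Children's Fiction", "Children's Nonfiction"]
    result = classifier(description, candidate_labels=labels)
    return result["labels"][0]  # labels come back sorted by score, best first
```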
The pipeline at query time
The pipeline runs fast because the expensive operations (embedding all 5,200 books, computing emotion scores, running zero-shot classification) happen once during data preparation. At query time, the only API call is embedding the user's query string, which takes under 200ms.
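Put together, the query-time flow reduces to one embedding call plus in-memory filtering. A schematic version with the vector search injected as a function (all names and defaults here are assumptions):

```python
# Hypothetical tone-to-emotion mapping; only the sort key changes per tone.
EMOTION_FOR_TONE = {"Happy": "joy", "Sad": "sadness", "Suspenseful": "fear",
                    "Angry": "anger", "Surprising": "surprise"}

def recommend(query, search_fn, category=None, tone=None,
              initial_top_k=50, final_top_k=16):
    """Vector search (the only API call), then cheap local filtering."""
    books = search_fn(query, k=initial_top_k)
    if category:
        books = [b for b in books if b["category"] == category]
    if tone:
        books.sort(key=lambda b: b[EMOTION_FOR_TONE[tone]], reverse=True)
    return books[:final_top_k]

# Stub standing in for the ChromaDB lookup, to show the flow end to end.
fake_results = [{"title": "A", "category": "Fiction", "joy": 0.3},
                {"title": "B", "category": "Fiction", "joy": 0.9}]
top = recommend("feel-good stories", lambda q, k: list(fake_results),
                category="Fiction", tone="Happy", final_top_k=1)
print(top[0]["title"])  # B
```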
Data preparation
The raw dataset from Kaggle contained about 7,000 books with metadata (title, author, description, categories, ratings, cover images). Cleaning brought that down to roughly 5,200 usable entries after removing books with missing descriptions, duplicate ISBNs, and entries too short to produce meaningful embeddings.
The preparation pipeline runs through five Jupyter notebooks:
| Notebook | Purpose |
|---|---|
| data-exploration.ipynb | EDA, missing value analysis, distribution checks |
| sentiment-analysis.ipynb | Emotion scoring with DistilRoBERTa across all books |
| text-classification.ipynb | Zero-shot category labeling with BART-MNLI |
| vector-search.ipynb | Building and persisting the ChromaDB vector store |
| test-and-explore.ipynb | End-to-end query testing and result validation |
Each notebook produces artifacts that feed into the next. The final output is a tagged CSV plus a persisted ChromaDB directory, both stored in the backend's data/ folder.
Architecture
The system splits into two services:
Backend (FastAPI + Uvicorn)
Six REST endpoints handle everything from basic book lookups to the core recommendation logic. The /recommendations endpoint accepts a JSON body with the query string, optional category filter, optional tone filter, and two parameters controlling retrieval depth (initial_top_k for the vector search and final_top_k for the returned results).
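That request body can be modeled as a Pydantic schema along these lines (field names and defaults are my guesses at the shape, not the project's exact contract):

```python
from typing import Optional

from pydantic import BaseModel

class RecommendationRequest(BaseModel):
    """JSON body accepted by POST /recommendations (assumed field names)."""
    query: str
    category: Optional[str] = None
    tone: Optional[str] = None
    initial_top_k: int = 50  # depth of the ChromaDB similarity search
    final_top_k: int = 16    # number of books returned to the client

req = RecommendationRequest(query="books about found family")
print(req.initial_top_k)  # 50
```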
The recommendation service initializes the ChromaDB connection on startup, loading the persisted vector store from disk. If the store doesn't exist yet, it rebuilds from the tagged descriptions file using LangChain's Chroma.from_documents().
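That load-or-rebuild startup logic might look roughly like this (the paths and the CSV column name are assumptions):

```python
import csv
import os

def load_vector_store(persist_dir="data/chroma",
                      tagged_csv="data/books_tagged.csv"):
    """Return the persisted Chroma store, rebuilding it if it's missing."""
    from langchain_openai import OpenAIEmbeddings
    from langchain_chroma import Chroma
    from langchain_core.documents import Document

    embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
    if os.path.isdir(persist_dir):
        # Cheap path: reuse the vectors already on disk, no embedding calls.
        return Chroma(persist_directory=persist_dir,
                      embedding_function=embeddings)

    # Expensive path: embed every tagged description once and persist.
    with open(tagged_csv, newline="", encoding="utf-8") as f:
        docs = [Document(page_content=row["tagged_description"], metadata=row)
                for row in csv.DictReader(f)]
    return Chroma.from_documents(docs, embeddings,
                                 persist_directory=persist_dir)
```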
Frontend (React 19 + Vite 7 + Tailwind CSS v4)
The UI gives users three control surfaces: a natural language search bar, a category dropdown, and a tone selector. Results render as book cards showing cover images, titles, authors, ratings, and truncated descriptions. The search bar also supports real-time autocomplete via the /search endpoint for users who prefer browsing by title or author.
What I learned
Building this project clarified a few things about combining multiple AI models in one system:
Pre-computing everything possible is the key to keeping inference cheap. The emotion scores and category labels are static properties of each book. Computing them at query time would have added 3+ seconds of transformer inference per request. Doing it once during data prep and storing the results turned the runtime pipeline into a vector lookup plus some array sorting.
Local persistence for ChromaDB eliminated recurring embedding costs. The first run embeds all 5,200 descriptions (roughly $0.50 in API calls). After that, the vector store loads from disk in under a second. Without persistence, every server restart would re-embed the entire dataset.
Zero-shot classification is surprisingly practical for messy categorical data. The BART-MNLI model gave clean, consistent labels across the entire dataset with no training data and no manual labeling. For a dataset this size, that saved hours of annotation work.
Tech stack
| Layer | Technology |
|---|---|
| Embeddings | OpenAI text-embedding-ada-002 |
| Vector store | ChromaDB with LangChain integration |
| Emotion model | j-hartmann/emotion-english-distilroberta-base |
| Category model | facebook/bart-large-mnli (zero-shot) |
| Data processing | Pandas, NumPy, Jupyter |
| Backend | FastAPI, Uvicorn, Pydantic |
| Frontend | React 19, Vite 7, Tailwind CSS v4, Axios |
| Dataset | 7K Books with Metadata (Kaggle) |
