title: "Building a search index in 200 lines of TypeScript"
date: 2025-02-02
tags: [engineering]
reading_time: 12 min
slug: building-a-search-index-in-200-lines-of-typescript
---
Building a search index in 200 lines of TypeScript
Every few months I see someone reach for Elasticsearch to add search to a site that has maybe three thousand documents. This is roughly equivalent to buying a crane to hang a picture.
Most sites need three things from search: find documents containing a word, rank by relevance, make it fast enough that nobody notices. All three fit comfortably in ~200 lines of TypeScript.
The inverted index
Forget everything about databases for a second. The core data structure in search is embarrassingly simple:
type DocId = string // assumption: any stable unique key works
type Index = Map<string, Map<DocId, number>>
// term -> { docId -> frequency }
For every word, you keep a map of the documents containing it and how often it occurs in each. That's it. Building it is two nested loops: one over your documents, one over the terms in each.
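// Assumed document shape; the index only relies on `id` and `text`.
interface Doc { id: DocId; text: string }
function buildIndex(docs: Doc[]): Index {
  const idx: Index = new Map()
  for (const doc of docs) {
    for (const term of tokenize(doc.text)) {
      // Fetch this term's postings, creating them the first time the term is seen.
      const postings = idx.get(term) ?? new Map<DocId, number>()
      // Bump the term's frequency for this document.
      postings.set(doc.id, (postings.get(doc.id) ?? 0) + 1)
      idx.set(term, postings)
    }
  }
  return idx
}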
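buildIndex calls a tokenize function that isn't shown. Here is a minimal sketch of what it might look like, assuming ASCII-only English text: lowercase everything, split on anything that isn't a letter or digit, drop the empty strings. The splitting rule is my assumption, not a requirement, and it will mangle accented or non-Latin text.

```typescript
// Hypothetical tokenizer: lowercase, then split on runs of non-alphanumeric characters.
// Assumes ASCII text; accented and non-Latin letters are treated as separators.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0) // split() leaves empty strings at the edges
}
```

Whatever rule you pick, run queries through the same function as documents. Otherwise "Crane" in the search box never matches "crane" in the index.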
Searching is just a set intersection of the postings lists, one per query term: a document matches if it shows up in all of them. If that sentence sounded scary in school, it's worth noticing you already use Map and Set.
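As a rough sketch, the intersection could look like the function below. The name search, the choice to take already-tokenized terms, and DocId being a string are all my assumptions. Starting from the shortest postings list is an optional optimisation, not something the approach requires.

```typescript
type DocId = string // assumption: any stable unique key works
type Index = Map<string, Map<DocId, number>>

// Return the ids of documents that contain every query term (AND semantics).
function search(idx: Index, terms: string[]): Set<DocId> {
  const result = new Set<DocId>()
  if (terms.length === 0) return result

  // Collect one postings list per term; a term missing from the index
  // means no document can match all terms.
  const lists: Map<DocId, number>[] = []
  for (const term of terms) {
    const postings = idx.get(term)
    if (!postings) return result
    lists.push(postings)
  }

  // Walk the shortest list and keep ids present in all the others.
  lists.sort((a, b) => a.size - b.size)
  const [first, ...rest] = lists
  first.forEach((_freq, id) => {
    if (rest.every((p) => p.has(id))) result.add(id)
  })
  return result
}
```

This only tells you which documents match, not in what order to show them.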
Ranking
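Raw matches are useless on their own ("the" matches everything). You want BM25: a ranking function that rewards rare terms, penalises long documents, and grew out of probabilistic retrieval research from the 1970s and 80s before settling into its current form in the mid-90s.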
The full formula looks intimidating. The intuition isn't:
A term is more relevant the more it appears in this document, and less relevant the more documents it appears in.
That's it. BM25 formalises that intuition with two tuning constants (k1, b) that almost no one needs to tune.
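To make that concrete, here is a sketch of BM25 scoring over the index type from earlier. It makes a few assumptions of mine: a docLengths map (token count per document) that you'd collect at indexing time, a common non-negative variant of the IDF term, and the widely used defaults k1 = 1.2 and b = 0.75. It also scores any document containing at least one query term, so combine it with an intersection step if you want strict AND matching.

```typescript
type DocId = string // assumption: any stable unique key works
type Index = Map<string, Map<DocId, number>>

// Score documents against the query terms with BM25.
// docLengths is hypothetical helper data: number of tokens in each document.
function bm25(
  idx: Index,
  docLengths: Map<DocId, number>,
  terms: string[],
  k1 = 1.2,
  b = 0.75,
): Map<DocId, number> {
  const scores = new Map<DocId, number>()
  const n = docLengths.size
  if (n === 0) return scores

  let total = 0
  docLengths.forEach((len) => { total += len })
  const avgLen = total / n

  for (const term of terms) {
    const postings = idx.get(term)
    if (!postings) continue
    // Rare terms get a high idf; terms in nearly every document get one near zero.
    const idf = Math.log(1 + (n - postings.size + 0.5) / (postings.size + 0.5))
    postings.forEach((tf, id) => {
      const len = docLengths.get(id) ?? avgLen
      // k1 caps how much repeated occurrences help; b sets how strongly
      // longer-than-average documents are discounted.
      const norm = tf + k1 * (1 - b + b * (len / avgLen))
      const termScore = (idf * tf * (k1 + 1)) / norm
      scores.set(id, (scores.get(id) ?? 0) + termScore)
    })
  }
  return scores
}
```

Sort the resulting entries by score, descending, and you have ranked results.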
Performance
For a corpus under ~50k documents, the index fits in memory, queries are sub-millisecond, and you never need to think about it again. The day you cross that threshold is the day to look at a real engine. Until then, you're running a distributed cluster to power a search box that queries less data than a single DSLR photo.
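Don't take my word for the timings; measure on your own corpus. A crude helper like this one is enough for a sanity check. The name averageMs and the run count are arbitrary choices of mine, and averaging over many calls is only there to work around the coarse resolution of Date.now().

```typescript
// Average wall-clock milliseconds per call of fn, measured over `runs` calls.
function averageMs(fn: () => unknown, runs = 10000): number {
  const start = Date.now()
  for (let i = 0; i < runs; i++) fn()
  return (Date.now() - start) / runs
}

// Stand-in workload; in practice you'd time a real query against your index.
const demo = new Map<string, number>([["crane", 1]])
console.log(averageMs(() => demo.get("crane")), "ms per call")
```

Treat the number as a ballpark figure: the first calls run before the JIT has warmed up, and a toy workload like the one above says nothing about your real queries.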
When not to do this
Don't roll your own if you need fuzzy matching, multi-language tokenisation, or faceted filtering at scale. Don't roll your own if someone else on the team has to maintain it.
But for a blog, a docs site, a small internal tool — 200 lines is less code than the YAML you'd write to configure Elasticsearch.