How Search Engines Work
Ever wondered what’s happening under the hood of Elasticsearch, Lucene, or Typesense? I recently went down that rabbit hole and built a tiny search engine in Go. Here’s a breakdown of the key concepts that power every full-text search system.
Inverted Index: The Heart of Search
Normally, a “forward index” looks like this:
doc1 → ["go", "makes", "search", "fun"]
doc2 → ["building", "search", "engines", "in", "go"]
That’s great if you want to read the doc.
But search engines flip it around into an inverted index:
"go" → [doc1, doc2]
"search" → [doc1, doc2]
"engines" → [doc2]
Posting Lists: Beyond Just Doc IDs
A posting list is the array of documents where a term appears. At its simplest:
"go" → [2, 5, 9, 15]
But real systems store much more for ranking and phrase queries:
"go" → [
(doc2, tf=3, positions=[7,20,45]),
(doc5, tf=1, positions=[12]),
(doc9, tf=2, positions=[4,99])
]
docID → internal numeric ID (efficient, compressible).
tf → term frequency in that doc.
positions → where in the doc the term appears (for “exact phrase” queries).
Postings are stored sorted, delta-encoded, and compressed with varints or block codecs for space + speed.
Are Tokens Hashed or Stored as Words?
You might think search engines hash every word.
But: No. Terms are stored as actual strings in a term dictionary (compressed FST/trie).
Why? Because we want:
Prefix queries:
"go*"→ “go”, “goal”, “gopher”Wildcards, ranges, fuzzy matching
Hashing alone would break that.
Hashing is used in specialized cases (e.g. Bloom filters, approximate matching), but for core text search, real tokens are stored.
Dealing With Multiple Fields
A JSON document has multiple fields:
{
"title": "Go in Action",
"authors": ["Donovan"],
"year": 2016
}
Search engines index each field separately:
titleinverted indexauthorsinverted indexyearstored in a numeric/facet index
At query time, you decide:
Search only in
title(title:go)Or search across multiple (
q=go&query_by=title,authors)
Results from each field are merged, often with weights (title hits > body hits)
Ranking: Why Some Results Come First
Boolean matching (“doc contains ‘go’ or not”) is too crude.
Modern engines use BM25, a refinement of TF-IDF, to rank results.
For a term t in doc d:
score(t, d) = idf(t) * ( tf * (k1+1) )
--------------------------------
tf + k1 * (1 - b + b * (docLen / avgDocLen))
Where:
tf= term frequency in docidf= rarity across all docsdocLen= doc lengthavgDocLen= average doc length
So:
Rare terms → higher weight
More occurrences → higher score (but with diminishing returns)
Shorter docs → normalized (so long docs don’t dominate)
Example:
Title = “Go in Action” →
"go"tf=1, docLen=3 → strong scoreBody = “… go mentioned once in 2000 words …” → weak score
Wrap-Up
Inverted index flips docs into terms → docIDs
Posting lists store docIDs + frequency + positions
Tokens are stored, not just hashed (compressed in a trie/FST)
Each field has its own index, combined at query time
Ranking with BM25 makes search results meaningful
When you run a search query, the engine:
Looks up terms in the term dictionary
Reads postings (docIDs, tf, positions)
Scores docs with BM25
Retrieves your original JSON fields
Returns the top-k results
That’s the same pipeline behind Elasticsearch, Solr, Typesense, and Meilisearch 🚀