How Search Engines Work

Ever wondered what’s happening under the hood of Elasticsearch, Lucene, or Typesense? I recently went down that rabbit hole and built a tiny search engine in Go. Here’s a breakdown of the key concepts that power every full-text search system.

Inverted Index: The Heart of Search

Normally, a “forward index” looks like this:

doc1 → ["go", "makes", "search", "fun"]
doc2 → ["building", "search", "engines", "in", "go"]

That’s great if you want to read the doc.
But search engines flip it around into an inverted index:

"go"      → [doc1, doc2]
"search"  → [doc1, doc2]
"engines" → [doc2]

Posting Lists: Beyond Just Doc IDs

A posting list is the array of documents where a term appears. At its simplest:

"go" → [2, 5, 9, 15]

But real systems store much more for ranking and phrase queries:

"go" → [
  (doc2, tf=3, positions=[7,20,45]),
  (doc5, tf=1, positions=[12]),
  (doc9, tf=2, positions=[4,99])
]

docID → internal numeric ID (efficient, compressible).
tf → term frequency in that doc.
positions → where in the doc the term appears (for “exact phrase” queries).

Postings are stored sorted, delta-encoded, and compressed with varints or block codecs for space + speed.

Are Tokens Hashed or Stored as Words?

You might think search engines hash every word.
But: No. Terms are stored as actual strings in a term dictionary (compressed FST/trie).

Why? Because we want:

Prefix queries: "go*" → “go”, “goal”, “gopher”
Wildcards, ranges, fuzzy matching

Hashing alone would break that.

Hashing is used in specialized cases (e.g. Bloom filters, approximate matching), but for core text search, real tokens are stored.

Dealing With Multiple Fields

A JSON document has multiple fields:

{
  "title": "Go in Action",
  "authors": ["Donovan"],
  "year": 2016
}

Search engines index each field separately:

title inverted index
authors inverted index
year stored in a numeric/facet index

At query time, you decide:

Search only in title (title:go)
Or search across multiple (q=go&query_by=title,authors)

Results from each field are merged, often with weights (title hits > body hits)

Ranking: Why Some Results Come First

Boolean matching (“doc contains ‘go’ or not”) is too crude.
Modern engines use BM25, a refinement of TF-IDF, to rank results.

For a term t in doc d:

score(t, d) = idf(t) * ( tf * (k1+1) )
              --------------------------------
              tf + k1 * (1 - b + b * (docLen / avgDocLen))

Where:

tf = term frequency in doc
idf = rarity across all docs
docLen = doc length
avgDocLen = average doc length

So:

Rare terms → higher weight
More occurrences → higher score (but with diminishing returns)
Shorter docs → normalized (so long docs don’t dominate)

Example:

Title = “Go in Action” → "go" tf=1, docLen=3 → strong score
Body = “… go mentioned once in 2000 words …” → weak score

Wrap-Up

Inverted index flips docs into terms → docIDs
Posting lists store docIDs + frequency + positions
Tokens are stored, not just hashed (compressed in a trie/FST)
Each field has its own index, combined at query time
Ranking with BM25 makes search results meaningful

When you run a search query, the engine:

Looks up terms in the term dictionary
Reads postings (docIDs, tf, positions)
Scores docs with BM25
Retrieves your original JSON fields
Returns the top-k results

That’s the same pipeline behind Elasticsearch, Solr, Typesense, and Meilisearch 🚀

How Search Engines Work

Inverted Index: The Heart of Search

Posting Lists: Beyond Just Doc IDs

Are Tokens Hashed or Stored as Words?

Dealing With Multiple Fields

Ranking: Why Some Results Come First

Wrap-Up

Comments

More from this blog

High-Velocity CQRS Bidding Service — Design Document

Concurrency in Go: From Goroutines to Production-Grade Patterns

Upgrade Paths for PostgreSQL and Cloud SQL Users on GCP: Choosing Between CockroachDB and Cloud Spanner

Designing a Scalable Blogging Platform on Google Cloud – A System Design + GCP PA Certification Exercise

Command Palette

Inverted Index: The Heart of Search

Posting Lists: Beyond Just Doc IDs

Are Tokens Hashed or Stored as Words?

Dealing With Multiple Fields

Ranking: Why Some Results Come First

Wrap-Up

Comments

More from this blog