Skip to main content

Indexing

Indexing is where "it works on my laptop" turns into "it works with ten million documents." Here's the data structure underneath — once you see it, you'll know why a query is fast or slow without guessing.

The problem indexes solve

Without an index, this query has to read every single document and test it:

db.users.find({ email: "a@x.com" })

That's a collection scanO(n). Fine for 100 docs, fatal for 10 million. An index is a separate, sorted structure that lets MongoDB jump straight to the match, turning that into roughly O(log n).

The structure: a B-tree

MongoDB indexes are B-trees (WiredTiger implements them as B+ trees). This is the same structure Postgres and MySQL use — learn it once, recognize it everywhere.

A B-tree is a sorted, self-balancing tree where each node holds many keys and many children:

Why "many keys per node" instead of a binary tree? Disk reads. Each node is sized to a page (one I/O). High fan-out makes the tree extremely shallow — billions of keys sit only ~3–4 levels deep. So a lookup is ~3–4 page reads instead of ~30. B-trees exist to minimize disk reads, which is the whole game.

The B+ tree refinement (what WiredTiger uses): all real values live in the leaf nodes, and leaves are linked in sorted order like a linked list. That's why these are all fast on an indexed field:

Query shapeWhy it's fast
Equality (email = x)Walk root → leaf, O(log n)
Range (age >= 18 && age <= 30)Find 18, then walk the linked leaves to 30
Sort (sort({ age: 1 }))The index is already sorted — no in-memory sort needed
Play with one live

The thing that finally made B-trees click for me was watching them rebalance. Open the USFCA B-tree visualizer and insert keys one at a time — watch a node fill up, split, and push a key up to its parent. That split-and-promote is exactly how the tree stays shallow and balanced no matter what order the data arrives in.

Creating and reading indexes

db.users.createIndex({ email: 1 })
db.users.createIndex({ email: 1 }, { unique: true })
db.users.createIndex({ lastName: 1, firstName: 1 })
db.users.getIndexes()

1 = ascending, -1 = descending. _id is always indexed automatically.

The compound index rule everyone trips on

A compound index { a: 1, b: 1, c: 1 } is one tree sorted by a, then b, then c — like sorting a spreadsheet by column A, then B, then C. It can serve a query that uses a left prefix of those keys:

A handy ordering heuristic is ESR: put Equality fields first, then Sort fields, then Range fields.

The index types beyond the basics

A plain single/compound index is just the start. The tree is the same B-tree underneath — these types change what key gets stored in it:

TypeWhat it indexesReach for it when
Multikeycreated automatically the moment you index an array field — one index entry per elementyou query inside arrays ({ roles: "admin" }). Caveat: you can't compound-index two array fields together
TTLa single Date field; a background thread deletes docs N seconds past itsessions, logs, carts, anything self-expiring
Partialonly the docs matching a filter — a smaller, cheaper treeyou only ever query a subset (e.g. { status: "active" }); prefer this over sparse
Sparseonly docs where the field existsan optional field most docs lack
Texttokenized words (stemming, stop-words) for $text searchbasic keyword search (one text index per collection)
2dsphereGeoJSON points/shapesgeo queries — $near, $geoWithin
Hashedthe hash of a field's valuehashed sharding & equality (never range)
Wildcard{ "$**": 1 } — arbitrary/unknown field namesgrab-bag documents where you can't predict the keys

Two properties worth knowing: collation makes an index case/locale-insensitive, and a hidden index is maintained but invisible to the planner — perfect for testing whether dropping an index hurts, with an instant rollback.

Covered queries — a free win

If a query's filter and the fields it returns are all inside the index, MongoDB answers entirely from the index tree and never fetches the document:

db.users.createIndex({ email: 1, name: 1 })
db.users.find({ email: "a@x.com" }, { _id: 0, email: 1, name: 1 })

Don't index everything

Every index is another B-tree that must be updated on every write, and it competes for that precious RAM cache. So: index the fields you actually filter and sort on, prefer compound indexes that serve several queries, and drop the ones nobody uses.

Prove it with explain()

MongoDB has a query planner that picks an index. You can see its choice:

db.users.find({ email: "a@x.com" }).explain("executionStats")

What I look for:

  • IXSCAN (index scan) = 🟢 good. COLLSCAN (collection scan) = 🔴 no usable index.
  • The winningPlan is what ran; rejectedPlans are the alternatives the planner raced and lost. The planner actually trial-runs candidates and caches the winner for similar queries.
  • The three numbers that tell the truth: keysExamineddocsExaminednReturned. The dream is 1 : 1 : 1 (or n : 0 : n for a covered query). If you examined 1,000,000 keys/docs to return 10, your index is wrong or missing.

Recap

An index is a separate, sorted B+ tree. High fan-out keeps it shallow (few disk reads); sorted, linked leaves make equality, range, and sort O(log n). Compound indexes follow the left-prefix / ESR rule, covered queries skip the document fetch, and explain() tells you the truth — aim for IXSCAN.

👉 Next: Querying & Aggregation