
How It Stores Data

This is the page I most wish existed when I started. You can use MongoDB for months without knowing any of this — but the moment you understand it, performance and durability stop being mysteries.

First: BSON, not JSON

You write JSON, but MongoDB stores BSON — Binary JSON. Why bother converting?

BSON buys three things plain JSON text can't:

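  1. Speed. Every element carries a type tag, and variable-length values (strings, embedded documents, arrays, binary) are length-prefixed while fixed-width types have a known size, so MongoDB can skip over a field it doesn't need with pointer math instead of parsing characters one by one.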
  2. Richer types. JSON only has string/number/bool/null/array/object. BSON adds ObjectId, real Date, int32/int64, Decimal128, binary. This is why dates and ids "just work" for sorting and range queries.
  3. Traversability — it's designed to be walked field-by-field efficiently.

The trade-off: BSON is sometimes slightly larger than the equivalent JSON text (it stores lengths and type tags). It spends a little space to buy a lot of speed. Good deal.
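
To make the length prefixes concrete, here is a minimal sketch of a BSON encoder that handles only strings and int32 values, plus a reader that skips fields it doesn't need. It follows the public BSON layout (total length, then type tag + name + value per element), but the function names are made up for illustration and this is not MongoDB's own code.

```python
import json
import struct

def encode_element(name, value):
    """Encode one field as: type tag, NUL-terminated name, value bytes."""
    key = name.encode("utf-8") + b"\x00"
    if isinstance(value, bool):
        raise TypeError("this sketch only handles str and int32")
    if isinstance(value, str):
        data = value.encode("utf-8") + b"\x00"
        # 0x02 = string: int32 byte length (including the NUL), then the bytes
        return b"\x02" + key + struct.pack("<i", len(data)) + data
    if isinstance(value, int):
        # 0x10 = int32: fixed 4 bytes, so no length prefix is needed
        return b"\x10" + key + struct.pack("<i", value)
    raise TypeError("this sketch only handles str and int32")

def encode_document(doc):
    """Document = int32 total size, the elements, then a closing NUL byte."""
    body = b"".join(encode_element(k, v) for k, v in doc.items())
    return struct.pack("<i", 4 + len(body) + 1) + body + b"\x00"

def find_field(buf, wanted):
    """Walk the document, jumping over values we don't want."""
    pos = 4                                  # skip the document length
    while buf[pos] != 0:                     # 0x00 marks the end of the document
        type_tag = buf[pos]
        name_end = buf.index(b"\x00", pos + 1)
        name = buf[pos + 1:name_end].decode("utf-8")
        pos = name_end + 1
        if type_tag == 0x02:
            (length,) = struct.unpack_from("<i", buf, pos)
            size = 4 + length                # the prefix tells us how far to jump
        elif type_tag == 0x10:
            size = 4
        else:
            raise ValueError("type not handled in this sketch")
        if name == wanted:
            if type_tag == 0x02:
                return buf[pos + 4:pos + 4 + length - 1].decode("utf-8")
            return struct.unpack_from("<i", buf, pos)[0]
        pos += size                          # pointer math, no character parsing
    return None

encoded = encode_document({"hello": "world"})
json_text = json.dumps({"hello": "world"})
```

For this tiny document the BSON form is 22 bytes and the JSON text is 18 characters, which is the space-for-speed trade described above.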

The storage engine: WiredTiger

Here's the question most tutorials skip: what actually writes the bytes to disk? That's the storage engine — a pluggable layer underneath the query language responsible for files, caching, concurrency, and durability.

Since MongoDB 3.2 the default engine is WiredTiger. (The old one, MMAPv1, was removed in 4.2 — ignore it if you see it in old docs.)

WiredTiger gives MongoDB five things you should be able to explain:

1. Everything is a B-tree

WiredTiger stores every collection and every index as its own B-tree file on disk (under your data path you'll literally see one .wt file per collection and per index). So a "collection" is physically a tree; an "index" is another tree pointing into it. We go deep on that structure in Indexing.
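
If "an index is another tree pointing into the collection" feels abstract, this rough sketch may help. It uses sorted Python lists with bisect as a stand-in for B-trees, and assumes index entries point at an internal record id. The sample documents and the find_by_email helper are hypothetical.

```python
import bisect

# "Collection tree": documents kept in order of an internal record id.
record_ids = [1, 2, 3]
documents = [
    {"_id": "a", "email": "zoe@example.com"},
    {"_id": "b", "email": "adam@example.com"},
    {"_id": "c", "email": "mia@example.com"},
]

# "Index tree": (indexed key, record id) pairs kept in key order.
email_index = sorted(
    (doc["email"], rid) for rid, doc in zip(record_ids, documents)
)

def find_by_email(email):
    """Two ordered lookups: first the index, then the collection."""
    i = bisect.bisect_left(email_index, (email,))
    if i == len(email_index) or email_index[i][0] != email:
        return None
    record_id = email_index[i][1]            # the index only stores a pointer
    j = bisect.bisect_left(record_ids, record_id)
    return documents[j]                      # the collection holds the document
```

A real B-tree spreads its keys across fixed-size pages rather than one list, but the two-step lookup has the same shape.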

2. The cache — why RAM is everything

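WiredTiger keeps the hot working set of tree pages in an in-memory cache (by default the larger of 50% of (RAM − 1 GB) and 256 MB). The read path is short: a query looks for the page in the cache first. On a hit it is served straight from RAM; on a miss WiredTiger reads the page from disk, decompresses it, and places it in the cache, evicting a colder page if it needs the room.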
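
Here is a toy version of that read path. It assumes a plain LRU eviction policy, which is a simplification (WiredTiger's real eviction is more involved), and PageCache and hit_rate are hypothetical names.

```python
from collections import OrderedDict

class PageCache:
    """A tiny LRU page cache: check RAM first, fall back to 'disk' on a miss."""

    def __init__(self, capacity_pages):
        self.capacity = capacity_pages
        self.pages = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, page_id, read_from_disk):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)   # mark as recently used
            self.hits += 1
            return self.pages[page_id]
        self.misses += 1
        page = read_from_disk(page_id)        # the slow path
        self.pages[page_id] = page
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)    # evict the least recently used page
        return page

def hit_rate(capacity_pages, working_set_pages, rounds=10):
    """Repeatedly scan a working set and report the fraction served from cache."""
    cache = PageCache(capacity_pages)
    for _ in range(rounds):
        for page_id in range(working_set_pages):
            cache.read(page_id, lambda pid: f"page-{pid}")
    return cache.hits / (cache.hits + cache.misses)
```

With a 100-page cache, an 80-page working set is served from RAM after the first pass (hit rate 0.9 over ten rounds), while a 120-page cyclic scan never hits at all under plain LRU. That cliff is exaggerated by the toy policy, but it is the same effect as a working set that no longer fits.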

The single most important performance fact about MongoDB: you want your indexes plus your working set of documents to fit in RAM. When they do, it flies. When the working set spills past the cache, it starts hitting disk and slows down sharply.
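
As a quick sanity check on sizing, the default cache can be worked out by hand. This sketch assumes the documented default of the larger of 50% of (RAM − 1 GB) and 256 MB; the function names are illustrative, and a real deployment may override the size in its configuration.

```python
def default_cache_gb(ram_gb):
    """Default WiredTiger cache size in GB, assuming no explicit override."""
    return max(0.5 * (ram_gb - 1), 0.25)

def fits_in_cache(ram_gb, index_gb, hot_documents_gb):
    """Rough check: do the indexes plus the hot documents fit in the cache?"""
    return index_gb + hot_documents_gb <= default_cache_gb(ram_gb)
```

On a 16 GB machine this gives 7.5 GB of cache, so 3 GB of indexes plus 4 GB of hot documents fit, while 3 GB plus 6 GB do not.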

3. Durability — the write-ahead journal

WiredTiger does not write every change straight to the main data files — that would be slow random I/O. Instead:

  • Each write is first appended to a journal (a Write-Ahead Log) — sequential and fast.
  • Every ~60 seconds a checkpoint flushes a consistent snapshot to the data files.
  • On a crash, MongoDB loads the last checkpoint and replays the journal to recover anything newer.

This is the classic database move: sequential log now, organized files later. It turns slow random writes into fast sequential ones while staying crash-safe. You'll meet the exact same pattern again in PostgreSQL.
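
The journal, checkpoint, and replay steps above can be sketched in a few lines. This is an in-memory toy, not WiredTiger's implementation: TinyEngine is a hypothetical class, and the "durable" parts are just attributes that survive the simulated crash.

```python
class TinyEngine:
    """Toy write-ahead logging: journal first, checkpoint later, replay on crash."""

    def __init__(self):
        self.data_files = {}   # durable: state as of the last checkpoint
        self.journal = []      # durable: append-only log since that checkpoint
        self.cache = {}        # volatile: lost on a crash

    def write(self, key, value):
        self.journal.append((key, value))   # 1. sequential append comes first
        self.cache[key] = value             # 2. then the in-memory state changes

    def checkpoint(self):
        self.data_files = dict(self.cache)  # flush a consistent snapshot
        self.journal.clear()                # older journal entries are no longer needed

    def crash(self):
        self.cache = {}                     # RAM is gone, durable state remains

    def recover(self):
        self.cache = dict(self.data_files)  # load the last checkpoint
        for key, value in self.journal:     # replay anything newer
            self.cache[key] = value

engine = TinyEngine()
engine.write("a", 1)
engine.checkpoint()
engine.write("b", 2)
engine.write("a", 3)
engine.crash()
engine.recover()
```

After recovery the cache holds a = 3 and b = 2 even though the data files still only contain a = 1 from the checkpoint. The journal covers the gap.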

4. Concurrency — MVCC, document-level

WiredTiger uses MVCC (Multi-Version Concurrency Control): it keeps multiple versions of a document and hands each operation a consistent snapshot in time. Readers don't block writers, writers don't block readers, and locking happens at the document level — so two writes to different documents in the same collection don't fight. (MMAPv1 locked the whole collection — this was a huge upgrade.) Those snapshots are also what make multi-document transactions possible.
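
A minimal sketch of the snapshot idea, assuming a simple integer commit clock: each write adds a new version instead of overwriting, and a reader only sees versions committed at or before its snapshot. VersionedStore is a hypothetical name, and real MVCC also has to handle conflicts and clean up old versions.

```python
class VersionedStore:
    """Toy MVCC: every document keeps a list of (commit_ts, value) versions."""

    def __init__(self):
        self.versions = {}   # doc_id -> versions, oldest first
        self.clock = 0       # logical commit timestamp

    def write(self, doc_id, value):
        self.clock += 1
        self.versions.setdefault(doc_id, []).append((self.clock, value))
        return self.clock

    def snapshot(self):
        """A snapshot is just the commit timestamp at the moment it was taken."""
        return self.clock

    def read(self, doc_id, snapshot_ts):
        """Return the newest version that was committed at or before the snapshot."""
        for commit_ts, value in reversed(self.versions.get(doc_id, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None

store = VersionedStore()
store.write("order:1", "pending")
reader_snapshot = store.snapshot()       # a long-running read starts here
store.write("order:1", "shipped")        # a writer commits without waiting
```

The reader holding reader_snapshot still sees "pending", while a fresh snapshot sees "shipped". Neither operation had to block the other.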

5. Compression — why the files are smaller than your data

This is the detail behind the word "compressed" in the recap. WiredTiger compresses at two layers, and it's a deliberate CPU-for-I/O trade: spend a little CPU to shrink the bytes, so more of your data fits in cache and fewer reads have to go to disk.

Layer              Default             What it does
Collection blocks  snappy              Compresses each ~32 KB data block; zstd and zlib are options for a higher ratio at more CPU cost.
Index pages        prefix compression  Adjacent keys in a sorted B-tree share long common prefixes ("user:1001", "user:1002"), so the prefix is stored once.

The win compounds: smaller pages mean the same RAM cache holds more of your working set, which is the single biggest performance lever from the section above.
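
Both layers are easy to try out in miniature. The sketch below uses zlib from the Python standard library as the block compressor (snappy, the default, is not in the stdlib) and a simple "shared prefix length + suffix" scheme for keys. The on-disk formats WiredTiger actually uses differ in detail, and the function names are made up.

```python
import json
import os
import zlib

def prefix_compress(sorted_keys):
    """Store each key as (chars shared with the previous key, remaining suffix)."""
    entries = []
    previous = ""
    for key in sorted_keys:
        shared = len(os.path.commonprefix([previous, key]))
        entries.append((shared, key[shared:]))
        previous = key
    return entries

def prefix_decompress(entries):
    """Rebuild the full keys by reusing the prefix of the previous key."""
    keys = []
    previous = ""
    for shared, suffix in entries:
        key = previous[:shared] + suffix
        keys.append(key)
        previous = key
    return keys

def compress_block(documents):
    """Serialize a block of documents and compress it as one unit."""
    raw = json.dumps(documents).encode("utf-8")
    return raw, zlib.compress(raw)

keys = ["user:1001", "user:1002", "user:1003"]
compact_keys = prefix_compress(keys)
raw_block, packed_block = compress_block(
    [{"status": "active", "country": "US"}] * 200
)
```

The second and third keys shrink to a single stored character each, and the repetitive block compresses to a small fraction of its raw size. Real data is less repetitive, so expect smaller gains.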

Recap

MongoDB stores documents as BSON inside WiredTiger, which keeps each collection and index as a compressed B-tree, serves a hot working set from an in-RAM cache, guarantees durability with a journal + checkpoints, and isolates concurrent work with document-level MVCC.

Now that you know data and indexes are both B-trees, let's see why that makes queries fast.

👉 Next: Indexing