Data Modeling

This is the decision that makes or breaks a MongoDB project: do you embed related data inside the document, or reference it in another collection? Get it right and your common reads stay fast and simple. Get it wrong and you fight the database forever.

The golden rule I keep coming back to:

What gets read together should be stored together.

Embed: nest the data inside

{
  "_id": 1,
  "name": "Avina",
  "addresses": [
    { "city": "NYC" },
    { "city": "LA" }
  ]
}

One read gets everything, and writes to a single document are atomic. Embed when the related data is:

  • owned by the parent,
  • read together with it, and
  • bounded in size.

Classic fits: an order and its line items, a blog post and its handful of comments, a user and their settings.
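
To make the "one read, one atomic write" point concrete, here is a rough sketch in plain JavaScript that runs without a database. The `users` collection name and the `addAddressUpdate` helper are assumptions for illustration; `$push` is the standard MongoDB array update operator.

```javascript
// Hypothetical helper: builds the filter and update document for appending
// one address to a user's embedded `addresses` array.
function addAddressUpdate(userId, city) {
  return {
    filter: { _id: userId },
    update: { $push: { addresses: { city } } },
  };
}

const addAddress = addAddressUpdate(1, "Chicago");

// In mongosh, assuming a `users` collection, this would be applied with
//   db.users.updateOne(addAddress.filter, addAddress.update)
// and the user plus every address read back in one call:
//   db.users.findOne({ _id: 1 })
console.log(JSON.stringify(addAddress));
```

Because the address lives inside the user document, the write touches exactly one document, so there is no window where the user exists without the new address.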

Reference: point to another collection

// users
{ "_id": 1, "name": "Avina" }

// posts
{ "_id": 99, "title": "Hi", "authorId": 1 }

Reference when the data is:

  • shared across many parents,
  • large or unbounded, or
  • independent, with its own lifecycle.

Classic fits: authors ↔ posts, products ↔ orders, anything many-to-many.
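
Referencing means someone has to resolve the pointer at read time. A minimal sketch of both options, using the two documents above: an application-side join in plain JavaScript (the `attachAuthors` helper is hypothetical), and the shape of the equivalent `$lookup` aggregation stage, assuming the collections are named `users` and `posts`.

```javascript
const users = [{ _id: 1, name: "Avina" }];
const posts = [{ _id: 99, title: "Hi", authorId: 1 }];

// Application-side join: index users by _id, then attach the author to each post.
function attachAuthors(postList, userList) {
  const byId = new Map(userList.map((u) => [u._id, u]));
  return postList.map((p) => ({ ...p, author: byId.get(p.authorId) || null }));
}

// Server-side alternative: a $lookup stage for db.posts.aggregate([...]).
// Note that $lookup writes an array into `author`, even for a single match.
const lookupStage = {
  $lookup: {
    from: "users",
    localField: "authorId",
    foreignField: "_id",
    as: "author",
  },
};

const joined = attachAuthors(posts, users);
console.log(joined[0].title, "by", joined[0].author.name);
```

Either way, the read costs a second lookup, which is the price you pay for not duplicating the author on every post.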

The decision in one table

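| Question                | Lean embed | Lean reference |
|-------------------------|------------|----------------|
| Read together?          | Yes        | No             |
| Does it grow unbounded? | No         | Yes            |
| Shared by many parents? | No         | Yes            |
| Updated independently?  | No         | Yes            |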

The trap: unbounded arrays

The most common modeling mistake is embedding something that grows forever, like every comment on a viral post or every event for a user. A single document is capped at 16 MB, and the server loads the whole document to read any part of it, so a huge array costs memory on every read even when you only want one field. An array that only grows is a time bomb. The moment data is unbounded, reference it.
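
As a sketch of the referenced alternative, each comment becomes its own document that points back at the post, and the app reads one page at a time. The `comments` collection, its field names, and the `latestComments` helper are assumptions for illustration; the in-memory function mirrors what the mongosh query in the comment would return.

```javascript
// One document per comment, each pointing back at its post.
const comments = [
  { _id: 1, postId: 99, text: "First!", createdAt: new Date("2024-01-01T10:00:00Z") },
  { _id: 2, postId: 99, text: "Nice post", createdAt: new Date("2024-01-01T11:00:00Z") },
  { _id: 3, postId: 7, text: "Other thread", createdAt: new Date("2024-01-01T12:00:00Z") },
];

// In-memory stand-in for:
//   db.comments.find({ postId: 99 }).sort({ createdAt: -1 }).limit(limit)
function latestComments(all, postId, limit) {
  return all
    .filter((c) => c.postId === postId)          // only this post's comments
    .sort((a, b) => b.createdAt - a.createdAt)   // newest first
    .slice(0, limit);                            // one page, never the full list
}

const page = latestComments(comments, 99, 1);
console.log(page.map((c) => c.text));
```

The post document stays the same size no matter how many comments arrive, and an index on `postId` (plus `createdAt`) keeps the paged read cheap.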

The patterns that fall out of this

"Embed vs reference" is the foundation, but the community has named a handful of recurring moves that solve the tricky middle cases. These are the ones I actually see earn their keep:

| Pattern            | The move                                                              | Solves                                                 |
|--------------------|-----------------------------------------------------------------------|--------------------------------------------------------|
| Subset             | embed the hot slice (last 10 comments), reference the rest            | a long list where you only show a few                  |
| Computed           | precompute totals/averages on write, store them                       | expensive aggregations on every read                   |
| Bucket             | pack many readings into one doc per time window (1 doc/hour)          | high-frequency events that would explode the doc count |
| Extended reference | copy a few fields from the referenced doc (author name onto the post) | avoiding a $lookup for the common read                 |
| Schema versioning  | stamp a schemaVersion field, migrate lazily on read/write             | evolving shape with zero-downtime migration            |
| Polymorphic        | different shapes in one collection sharing a type discriminator       | "products" that are books and shoes                    |
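
The Subset and Computed rows often show up together, so here is one hedged sketch covering both. The field names `recentComments` and `commentCount` and both helpers are hypothetical; `$push` with `$each` and `$slice`, and `$inc`, are standard MongoDB update operators. The second function replays the same update on a plain object so the effect is visible without a database.

```javascript
// Update document for a post: append a comment, keep only the newest `keep`
// embedded comments (Subset), and maintain a running total (Computed).
function pushRecentComment(comment, keep = 10) {
  return {
    $push: { recentComments: { $each: [comment], $slice: -keep } },
    $inc: { commentCount: 1 },
  };
}

// In-memory equivalent of applying that update to one post document.
function applyLocally(post, comment, keep = 10) {
  const recentComments = [...post.recentComments, comment].slice(-keep);
  return { ...post, recentComments, commentCount: post.commentCount + 1 };
}

const post = { _id: 99, title: "Hi", recentComments: ["a", "b"], commentCount: 2 };
const updated = applyLocally(post, "c", 2);
console.log(updated.recentComments, updated.commentCount);
```

The full comment history would still live in its own collection; the post only carries the slice the page actually renders, plus a count it never has to recompute.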

The throughline: each one trades a little duplication or write-time work to make the read that your app does constantly cheap. That's the whole game.
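
The Bucket pattern is a good illustration of that trade. A minimal sketch, assuming hourly buckets keyed by a hypothetical `sensorId` and a truncated `hour` timestamp; the `bucketUpdate` helper and all field names are illustrative, while `$push`, `$inc`, and the `upsert` option are standard MongoDB features.

```javascript
// Builds an upsert that files one reading into its hourly bucket document.
function bucketUpdate(sensorId, reading) {
  // Truncate the timestamp to the start of its UTC hour: that is the bucket key.
  const hour = new Date(reading.ts);
  hour.setUTCMinutes(0, 0, 0);

  return {
    filter: { sensorId, hour },
    update: {
      $push: { readings: { ts: reading.ts, value: reading.value } },
      $inc: { count: 1 }, // readings stored in this bucket so far
    },
    options: { upsert: true }, // the first reading of the hour creates the bucket
  };
}

const op = bucketUpdate("s1", { ts: new Date("2024-01-01T10:42:17Z"), value: 21.5 });
// In mongosh, assuming a `readings` collection:
//   db.readings.updateOne(op.filter, op.update, op.options)
console.log(op.filter.hour.toISOString());
```

Each bucket is bounded by its time window, so this stays on the safe side of the unbounded-array trap as long as the write rate per window is bounded too.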

A reality check

MongoDB has no enforced foreign keys: if a post points at an authorId and you later delete that author, nothing stops you. Referential integrity is your job, either in application code or by wrapping the multi-collection write in a transaction. That's the price of flexibility, and it's why you model deliberately around your read patterns instead of normalizing for its own sake.
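
What "in application code" can look like, as a rough in-memory sketch: a guard that runs before deleting an author, and a sweep that finds posts whose author no longer exists. Both helpers and the second post are hypothetical; in a real app the same checks would be queries against the `users` and `posts` collections.

```javascript
const allUsers = [{ _id: 1, name: "Avina" }];
const allPosts = [
  { _id: 99, title: "Hi", authorId: 1 },
  { _id: 100, title: "Left behind", authorId: 2 }, // author 2 was already deleted
];

// Guard: refuse to delete a user who still has posts pointing at them.
function canDeleteUser(userId, posts) {
  return !posts.some((p) => p.authorId === userId);
}

// Sweep: list posts whose authorId matches no existing user.
function findOrphanedPosts(posts, users) {
  const userIds = new Set(users.map((u) => u._id));
  return posts.filter((p) => !userIds.has(p.authorId));
}

console.log(canDeleteUser(1, allPosts));
console.log(findOrphanedPosts(allPosts, allUsers).map((p) => p._id));
```

A guard like this is only race-free if the check and the delete happen inside one transaction; otherwise a new post can slip in between the two steps.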

Recap

Embed what's read together and bounded; reference what's shared, large, or unbounded. Watch out for ever-growing arrays, and remember Mongo won't enforce relationships for you — design around the queries your app actually makes.
