Data Modeling

This is the decision that makes or breaks a MongoDB project: for related data, do you embed it inside the document or reference it in another collection? Get this right and everything is fast and simple. Get it wrong and you fight the database forever.

The golden rule I keep coming back to:

What gets read together should be stored together.

Embed: nest the data inside

{
  "_id": 1,
  "name": "Avina",
  "addresses": [
    { "city": "NYC" },
    { "city": "LA" }
  ]
}

One read gets everything, and writes to a single document are atomic. Embed when the related data is:

owned by the parent,
read together with it, and
bounded in size.

Classic fits: an order and its line items, a blog post and its handful of comments, a user and their settings.

Reference: point to another collection

// users
{ "_id": 1, "name": "Avina" }

// posts
{ "_id": 99, "title": "Hi", "authorId": 1 }

Reference when the data is:

shared across many parents,
large or unbounded, or
has its own lifecycle.

Classic fits: authors ↔ posts, products ↔ orders, anything many-to-many.

The decision in one table

Question	Lean embed	Lean reference
Read together?	Yes	No
Does it grow unbounded?	No	Yes
Shared by many parents?	No	Yes
Updated independently?	No	Yes

The trap: unbounded arrays

The most common modeling mistake is embedding something that grows forever — like every comment on a viral post, or every event for a user. Remember the 16 MB document limit and that every read drags the whole document into RAM. An array that only grows is a time bomb. The moment data is unbounded, reference it.

The patterns that fall out of this

"Embed vs reference" is the foundation, but the community has named a handful of recurring moves that solve the tricky middle cases. These are the ones I actually see earn their keep:

Pattern	The move	Solves
Subset	embed the hot slice (last 10 comments), reference the rest	a long list where you only show a few
Computed	precompute totals/averages on write, store them	expensive aggregations on every read
Bucket	pack many readings into one doc per time window (1 doc/hour)	high-frequency events that would explode the doc count
Extended reference	copy a few fields from the referenced doc (author name onto the post)	avoiding a `$lookup` for the common read
Schema versioning	stamp a `schemaVersion` field, migrate lazily on read/write	evolving shape with zero-downtime migration
Polymorphic	different shapes in one collection sharing a `type` discriminator	"products" that are books and shoes

The throughline: each one trades a little duplication or write-time work to make the read that your app does constantly cheap. That's the whole game.

A reality check

MongoDB has no enforced foreign keys — if a post points at an authorId that you later delete, nothing stops you. Referential integrity is your job, in application code (or wrap the multi-collection write in a transaction). That's the price of flexibility, and it's why you model deliberately around your read patterns instead of normalizing for its own sake.

Recap

Embed what's read together and bounded; reference what's shared, large, or unbounded. Watch out for ever-growing arrays, and remember Mongo won't enforce relationships for you — design around the queries your app actually makes.

👉 Next: Scaling

Embed: nest the data inside​

Reference: point to another collection​

The decision in one table​

The trap: unbounded arrays​

The patterns that fall out of this​

A reality check​

Recap​