Data Modeling
This is the decision that makes or breaks a MongoDB project: for related data, do you embed it inside the document or reference it in another collection? Get this right and your common reads stay fast and simple. Get it wrong and you fight the database on every query.
The golden rule I keep coming back to:
What gets read together should be stored together.
Embed: nest the data inside
{
  "_id": 1,
  "name": "Avina",
  "addresses": [
    { "city": "NYC" },
    { "city": "LA" }
  ]
}
One read gets everything, and writes to a single document are atomic. Embed when the related data is:
- owned by the parent,
- read together with it, and
- bounded in size.
Classic fits: an order and its line items, a blog post and its handful of comments, a user and their settings.
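To make "one read gets everything" concrete, here is a minimal sketch that uses a plain Python dict as a stand-in for the users collection. The helper names `get_user` and `add_address` are made up for illustration; against a real database these would be a single find and a single-document update.

```python
# In-memory stand-in for a "users" collection, keyed by _id (illustration only).
users = {
    1: {"_id": 1, "name": "Avina", "addresses": [{"city": "NYC"}, {"city": "LA"}]},
}


def get_user(user_id):
    """One lookup returns the user together with every embedded address."""
    return users[user_id]


def add_address(user_id, city):
    """Only the one parent document changes; no second collection is involved."""
    users[user_id]["addresses"].append({"city": city})


user = get_user(1)
cities = [address["city"] for address in user["addresses"]]
print(cities)  # ['NYC', 'LA']
```

Because the addresses live inside the user, there is no second query and nothing to join.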
Reference: point to another collection
// users
{ "_id": 1, "name": "Avina" }
// posts
{ "_id": 99, "title": "Hi", "authorId": 1 }
Reference when the data:
- is shared across many parents,
- is large or unbounded, or
- has its own lifecycle.
Classic fits: authors ↔ posts, products ↔ orders, anything many-to-many.
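The cost of referencing is that you follow the pointer yourself. Here is a rough sketch of that application-side join, again with plain dicts standing in for the two collections. `get_post_with_author` and `posts_by_author` are hypothetical helpers, not part of any driver.

```python
# In-memory stand-ins for the "users" and "posts" collections (illustration only).
users = {1: {"_id": 1, "name": "Avina"}}
posts = {99: {"_id": 99, "title": "Hi", "authorId": 1}}


def get_post_with_author(post_id):
    """Two reads: fetch the post, then follow authorId into users."""
    post = posts[post_id]
    author = users.get(post["authorId"])  # None if the author no longer exists
    return {**post, "author": author}


def posts_by_author(author_id):
    """The reverse direction: every post that points at one author."""
    return [post for post in posts.values() if post["authorId"] == author_id]


print(get_post_with_author(99)["author"]["name"])  # Avina
print([post["title"] for post in posts_by_author(1)])  # ['Hi']
```

The author is stored once, so renaming Avina touches one document no matter how many posts point at it.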
The decision in one table
| Question | Lean embed | Lean reference |
|---|---|---|
| Read together? | Yes | No |
| Does it grow unbounded? | No | Yes |
| Shared by many parents? | No | Yes |
| Updated independently? | No | Yes |
The trap: unbounded arrays
The most common modeling mistake is embedding something that grows forever, like every comment on a viral post or every event for a user. MongoDB caps a single document at 16 MB, and the server has to load the whole document into memory to read any part of it. An array that only grows is a time bomb. The moment data is unbounded, reference it.
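A quick back-of-the-envelope calculation shows how soon the bomb goes off. The numbers below (about 200 bytes per embedded comment, 5,000 comments a day) are assumptions picked for illustration, not measurements.

```python
MAX_DOC_BYTES = 16 * 1024 * 1024  # the 16 MB per-document limit, in bytes


def max_embedded_items(avg_item_bytes, base_doc_bytes=0):
    """How many items of a given average size fit before the limit is hit."""
    return (MAX_DOC_BYTES - base_doc_bytes) // avg_item_bytes


def days_until_full(avg_item_bytes, items_per_day, base_doc_bytes=0):
    """How long an ever-growing array lasts at a steady write rate."""
    return max_embedded_items(avg_item_bytes, base_doc_bytes) / items_per_day


# Assumed figures: ~200 bytes per comment, 5,000 new comments per day.
print(max_embedded_items(200))  # 83886
print(round(days_until_full(200, 5000), 1))  # 16.8
```

Under those assumptions a viral post hits the wall in under three weeks, and every read gets heavier long before that.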
The patterns that fall out of this
"Embed vs reference" is the foundation, but the community has named a handful of recurring moves that solve the tricky middle cases. These are the ones I actually see earn their keep:
| Pattern | The move | Solves |
|---|---|---|
| Subset | embed the hot slice (last 10 comments), reference the rest | a long list where you only show a few |
| Computed | precompute totals/averages on write, store them | expensive aggregations on every read |
| Bucket | pack many readings into one doc per time window (1 doc/hour) | high-frequency events that would explode the doc count |
| Extended reference | copy a few fields from the referenced doc (author name onto the post) | avoiding a $lookup for the common read |
| Schema versioning | stamp a schemaVersion field, migrate lazily on read/write | evolving shape with zero-downtime migration |
| Polymorphic | different shapes in one collection sharing a type discriminator | "products" that are books and shoes |
The throughline: each one trades a little duplication or write-time work to make the read that your app does constantly cheap. That's the whole game.
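Two of these, Subset and Computed, often show up in the same write. The sketch below simulates that with in-memory structures. The field names `recentComments` and `commentCount` and the limit of 10 are assumptions for the example.

```python
RECENT_LIMIT = 10  # size of the embedded "hot slice" (assumed value)

# In-memory stand-ins for the "posts" and "comments" collections.
posts = {99: {"_id": 99, "title": "Hi", "commentCount": 0, "recentComments": []}}
comments = []


def add_comment(post_id, text):
    # The full history lives in its own collection, referenced by postId.
    comments.append({"postId": post_id, "text": text})

    post = posts[post_id]
    # Computed: keep the total current at write time so reads never count.
    post["commentCount"] += 1
    # Subset: embed only the newest RECENT_LIMIT comments on the post.
    post["recentComments"].append({"text": text})
    post["recentComments"] = post["recentComments"][-RECENT_LIMIT:]


for i in range(25):
    add_comment(99, f"comment {i}")

post = posts[99]
print(post["commentCount"])  # 25
print(len(post["recentComments"]))  # 10
print(post["recentComments"][0]["text"])  # comment 15
```

Against a real MongoDB collection, the same effect is usually achieved in one update that combines `$inc` with `$push` using `$each` and `$slice`, so the counter and the trimmed array change atomically on the post document.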
A reality check
MongoDB has no enforced foreign keys: if a post points at an authorId and you later delete that author, nothing stops you. Referential integrity is your job. Enforce it in application code, and wrap multi-collection writes in a transaction when they must succeed or fail together (multi-document transactions need a replica set or sharded cluster). That's the price of flexibility, and it's why you model deliberately around your read patterns instead of normalizing for its own sake.
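Here is roughly what "your job, in application code" can look like: a guard before the delete, plus a sweep that finds posts whose author is gone. Both helpers are hypothetical and run against in-memory dicts rather than a live database.

```python
# In-memory stand-ins for the "users" and "posts" collections (illustration only).
users = {1: {"_id": 1, "name": "Avina"}}
posts = {99: {"_id": 99, "title": "Hi", "authorId": 1}}


def delete_user(user_id):
    """Refuse to delete a user that posts still point at."""
    if any(post["authorId"] == user_id for post in posts.values()):
        raise ValueError(f"user {user_id} still has posts")
    users.pop(user_id, None)


def find_orphaned_posts():
    """Posts whose authorId no longer matches any user."""
    return [post for post in posts.values() if post["authorId"] not in users]


try:
    delete_user(1)
except ValueError as err:
    print(err)  # user 1 still has posts

print(find_orphaned_posts())  # []
```

In a real deployment the check and the delete would need to run inside one transaction, otherwise a post could be inserted between the two steps.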
Recap
Embed what's read together and bounded; reference what's shared, large, or unbounded. Watch out for ever-growing arrays, and remember Mongo won't enforce relationships for you — design around the queries your app actually makes.
👉 Next: Scaling