Data Modeling for YouTube-Like Platforms: A Practical Guide

Backend-focused web developer and engineering student.


Understanding Data Modeling {#understanding}

Data modeling isn't about memorizing a list of fields. It's about translating how your platform actually works into database structure. When you're designing a YouTube clone, you're really just answering three straightforward questions:

  • What are the main things in your system? These become your collections or tables.

  • What information do you actually need to store about each thing? Those become your fields.

  • How do these things connect? Those become your relationships and references.

The key insight is to think in terms of nouns and actions. What objects exist in your system, and what can users do with them?


Core Entities in Your System {#entities}

Before you write a single schema, step back and ask: what would exist in this system even if nobody was using it right now?

For a platform like YouTube, you've got users who upload videos. Those videos get comments. People like things. Users subscribe to creators. Some platforms have posts or tweets. Each of these is a real thing that needs its own collection.

So you end up with:

  • users

  • videos

  • comments

  • likes

  • tweets

  • subscriptions

That's the foundation. Everything else flows from these core entities.


Deciding What Fields Actually Matter {#fields}

Here's where beginners get stuck. They either overthink it and create 50 fields, or they're paralyzed trying to guess what to include. The real approach is simpler: you figure out what fields you need by looking at your actual features and requirements.

Take users as an example. What information do you actually need about a user? You need something to identify them and let them log in—so username and email. You need to verify they're who they claim to be—so you're storing a hashed password. You want to show their profile picture and cover image. For authentication, you're probably using refresh tokens. You'll want to know when their account was created.

Once you're running the platform for a bit, you might realize you want to store watch history. That becomes an array of video IDs that this user has watched. But you add that only when you actually need it.

{
  username: String,
  email: String,
  password: String,
  avatar: String,
  coverImage: String,
  refreshToken: String,
  watchHistory: [ObjectId],
  createdAt: Date,
  updatedAt: Date
}

Notice what's missing? You don't store the user's video count, subscriber count, or anything that can be calculated. Those are derived values.

For videos, you ask similar questions. Who uploaded this? How do we play it? How do we show it to people?

{
  videoFile: String,
  thumbnail: String,
  title: String,
  description: String,
  duration: Number,
  views: Number,
  isPublished: Boolean,
  owner: ObjectId,
  createdAt: Date,
  updatedAt: Date
}

The critical part here is that owner is just an ObjectId pointing to a user. You don't duplicate the user's information inside the video. That would be a mess to maintain.
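To see what the reference buys you, here is a minimal in-memory sketch in plain JavaScript, with no database involved. The `users` and `videos` maps, the sample IDs, and the `getVideoWithOwner` helper are hypothetical names used only for illustration; against a real database the same lookup would be a join-style query.

```javascript
// In-memory stand-ins for the users and videos collections (illustrative only).
const users = new Map([
  ["u1", { _id: "u1", username: "alice", avatar: "alice.png" }],
]);
const videos = new Map([
  ["v1", { _id: "v1", title: "Intro to Schemas", owner: "u1" }],
  ["v2", { _id: "v2", title: "Indexes 101", owner: "u1" }],
]);

// Resolve the owner reference at read time instead of copying user data into the video.
function getVideoWithOwner(videoId) {
  const video = videos.get(videoId);
  if (!video) return null;
  const owner = users.get(video.owner) ?? null;
  return { ...video, owner };
}

// Renaming the user is a single write; every video picks it up on the next read.
users.get("u1").username = "alice_dev";
```

Had the username been copied into each video, that rename would have required one extra write per video.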


Handling Relationships {#relationships}

This is where the real design decisions happen. You've got different kinds of relationships, and they determine how you structure your data.

A user creates many videos, so the video document stores which user created it. One video gets many comments. So each comment document stores which video it belongs to. Since a user can comment on multiple things, each comment also stores which user wrote it.

This is why comments reference both the video and the user. It's not redundant; it's necessary for the relationship.

{
  content: String,
  owner: ObjectId,
  video: ObjectId,
  createdAt: Date,
  updatedAt: Date
}

With this structure, you can easily find all comments on a video, or all comments a specific user has made. You can't do both efficiently if you embed comments inside videos.
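As a rough illustration of those two lookups, the sketch below uses a plain array as a stand-in for the comments collection. The sample data and the helper names `commentsOnVideo` and `commentsByUser` are assumptions made for this example, not part of any library.

```javascript
// Stand-in for the comments collection: each comment references its video and its author.
const comments = [
  { _id: "c1", content: "Great video", owner: "u1", video: "v1" },
  { _id: "c2", content: "Thanks!", owner: "u2", video: "v1" },
  { _id: "c3", content: "Very clear", owner: "u1", video: "v2" },
];

// Each lookup is a filter on a single field; in a real database each would be backed by an index.
const commentsOnVideo = (videoId) => comments.filter((c) => c.video === videoId);
const commentsByUser = (userId) => comments.filter((c) => c.owner === userId);
```

Both directions are equally cheap because neither the video nor the user "owns" the list of comments.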


Why Likes Needs Its Own Collection {#likes}

A lot of people new to this ask why likes isn't just stored inside the video. The answer becomes obvious once you think about the actual patterns in your system.

A user can like many things, and a video can be liked by many users. Comments and tweets can be liked too. This is a many-to-many relationship between users and the things they like, and the usual way to model it is a separate collection with one document per like.

{
  likedBy: ObjectId,
  video: ObjectId,
  comment: ObjectId,
  tweet: ObjectId,
  createdAt: Date
}

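This design has real advantages. You can easily check whether a specific user liked a specific video. You can prevent duplicate likes using a unique index on (likedBy, video). Because the same collection also holds likes on comments and tweets, make that index partial so it only covers documents where video is set; otherwise two comment likes from the same user would both be indexed as (likedBy, null) and collide. You can count likes with a simple query. And when a user unlikes something, it's just a delete operation.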

If you tried to store likes inside the video as an array, you'd end up with massive documents that grow unbounded. You'd also struggle to answer questions like "what did user X like?" because you'd have to scan every video.
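The operations described above can be sketched in memory. This is a simplified illustration, not a database implementation: the `likes` map keyed by user and video plays the role of the unique index, and every function name here is hypothetical.

```javascript
// In-memory likes store keyed by "user:video", mimicking a unique constraint on (likedBy, video).
const likes = new Map();
const likeKey = (userId, videoId) => `${userId}:${videoId}`;

function likeVideo(userId, videoId) {
  const key = likeKey(userId, videoId);
  if (likes.has(key)) return false; // duplicate like rejected
  likes.set(key, { likedBy: userId, video: videoId, createdAt: new Date() });
  return true;
}

// Unliking is just a delete; returns false if there was nothing to remove.
function unlikeVideo(userId, videoId) {
  return likes.delete(likeKey(userId, videoId));
}

const hasLiked = (userId, videoId) => likes.has(likeKey(userId, videoId));

const countLikes = (videoId) =>
  [...likes.values()].filter((l) => l.video === videoId).length;

// "What did user X like?" is a filter on likedBy, not a scan over every video.
const videosLikedBy = (userId) =>
  [...likes.values()].filter((l) => l.likedBy === userId).map((l) => l.video);
```

A second `likeVideo("u1", "v1")` call returns `false` instead of creating a duplicate, which is the behavior the unique index gives you at the database level.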


Derived Data vs Stored Data {#derived}

The temptation to store everything is real. But some things shouldn't be stored; they should be calculated when you need them.

Take likes count. You might think you should store likesCount inside the video document. Don't. Calculate it instead.

likesCount = db.likes.countDocuments({ video: videoId })

Why? Because if you store it, you have to update it every time someone likes or unlikes, and that update has to stay in step with the likes collection. If your code reads the count, adds one, and writes it back, two requests that arrive at the same time can both read the same value, and one update is lost. An atomic increment avoids that particular bug, but the counter can still drift whenever the like is written and the counter update fails, or the other way around.

By keeping it derived, the likes collection stays the single source of truth and the count is always correct. If the query gets slow later, you can add caching, but the cached value is always recalculated from the likes collection.
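The lost update is easier to see when the interleaving is written out step by step. The sketch below simulates two overlapping requests synchronously; the `video` object, the `likes` array, and the variable names are illustrative assumptions, and no real concurrency is involved.

```javascript
// A stored counter updated with read-modify-write, next to the likes it summarizes.
const video = { _id: "v1", likesCount: 0 };
const likes = [];

// Two requests interleave: both read the counter before either writes it back.
const readByRequestA = video.likesCount; // reads 0
const readByRequestB = video.likesCount; // also reads 0

likes.push({ likedBy: "u1", video: "v1" });
video.likesCount = readByRequestA + 1; // writes 1

likes.push({ likedBy: "u2", video: "v1" });
video.likesCount = readByRequestB + 1; // writes 1 again, so one update is lost

// The derived count is computed from the likes themselves, so it cannot drift.
const derivedCount = likes.filter((l) => l.video === video._id).length;
```

Here the stored counter ends at 1 while two likes exist. An atomic increment such as MongoDB's `$inc` operator would avoid this specific interleaving, but the derived count needs no such care.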


Social Features and Polymorphism {#social}

If your platform has tweets or posts alongside videos, you can reuse the same likes collection. This is called polymorphic association—one Like can point to different types of things.

{
  likedBy: ObjectId,
  video: ObjectId,
  comment: ObjectId,
  tweet: ObjectId,
  createdAt: Date
}

For any given like, exactly one of video, comment, or tweet has a value, and that field tells you what was liked; the other two are null or simply absent. This keeps your backend clean instead of having separate collections for likes on videos, likes on comments, and likes on tweets.
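Since nothing in the schema itself stops a like from pointing at two things, or at nothing, it can help to check that rule in application code. The following is a minimal sketch of such a check; `TARGET_FIELDS` and `likeTarget` are hypothetical names introduced for this example.

```javascript
// The fields a like may point at; exactly one of them must be set.
const TARGET_FIELDS = ["video", "comment", "tweet"];

// Returns the name of the single target field that is set,
// or throws if zero or several targets are set.
function likeTarget(like) {
  const set = TARGET_FIELDS.filter((field) => like[field] != null);
  if (set.length !== 1) {
    throw new Error(`A like must reference exactly one target, got ${set.length}`);
  }
  return set[0];
}
```

Calling `likeTarget` before saving a like is one way to keep invalid combinations out of the collection; the loose `!= null` comparison treats both `null` and missing fields as unset.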


User-to-User Relationships {#userrelationships}

Subscriptions are interesting because they're a relationship between users. When someone subscribes to a creator, you're storing a link from one user to another.

{
  subscriber: ObjectId,
  channel: ObjectId,
  createdAt: Date
}

You never store an array of subscribers inside the user document, especially not in systems that might grow large. A separate subscriptions collection lets you scale without issues. You can index on both fields and query efficiently in either direction.
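Querying "in either direction" looks like this in a small in-memory sketch. The array stands in for the subscriptions collection, and the helper names are assumptions made for illustration.

```javascript
// Stand-in for the subscriptions collection: one document per (subscriber, channel) pair.
const subscriptions = [
  { subscriber: "u1", channel: "u3" },
  { subscriber: "u2", channel: "u3" },
  { subscriber: "u1", channel: "u2" },
];

// Who subscribes to this channel?
const subscribersOf = (channelId) =>
  subscriptions.filter((s) => s.channel === channelId).map((s) => s.subscriber);

// Which channels does this user follow?
const channelsFollowedBy = (userId) =>
  subscriptions.filter((s) => s.subscriber === userId).map((s) => s.channel);

// The subscriber count is derived, not stored on the user.
const subscriberCount = (channelId) => subscribersOf(channelId).length;
```

Each helper filters on one field, which is why an index on each of `subscriber` and `channel` is enough to serve both questions.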


Why References Beat Embedding {#references}

Embedding—putting all related data inside one document—feels convenient at first. But it breaks down quickly.

If you store all comments inside a video document:

{
  videoFile: String,
  title: String,
  comments: [
    // thousands of comments here
  ]
}

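You end up with massive documents, and in MongoDB a single document is capped at 16 MB, so a popular video eventually hits a hard limit. Pagination means slicing an array inside the document instead of running an ordinary query. Fetching a video pulls in all its comments unless you remember to exclude them. Every new or edited comment is a write to the video document. Indexes on comment fields become less useful.

The better approach is to keep the video document small and let each comment store a reference to the video: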

{
  videoFile: String,
  title: String
}

And comments reference the video. Now each entity is independent. Comments can grow without limit. You can fetch comments separately, paginate them, index them properly. Updates are fast and targeted.
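Pagination is the clearest payoff of keeping comments separate. The sketch below pages through an in-memory array that stands in for the comments collection; the generated sample data, the `getCommentsPage` helper, and the default page size of 10 are all assumptions for illustration.

```javascript
// 25 hypothetical comments on one video, stored separately from the video itself.
const comments = Array.from({ length: 25 }, (_, i) => ({
  _id: `c${i + 1}`,
  video: "v1",
  content: `Comment ${i + 1}`,
  createdAt: new Date(2024, 0, 1, 0, i), // one minute apart
}));

// Newest-first page of comments for a video; page numbers start at 1.
function getCommentsPage(videoId, page, pageSize = 10) {
  const matching = comments
    .filter((c) => c.video === videoId)
    .sort((a, b) => b.createdAt - a.createdAt);
  const start = (page - 1) * pageSize;
  return matching.slice(start, start + pageSize);
}
```

The video document is never touched here: loading page 3 of the comments does not require loading the video or the first 20 comments' worth of embedded data.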


Making Design Decisions as You Build {#decisions}

You don't design everything upfront. You start minimal and add as you go.

When a new feature comes up, think through it plainly. Say users want to like videos. Ask yourself: Who performs this action? A user. On what? A video. Can it happen multiple times? Not for the same pair: once per user per video. Do we need history? Maybe later, but not now.

Based on those answers, you know you need a likes collection with at minimum these fields:

{
  likedBy: ObjectId,
  video: ObjectId,
  createdAt: Date
}

That's all you need. Add more fields only when you discover you actually need them.

The mental model to keep is simple: collections represent real entities, fields store the data those entities need, references show how entities connect. Avoid storing the same information in multiple places. Optimize for correctness first; performance optimizations come later once you actually know what's slow.


Watch Later vs Playlist: A Practical Example {#watchlater}

This question comes up all the time: should watch later be a separate collection, or use the playlist collection?

The answer depends on whether they're actually different. Let's compare.

A playlist belongs to a user, has a name, stores multiple videos, lets you add and remove videos, and the order might matter. A watch later list also belongs to a user, has a name (Watch Later), stores videos, lets you add and remove videos, and order could matter.

Structurally and behaviorally, they're identical. The only difference is intent. One is called "My Workout Videos" and the other is called "Watch Later." But the underlying data structure is the same.

So use one collection and distinguish them with a field:

{
  name: String,
  description: String,
  videos: [ObjectId],
  owner: ObjectId,
  type: "NORMAL" | "WATCH_LATER",
  isPrivate: Boolean,
  createdAt: Date,
  updatedAt: Date
}

For watch later, you'd create:

{
  name: "Watch Later",
  type: "WATCH_LATER",
  owner: userId
}

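Each user has exactly one Watch Later playlist. You can enforce this with a partial unique index on (owner, type) that only applies to documents where type is "WATCH_LATER", so users can still own any number of normal playlists. Alternatively, handle it at the application level by auto-creating the playlist when a user signs up.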

Why is this better than separate collections? Because you avoid duplication. The logic to add a video, remove a video, paginate videos—that all works for both now. Your APIs are simpler. Your code has fewer bugs. Your analytics are easier.
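Here is a minimal in-memory sketch of that shared logic, including the one-Watch-Later-per-user rule. The `playlists` array and the function names `createPlaylist`, `getOrCreateWatchLater`, `addVideo`, and `removeVideo` are hypothetical; in a real system the uniqueness rule would live in the database index rather than in this check.

```javascript
// Stand-in for the playlists collection.
const playlists = [];

const isWatchLaterOf = (owner) => (p) => p.owner === owner && p.type === "WATCH_LATER";

// Create a playlist, rejecting a second WATCH_LATER for the same owner
// (the in-memory counterpart of a partial unique constraint on owner + type).
function createPlaylist({ owner, name, type = "NORMAL" }) {
  if (type === "WATCH_LATER" && playlists.some(isWatchLaterOf(owner))) {
    throw new Error("User already has a Watch Later playlist");
  }
  const playlist = { owner, name, type, videos: [] };
  playlists.push(playlist);
  return playlist;
}

// Return the user's Watch Later playlist, creating it on first use.
function getOrCreateWatchLater(owner) {
  return (
    playlists.find(isWatchLaterOf(owner)) ??
    createPlaylist({ owner, name: "Watch Later", type: "WATCH_LATER" })
  );
}

// One add/remove implementation serves both kinds of playlist.
function addVideo(playlist, videoId) {
  if (!playlist.videos.includes(videoId)) playlist.videos.push(videoId);
}

function removeVideo(playlist, videoId) {
  playlist.videos = playlist.videos.filter((id) => id !== videoId);
}
```

`addVideo` and `removeVideo` never look at `type`, which is the point: the Watch Later list gets every playlist feature for free.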

Real platforms appear to do something similar. On YouTube, for example, Watch Later and Liked Videos show up as system-managed playlists with reserved names. It's a clean pattern that scales.

The key principle: same structure plus same behavior equals same collection. A different name doesn't mean a different data model.


FAQ

Q: How many fields should a collection have? A: Only what you need for the features you're building. Start with the minimum. If videos need a thumbnail URL today but nothing in your product uses a separate upload date yet, store the thumbnail and skip the date. Add fields when you actually need them.

Q: Should I embed user information inside comments, or just store the user ID? A: Store just the user ID. If you need the user's name or avatar when displaying a comment, fetch it separately or use a join. Embedding creates sync problems—if a user changes their name, you'd have to update it everywhere it's embedded.

Q: What if I need to show the count of something, like likes count? A: Calculate it when you need it. Count the likes collection. If that becomes slow at scale, add caching or denormalization, but your source of truth stays the likes collection.

Q: Can I store an array of user IDs to track subscribers instead of a separate collection? A: Not at scale. The array grows with every subscriber, every subscribe or unsubscribe becomes a write to the creator's user document, and MongoDB caps a single document at 16 MB. A separate subscriptions collection is better. Use indexes to make queries fast.

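Q: How do I prevent someone from liking the same video twice? A: Use a unique index on the likes collection: (likedBy, video). If the collection also stores likes on comments and tweets, make the index partial so it only applies to documents that have a video. The database then prevents duplicates at the data layer.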

Q: Should I store video view count and update it every time someone watches? A: You can, as long as the update is an atomic increment (MongoDB's $inc) rather than a read-modify-write in your app, which loses updates when requests overlap. For accuracy, consider views as derived: count them from a views/history collection instead of storing a counter. If you're storing a counter for performance, accept that it might drift slightly from the underlying events.

Q: What's the difference between ObjectId and embedding? A: An ObjectId is a document's unique ID. Storing another document's ObjectId in a field creates a reference, which is just a pointer to that document. Embedding copies the whole document inside another. References are lightweight; embedding duplicated data creates bloat and sync issues.

Q: How do I know when to create a new collection? A: When you have a distinct entity that can stand alone and might grow independently. Users, videos, comments—each is distinct. Likes is distinct because the same like behavior applies to videos, comments, and tweets. System metadata that belongs to something else doesn't need its own collection.

Q: Should comments be a separate collection or embedded in videos? A: Separate collection. Videos can have unlimited comments. You need to paginate comments, fetch just new ones, delete specific comments without touching the video. Separation solves all these problems.

Q: Can I use the same likes collection for liking videos, comments, and tweets? A: Yes, that's polymorphic association. Store references to all three types, and null out the ones that don't apply. It works well in practice.

Q: What if I want to store the number of videos a user has uploaded? A: Count videos where owner matches the user ID. Don't store a counter in the user document. It gets out of sync.
