How YouTube Recommends Videos


Back in 2008, YouTube passed Yahoo! to become the second-largest search engine in the world, behind only Google. Today, we can ask a related question: “Is YouTube about to pass Amazon as the largest and most sophisticated industrial recommendation system in existence?” This question isn’t rhetorical: we don’t know the answer, because YouTube competes fiercely with Amazon’s recommendation system.

YouTube’s suggested videos are a force multiplier for its search algorithm, so we need to understand how they work.

Earlier YouTube Recommendation Process

To maximize your presence in YouTube search and suggested videos, you need to make sure your metadata is well-optimized. This includes your video’s title, description, and tags. Most SEOs focus on the search results – because that’s what matters in Google.

How do you create metadata tags on YouTube?

Look at the top-ranking video on your topic and reuse as many of its tags as are also relevant to your own video.

Recent YouTube Recommendation Behaviour

The YouTube recommendation approach has since changed. To win repeat viewers, a video must be picked up by YouTube’s recommendation process. Most YouTube marketers know that appearing in suggested videos can generate almost as many views as appearing in YouTube’s search results.

Why? Because viewers tend to watch multiple videos during sessions that last about 40 minutes, on average. So, a viewer might conduct one search, watch a video, and then go on to watch a suggested video. In other words, you might get two or more videos viewed for each search that’s conducted on YouTube. That’s what makes suggested videos a force multiplier for YouTube’s search algorithm.

How does YouTube Recommend Videos – Lighter Approach

The YouTube Creators channel has a video entitled “How YouTube’s Suggested Videos Work”.

As the video’s 300-word description explains:

“Suggested Videos are a personalized collection of videos that an individual viewer may be interested in watching next, based on prior activity.”

“Studies of YouTube consumption have shown that viewers tend to watch a lot more when they get recommendations from a variety of channels and suggested videos do just that. Suggested Videos are ranked to maximize engagement for the viewer.”

So, optimizing your metadata still helps, but you also need to create a compelling opening to your videos, maintain and build interest throughout the video, as well as engage your audience by encouraging comments and interacting with your viewers as part of your content.

How YouTube Recommends Videos – Recommender Systems

Recommender Systems are among the most common forms of Machine Learning that users will encounter, whether they’re aware of it or not. They power curated timelines on Facebook and Twitter, and “suggested videos” on YouTube.

Recommendation was previously formulated as a matrix factorization problem that attempts to predict the rating a particular user would give a movie. Many are now approaching this problem using Deep Learning; the intuition is that non-linear combinations of features may yield a better prediction than a traditional matrix factorization approach can.

In 2016, Covington, Adams, and Sargin demonstrated the benefits of this approach with “Deep Neural Networks for YouTube Recommendations”, making Google one of the first companies to deploy production-level deep neural networks for recommender systems.

Given that YouTube is the second most visited website in the United States, with over 400 hours of content uploaded per minute, recommending fresh content is no straightforward task. In their research paper, Covington et al. demonstrate a two-stage information retrieval approach, where one network generates candidate recommendations and a second network ranks them. This approach is quite thoughtful: since recommending videos can be posed as an extreme multiclass classification problem, having one network reduce the cardinality of the task from a few million videos to a few hundred permits the ranking network to take advantage of more sophisticated features which may have been too minute for the candidate generation model to learn.


There were two main factors behind YouTube’s Deep Learning approach towards Recommender Systems:

  • Scale: Due to the immense sparsity of the user-video matrices, it’s difficult for previous matrix factorization approaches to scale across the entire feature space. Additionally, previous matrix factorization approaches have a difficult time handling a combination of categorical and continuous variables.
  • Consistency: Many other product-based teams at Google have switched to deep learning as a general framework for learning problems. Since Google Brain released TensorFlow, it has been sufficiently easy to train, test, and deploy deep neural networks in a distributed fashion.

Network Structure

There are two networks at play:

  • The candidate generation network takes the user’s activity history (e.g. IDs of videos being watched, search history, and user-level demographics) and outputs a few hundred videos that might broadly apply to the user. The general idea is that this network should optimize for precision; each instance should be highly relevant, even if it requires forgoing some items which may be widely popular but irrelevant.
  • In contrast, the ranking network takes a richer set of features for each video, and scores each item from the candidate generation network. For this network, it’s important to have a high recall; it’s okay for some recommendations to not be very relevant as long as you’re not missing the most relevant items.
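To make the division of labour concrete, here is a minimal sketch of a two-stage pipeline. It is not YouTube’s implementation: the video names, the two-dimensional embeddings, and the `EXPECTED_MINUTES` table are all made up for illustration, and a plain dot product stands in for the candidate generation network.

```python
# Hypothetical toy corpus: every name and number below is invented.
VIDEO_EMBEDDINGS = {
    "cooking_101": [0.9, 0.1],
    "knife_skills": [0.8, 0.3],
    "guitar_basics": [0.1, 0.9],
    "viral_prank": [0.5, 0.5],
}

# A richer per-video signal that only the ranking stage consults (assumed).
EXPECTED_MINUTES = {
    "cooking_101": 6.0,
    "knife_skills": 9.0,
    "guitar_basics": 4.0,
    "viral_prank": 1.0,
}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def generate_candidates(user_vector, k):
    """Stage 1: cheap similarity score over the whole corpus, keep the top k."""
    by_similarity = sorted(
        VIDEO_EMBEDDINGS,
        key=lambda video: dot(user_vector, VIDEO_EMBEDDINGS[video]),
        reverse=True,
    )
    return by_similarity[:k]


def rank(candidates):
    """Stage 2: re-score only the shortlist using the richer signal."""
    return sorted(candidates, key=lambda video: EXPECTED_MINUTES[video], reverse=True)


user_vector = [1.0, 0.0]  # a user who mostly watches cooking videos
shortlist = generate_candidates(user_vector, k=2)
recommendations = rank(shortlist)
```

The expensive scoring in `rank` only ever sees the shortlist, which is the point of the split: the first stage can afford to look at everything because it does very little per video.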

On the whole, this network is trained end-to-end, with the training and test sets built from held-out data. In other words, the network is given a user’s watch history until some time t, and is asked what they would like to watch at time t+1! The authors believe this was among the best ways to recommend videos, given the episodic nature of videos on YouTube.
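The “history up to t, predict t+1” setup can be sketched as a small function that slices one user’s chronological watch list into training pairs. The function name and the video IDs are hypothetical; the paper’s actual data pipeline is not described at this level of detail.

```python
def next_watch_examples(watch_history):
    """Turn one user's chronological watch list into (history, label) pairs.

    Each pair contains everything watched before time t as the input
    and the video watched at time t as the label to predict.
    """
    examples = []
    for t in range(1, len(watch_history)):
        examples.append((watch_history[:t], watch_history[t]))
    return examples


history = ["v1", "v2", "v3", "v4"]  # hypothetical video IDs, oldest first
examples = next_watch_examples(history)
```

Because each input only contains watches that happened before its label, the model never sees information from the future.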

Performance Hacks

In both the candidate generation and candidate ranking networks, the authors leverage various tricks to reduce dimensionality or improve the model’s performance. We discuss these here, as they’re relevant to both models.

First, they trained a subnetwork to transform sparse features (such as video IDs, search tokens, and user IDs) into dense features by learning an embedding for these features. This embedding is learned jointly with the rest of the model parameters via gradient descent.
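As a rough illustration of what “learning an embedding” means, the sketch below keeps one trainable dense vector per sparse ID and nudges only the looked-up row during a gradient step. The vocabulary, the dimension, the gradient values, and the helper names are assumptions for illustration; a real model would get the gradient by backpropagating through the rest of the network.

```python
import random

random.seed(0)
EMBED_DIM = 4
vocab = ["video_a", "video_b", "video_c"]  # hypothetical sparse IDs

# One trainable dense vector per ID, initialised with small random values.
table = {vid: [random.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)] for vid in vocab}


def embed(video_id):
    """Sparse ID in, dense vector out."""
    return table[video_id]


def sgd_step(video_id, grad, lr=0.1):
    """Update only the row that was looked up, moving against its gradient."""
    table[video_id] = [w - lr * g for w, g in zip(table[video_id], grad)]


before = list(embed("video_a"))
untouched = list(embed("video_b"))
sgd_step("video_a", grad=[1.0, 0.0, 0.0, 0.0])  # made-up gradient
after = embed("video_a")
```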

Secondly, to help with the exploitation/exploration problem, they feed the age of the training example as a feature. This helps overcome the implicit bias towards stale content, which arises because the model otherwise learns the average watch likelihood over the training window. At serving time, they simply set the age of the example to zero to compensate for this factor.
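A minimal sketch of the example-age trick, assuming age is measured in days back from the end of the training window. The function names, timestamps, and base features are hypothetical.

```python
SECONDS_PER_DAY = 86400.0


def example_age_days(event_time, training_end):
    """Age of a training example, measured back from the end of the training window."""
    return (training_end - event_time) / SECONDS_PER_DAY


def training_features(base_features, event_time, training_end):
    # During training the true age is appended as one more input feature.
    return base_features + [example_age_days(event_time, training_end)]


def serving_features(base_features):
    # At serving time the age is pinned to zero: "predict as of right now".
    return base_features + [0.0]


training_end = 1_000_000  # hypothetical Unix timestamp, in seconds
row = training_features(
    [0.3, 0.7],
    event_time=training_end - 2 * 86400,
    training_end=training_end,
)
live = serving_features([0.3, 0.7])
```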

Ranking the Predictions

The fundamental idea behind partitioning the recommender system into two networks is that it lets the ranking network examine each video with a finer-toothed comb than the candidate generation model could.

For example, the candidate generation model may only have access to features such as video embedding, and the number of watches. In contrast, the ranking network can take features such as the thumbnail image and the interest of their peers to provide a much more accurate scoring.

The objective of the ranking network is to maximize the expected watch time for any given recommendation. Covington et al. chose to optimize for watch time rather than the probability of a click, because “clickbait” titles are common in videos.

Similar to the candidate generation network, the authors use embedding spaces to map sparse categorical features into dense representations. Any features which relate to multiple items (e.g. searches over multiple video IDs) are averaged before being fed into the network. However, categorical features which depend upon the same underlying ID space (e.g. video ID of the impression, last video ID watched) share a single set of embeddings to save memory and runtime.
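The two ideas in this paragraph, averaging multi-valued features and sharing one embedding table across features, can be sketched as follows. The table contents and feature names are invented for illustration.

```python
# One embedding table reused by every feature that is "a video ID" (toy values).
SHARED_VIDEO_TABLE = {
    "v1": [1.0, 0.0],
    "v2": [0.0, 1.0],
    "v3": [1.0, 1.0],
}


def average_embedding(video_ids):
    """Collapse a variable-length bag of IDs into one fixed-size vector."""
    vectors = [SHARED_VIDEO_TABLE[v] for v in video_ids]
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


impression_vec = SHARED_VIDEO_TABLE["v3"]             # single-valued feature
last_watched_vec = SHARED_VIDEO_TABLE["v1"]           # same table, different feature
recent_watches_vec = average_embedding(["v1", "v2"])  # multi-valued feature

# Each feature still gets its own slot in the network input.
network_input = impression_vec + last_watched_vec + recent_watches_vec
```

Averaging gives the network a fixed-width input no matter how many videos are in the bag, and sharing the table means a video’s vector is stored and learned once rather than once per feature.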

As far as continuous features go, they’re normalized in two ways.

  • First, each feature is scaled into [0, 1) using its cumulative distribution, so that the normalized values are spread uniformly over that range.
  • Secondly, in addition to the normalized value x, the powers sqrt(x) and x^2 are also fed in. This permits the model to create super- and sub-linear functions of each feature, which the authors found to improve offline accuracy.
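The two normalization steps above can be sketched with an empirical CDF built from training values. This is an approximation under stated assumptions: the helper names and sample values are hypothetical, and the clamp that keeps unseen large values below 1 is my own choice rather than something the paper specifies.

```python
import bisect
import math


def fit_cdf(training_values):
    """Return a function mapping a raw value to its empirical CDF value in [0, 1)."""
    ordered = sorted(training_values)
    n = len(ordered)

    def cdf(x):
        # Count training values strictly below x; clamp so the result stays below 1.
        rank = min(bisect.bisect_left(ordered, x), n - 1)
        return rank / n

    return cdf


def expand(x_norm):
    """Feed the normalized value plus a sub-linear and a super-linear version of it."""
    return [x_norm, math.sqrt(x_norm), x_norm ** 2]


cdf = fit_cdf([3, 10, 50, 200])  # e.g. hypothetical "previous impressions" counts
features = expand(cdf(50))
```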

To predict expected watch time, the authors used weighted logistic regression. Clicked impressions were weighted by their observed watch time, whereas negative (unclicked) examples all received unit weight. In practice, the odds learned by the model are approximately E[T](1 + P), where E[T] is the expected watch time of the impression and P is the probability of clicking the video. Since P is small, this is close to E[T].
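A quick numeric check shows why this weighting makes the learned odds track watch time. The sketch considers the simplest case with no input features, where the weighted odds reduce to the total watch time of clicked impressions divided by the number of unclicked ones. The watch times and impression count are made up.

```python
def learned_odds(watch_times, num_impressions):
    """Weighted odds when positives are weighted by watch time and negatives by 1."""
    k = len(watch_times)  # number of clicked impressions
    return sum(watch_times) / (num_impressions - k)


watch_times = [120.0, 300.0, 180.0]  # seconds, for 3 clicked impressions (toy data)
N = 100                              # total impressions shown

odds = learned_odds(watch_times, N)
P = len(watch_times) / N             # click probability
expected_T = sum(watch_times) / N    # expected watch time per impression
approx = expected_T * (1 + P)        # the approximation E[T](1 + P)
```

With these numbers the exact odds are 600 / 97 and the approximation is 6.18, so the two agree to within about 0.01; the gap shrinks further as P gets smaller.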

Finally, the authors demonstrated the impact of a wider and deeper network on per-user loss. The per-user loss was the total amount of mispredicted watch time as a fraction of the total watch time on held-out data. This means the model predicts a proxy for a good recommendation, rather than predicting a good recommendation itself.
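One way to compute such a loss is sketched below. It assumes a pairwise reading of “mispredicted”: whenever an unclicked impression is scored above a clicked one, the clicked impression’s watch time counts as mispredicted. That reading, the function name, and the sample scores are assumptions for illustration.

```python
def weighted_per_user_loss(impressions):
    """impressions: list of (score, watch_time); a watch_time of 0 means not clicked.

    Returns mispredicted watch time as a fraction of total watch time,
    summed over all (clicked, unclicked) impression pairs.
    """
    positives = [(score, wt) for score, wt in impressions if wt > 0]
    negatives = [score for score, wt in impressions if wt == 0]

    mispredicted = 0.0
    total = 0.0
    for pos_score, watch_time in positives:
        for neg_score in negatives:
            total += watch_time
            if neg_score > pos_score:
                # An unclicked video outranked a watched one.
                mispredicted += watch_time
    return mispredicted / total if total else 0.0


# One user's held-out page: (model score, observed watch time in seconds).
page = [(0.9, 300.0), (0.2, 100.0), (0.5, 0.0), (0.1, 0.0)]
loss = weighted_per_user_loss(page)
```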


“Deep Neural Networks for YouTube Recommendations” was one of the first papers to highlight the advancements that Deep Learning may provide for Recommender Systems, and appeared in ACM’s 2016 Conference on Recommender Systems. It laid the foundation for many papers afterward. It has been a fantastic journey for YouTube over the past decade to improve the recommendation process, which in turn helps to retain viewers. Statistics suggest that the YouTube mobile app has, to a great extent, replaced television viewing around the world. This was no simple task, and the people who made it happen deserve sincere appreciation.

“We will soon trade in our clunky flat screens for its handheld cousin, the smartphone and its YouTube app.”