Building Recommender Systems - Part 1: Foundations
Published on December 16, 2025
Introduction
Recommender systems are everywhere—from Netflix suggesting your next binge-worthy show to Amazon recommending products you didn't know you needed. These systems power personalization at scale, driving engagement and revenue for businesses worldwide.
In this three-part series, we'll build recommender systems from the ground up, starting with fundamentals and progressing to advanced techniques used in production. By the end, you'll understand not just the theory, but how to implement these systems in Python with real datasets.
Series Roadmap:
- Part 1 (this post): foundations - content-based and collaborative filtering from scratch
- Parts 2 and 3: hybrid approaches and the advanced techniques used in production
What is a Recommender System?
A recommender system predicts user preferences for items they haven't interacted with yet. The goal is to surface relevant content that maximizes user engagement and satisfaction.
Real-World Examples
Netflix: "Because you watched Stranger Things..."
Amazon: "Customers who bought this also bought..."
Spotify: Discover Weekly playlist
LinkedIn: "People You May Know"
Types of Recommender Systems
1. Content-Based Filtering
Core Idea: Recommend items similar to what the user liked in the past.
How it works:
- Build a profile for each item from its attributes (genres, cast, year, description)
- Compare candidate items against the items the user has already liked
- Recommend the closest matches
Example: If you liked "The Matrix" (sci-fi, action, 1999), recommend "Inception" (sci-fi, action, 2010)
Advantages:
- Needs no data about other users
- Can recommend new or niche items as long as they have metadata
- Recommendations are easy to explain ("because you liked sci-fi")
Limitations:
- Depends on good item metadata
- Tends to over-specialize: users rarely see items outside their existing tastes
- Cold start for brand-new users with no history
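The core computation can be sketched in a few lines. This toy example uses made-up one-hot genre vectors (not the MovieLens features we build below); cosine similarity then scores how much the genre profiles overlap:

```python
import numpy as np

# Made-up one-hot genre vectors: [sci-fi, action, comedy, romance]
the_matrix = np.array([1, 1, 0, 0])  # The Matrix: sci-fi, action
inception = np.array([1, 1, 0, 0])   # Inception: sci-fi, action
notebook = np.array([0, 0, 0, 1])    # The Notebook: romance

def cosine(a, b):
    # Cosine similarity: dot product over the product of vector lengths
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(the_matrix, inception))  # 1.0 -> identical genre profiles
print(cosine(the_matrix, notebook))   # 0.0 -> no genre overlap
```

With real data the vectors are TF-IDF weighted rather than binary, but the similarity computation is exactly this.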
2. Collaborative Filtering
Core Idea: Recommend items that similar users liked.

*Figure: collaborative filtering diagram*
How it works:
- Find users whose rating patterns resemble the target user's
- Recommend items those similar users rated highly but the target user hasn't seen
Example: Users who liked movies A, B, and C also liked movie D → recommend D
Advantages:
- Needs only interaction data, no item metadata
- Can surface serendipitous items outside a user's usual tastes
Limitations:
- Cold start for both new users and new items
- Struggles when the user-item matrix is very sparse
- Similarity computation gets expensive at scale
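The "users who liked A, B, and C also liked D" logic can be sketched with a tiny made-up table of likes. Here we use Jaccard similarity between users for brevity (a simplification of the cosine-based version we implement later):

```python
# Made-up likes per user (sets of item IDs)
likes = {
    "alice": {"A", "B", "C"},
    "bob":   {"A", "B", "C", "D"},
    "carol": {"A", "B", "D"},
}

target = "alice"

def jaccard(s, t):
    # Overlap of two sets relative to their union
    return len(s & t) / len(s | t)

# Similarity between the target and every other user
sims = {u: jaccard(likes[target], items)
        for u, items in likes.items() if u != target}

# Score items the target hasn't seen by the similarity of users who liked them
scores = {}
for user, sim in sims.items():
    for item in likes[user] - likes[target]:
        scores[item] = scores.get(item, 0.0) + sim

print(max(scores, key=scores.get))  # D
```

Bob (similarity 0.75) and Carol (0.5) both liked D, so D gets the top score.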
3. Hybrid Approaches
Combine multiple techniques to get the best of both worlds. We'll cover these in Parts 2 and 3.
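As a taste of what's coming, one of the simplest hybrids is a weighted blend of the two models' scores. This is just a sketch with made-up numbers (the blend weight `alpha` would be tuned on validation data), not the specific method later parts will use:

```python
# Hypothetical per-item scores from the two models, already scaled to [0, 1]
content_scores = {"D": 0.9, "E": 0.4}
collab_scores = {"D": 0.6, "E": 0.8}

alpha = 0.5  # blend weight; tune on validation data

# Weighted blend of the two score dictionaries
hybrid = {item: round(alpha * content_scores[item]
                      + (1 - alpha) * collab_scores[item], 3)
          for item in content_scores}
print(hybrid)  # {'D': 0.75, 'E': 0.6}
```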
Building a Content-Based Recommender
Let's build a movie recommender using the MovieLens dataset. We'll use movie features like genres, directors, and cast.
Step 1: Setup and Data Loading
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Load MovieLens data
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')
# Sample data structure
# movies: movieId, title, genres
# ratings: userId, movieId, rating, timestamp
print(movies.head())
# movieId title genres
# 0 1 Toy Story (1995) Adventure|Animation|Children
# 1 2 Jumanji (1995) Adventure|Children|Fantasy
# 2 3 Grumpier Old Men (1995) Comedy|Romance
Step 2: Create Item Profiles
We'll use TF-IDF to convert genres into numerical vectors.
# Create a combined feature from genres (you can add more features)
movies['features'] = movies['genres'].str.replace('|', ' ', regex=False)  # regex=False: treat '|' as a literal, not regex alternation
# Create TF-IDF vectors
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(movies['features'])
print(f"TF-IDF Matrix Shape: {tfidf_matrix.shape}")
# Output: TF-IDF Matrix Shape: (9742, 20)
# 9742 movies, 20 unique genre tokens
Step 3: Calculate Item Similarity
# Compute cosine similarity between all movie pairs
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
# Create a mapping from movie titles to indices
indices = pd.Series(movies.index, index=movies['title']).drop_duplicates()
print(f"Similarity matrix shape: {cosine_sim.shape}")
# Output: (9742, 9742)
Step 4: Generate Recommendations
def get_content_recommendations(title, top_n=10):
    """
    Get top N movie recommendations based on content similarity

    Args:
        title: Movie title to find recommendations for
        top_n: Number of recommendations to return

    Returns:
        DataFrame of recommended titles, genres, and similarity scores
    """
    # Get the index of the movie
    idx = indices[title]
    # Get similarity scores between this movie and all others
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort movies by similarity score
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the top N most similar movies (position 0 is the movie itself)
    sim_scores = sim_scores[1:top_n+1]
    # Get movie indices
    movie_indices = [i[0] for i in sim_scores]
    # Return movie titles and scores
    recommendations = movies.iloc[movie_indices][['title', 'genres']].copy()
    recommendations['similarity_score'] = [score[1] for score in sim_scores]
    return recommendations
# Test the recommender
recommendations = get_content_recommendations('Toy Story (1995)', top_n=5)
print(recommendations)
Expected Output:
title genres similarity_score
Toy Story 2 (1999) Adventure|Animation|Children 0.92
Antz (1998) Adventure|Animation|Children 0.87
Monsters, Inc. (2001) Adventure|Animation|Children 0.87
Finding Nemo (2003) Adventure|Animation|Children 0.87
Shrek (2001) Adventure|Animation|Children 0.87
Step 5: Enhance with More Features
# If you have more metadata (cast, director, keywords)
movies['combined_features'] = (
    movies['genres'].fillna('') + ' ' +
    movies['director'].fillna('') + ' ' +
    movies['cast'].fillna('') + ' ' +
    movies['keywords'].fillna('')
)
# Different weights for different features
from scipy.sparse import hstack

class WeightedTfidfVectorizer:
    def __init__(self):
        self.genre_vec = TfidfVectorizer(stop_words='english')
        self.cast_vec = TfidfVectorizer(stop_words='english', max_features=50)

    def fit_transform(self, movies_df):
        genre_matrix = self.genre_vec.fit_transform(movies_df['genres'])
        cast_matrix = self.cast_vec.fit_transform(movies_df['cast'])
        # Combine with weights (genres: 0.6, cast: 0.4)
        combined = hstack([genre_matrix * 0.6, cast_matrix * 0.4])
        return combined

weighted_vec = WeightedTfidfVectorizer()
weighted_matrix = weighted_vec.fit_transform(movies)
Building a Collaborative Filtering Recommender
Now let's build a user-based collaborative filter that finds similar users and recommends based on their preferences.
Step 1: Create User-Item Matrix
# Create user-item rating matrix
user_item_matrix = ratings.pivot_table(
    index='userId',
    columns='movieId',
    values='rating'
)
print(f"Matrix shape: {user_item_matrix.shape}")
# Output: Matrix shape: (610, 9724)
# 610 users, 9724 movies
# Check sparsity
sparsity = 1 - (user_item_matrix.count().sum() / (user_item_matrix.shape[0] * user_item_matrix.shape[1]))
print(f"Sparsity: {sparsity:.2%}")
# Output: Sparsity: 98.30%
# Most entries are missing - typical for real-world data!
Step 2: Calculate User Similarity
from sklearn.metrics.pairwise import cosine_similarity
# Fill NaN with 0 for similarity calculation
user_item_filled = user_item_matrix.fillna(0)
# Calculate user-user similarity
user_similarity = cosine_similarity(user_item_filled)
# Convert to DataFrame for easier manipulation
user_similarity_df = pd.DataFrame(
    user_similarity,
    index=user_item_matrix.index,
    columns=user_item_matrix.index
)
print(user_similarity_df.head())
Step 3: Generate User-Based Recommendations
def get_collaborative_recommendations(user_id, top_n=10, n_similar_users=5):
    """
    Get recommendations using user-based collaborative filtering

    Args:
        user_id: Target user ID
        top_n: Number of recommendations to return
        n_similar_users: Number of similar users to consider

    Returns:
        DataFrame of recommended movies with predicted ratings
    """
    # Get similar users (position 0 is the user itself)
    similar_users = user_similarity_df[user_id].sort_values(ascending=False)[1:n_similar_users+1]
    # Get movies not yet rated by the target user
    user_ratings = user_item_matrix.loc[user_id]
    unrated_movies = user_ratings[user_ratings.isna()].index
    # Predict ratings for unrated movies
    predictions = {}
    for movie_id in unrated_movies:
        # Get ratings from similar users
        similar_user_ratings = user_item_matrix.loc[similar_users.index, movie_id]
        # Remove NaN values
        valid_ratings = similar_user_ratings.dropna()
        if len(valid_ratings) > 0:
            # Weighted average based on user similarity
            weights = similar_users[valid_ratings.index]
            predicted_rating = np.average(valid_ratings, weights=weights)
            predictions[movie_id] = predicted_rating
    # Sort by predicted rating
    sorted_predictions = sorted(predictions.items(), key=lambda x: x[1], reverse=True)
    # Get top N recommendations
    top_recommendations = sorted_predictions[:top_n]
    # Convert to a DataFrame with movie titles
    rec_df = pd.DataFrame(top_recommendations, columns=['movieId', 'predicted_rating'])
    rec_df = rec_df.merge(movies[['movieId', 'title', 'genres']], on='movieId')
    return rec_df

# Test the recommender
user_recommendations = get_collaborative_recommendations(user_id=1, top_n=5)
print(user_recommendations)
Expected Output:
movieId predicted_rating title genres
0 2571 4.8 Matrix, The (1999) Action|Sci-Fi|Thriller
1 2959 4.7 Fight Club (1999) Action|Crime|Drama
2 1196 4.6 Star Wars: Episode V (1980) Action|Adventure|Sci-Fi
3 4993 4.5 Lord of the Rings (2001) Adventure|Fantasy
4 858 4.4 Godfather, The (1972) Crime|Drama
Step 4: Item-Based Collaborative Filtering
Item-based filtering is often more scalable and stable than user-based: item-item similarities drift slowly as new ratings arrive, so the similarity matrix can be precomputed and cached offline.
# Transpose to get item-item matrix
item_user_matrix = user_item_matrix.T
# Calculate item-item similarity
item_similarity = cosine_similarity(item_user_matrix.fillna(0))
item_similarity_df = pd.DataFrame(
    item_similarity,
    index=user_item_matrix.columns,
    columns=user_item_matrix.columns
)
def get_item_based_recommendations(user_id, top_n=10):
    """
    Get recommendations using item-based collaborative filtering
    """
    # Get the user's rated movies
    user_ratings = user_item_matrix.loc[user_id].dropna()
    # For each rated movie, find similar movies
    recommendations = {}
    for movie_id, rating in user_ratings.items():
        # Get the 10 most similar items (position 0 is the item itself)
        similar_items = item_similarity_df[movie_id].sort_values(ascending=False)[1:11]
        for sim_movie_id, similarity in similar_items.items():
            # Only consider movies the user hasn't rated yet
            if pd.isna(user_item_matrix.loc[user_id, sim_movie_id]):
                # Weight by the user's rating and the item similarity
                score = rating * similarity
                recommendations[sim_movie_id] = recommendations.get(sim_movie_id, 0) + score
    # Sort and get top N
    sorted_recs = sorted(recommendations.items(), key=lambda x: x[1], reverse=True)[:top_n]
    # Create result DataFrame
    rec_df = pd.DataFrame(sorted_recs, columns=['movieId', 'score'])
    rec_df = rec_df.merge(movies[['movieId', 'title', 'genres']], on='movieId')
    return rec_df

# Test item-based recommender
item_recs = get_item_based_recommendations(user_id=1, top_n=5)
print(item_recs)
Evaluation Metrics
How do we know if our recommender is good?
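Whatever metric we pick, we first need held-out data: hide some ratings during training and try to predict them afterwards. A minimal sketch with a toy frame (named `ratings_df` so it doesn't clobber the real `ratings`; a random row split is shown, though production systems often prefer time-based splits):

```python
import pandas as pd

# Toy ratings frame with the same columns as the MovieLens file
ratings_df = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3, 3, 3, 4],
    "movieId": [10, 20, 30, 10, 40, 20, 30, 40, 50, 10],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0, 4.0, 5.0, 2.5, 3.5],
})

# Hold out 20% of the rows as a test set
test = ratings_df.sample(frac=0.2, random_state=42)
train = ratings_df.drop(test.index)

print(len(train), len(test))  # 8 2
```

Train the recommender on `train`, predict the hidden ratings in `test`, and feed the pairs into the metrics below.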
1. Accuracy Metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error
def evaluate_recommender(predictions, actuals):
    """
    Evaluate recommender predictions against held-out ratings
    """
    # RMSE (Root Mean Squared Error)
    rmse = np.sqrt(mean_squared_error(actuals, predictions))
    # MAE (Mean Absolute Error)
    mae = mean_absolute_error(actuals, predictions)
    return {'RMSE': rmse, 'MAE': mae}

# Example evaluation (predicted_ratings and actual_ratings come from a held-out test set)
print(evaluate_recommender(predicted_ratings, actual_ratings))
# Output: {'RMSE': 0.87, 'MAE': 0.68}
2. Ranking Metrics
def precision_at_k(recommended, relevant, k=10):
    """
    Precision@K: proportion of the top-K recommended items that are relevant
    """
    recommended_k = recommended[:k]
    relevant_recommended = len(set(recommended_k) & set(relevant))
    return relevant_recommended / k

def recall_at_k(recommended, relevant, k=10):
    """
    Recall@K: proportion of relevant items that appear in the top K
    """
    recommended_k = recommended[:k]
    relevant_recommended = len(set(recommended_k) & set(relevant))
    return relevant_recommended / len(relevant) if len(relevant) > 0 else 0

def ndcg_at_k(recommended, relevant, k=10):
    """
    NDCG@K: Normalized Discounted Cumulative Gain
    Accounts for the position of relevant items
    """
    dcg = 0
    for i, item in enumerate(recommended[:k]):
        if item in relevant:
            dcg += 1 / np.log2(i + 2)  # +2 because index starts at 0
    # Ideal DCG: all relevant items ranked at the top
    idcg = sum(1 / np.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0
Key Takeaways
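A quick hand check of Precision@K and Recall@K on a toy list (made-up item IDs) helps confirm the definitions:

```python
# Top-5 recommendations and the items the user actually engaged with
recommended = [10, 20, 30, 40, 50]
relevant = {20, 50, 60}

hits = len(set(recommended[:5]) & relevant)  # 20 and 50 are hits -> 2
precision = hits / 5            # 2 / 5 = 0.4
recall = hits / len(relevant)   # 2 / 3
print(precision, round(recall, 3))  # 0.4 0.667
```

Note the trade-off: growing K can only raise recall, while precision usually falls, which is why the two are reported together.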
Content-Based Filtering: works from item metadata alone and explains its recommendations easily, but tends to over-specialize.
Collaborative Filtering: learns from interaction patterns and can surprise users, but suffers from sparsity and cold start; item-based variants are usually easier to scale.
Production Considerations: precompute similarity matrices offline, watch matrix sparsity, and evaluate with both accuracy metrics (RMSE/MAE) and ranking metrics (Precision@K, Recall@K, NDCG@K).
What's Next?
In Part 2, we'll dive into hybrid approaches and the more advanced techniques that production systems build on.
These techniques form the foundation of modern recommender systems used at companies like Netflix, Spotify, and Amazon.
Want to discuss recommender systems? Connect with me on [LinkedIn](https://www.linkedin.com/in/prashantjha-ds) or drop me an email!