Building Recommender Systems - Part 1: Foundations

Published on December 16, 2025

Introduction

Recommender systems are everywhere—from Netflix suggesting your next binge-worthy show to Amazon recommending products you didn't know you needed. These systems power personalization at scale, driving engagement and revenue for businesses worldwide.

In this three-part series, we'll build recommender systems from the ground up, starting with fundamentals and progressing to advanced techniques used in production. By the end, you'll understand not just the theory, but how to implement these systems in Python with real datasets.

Series Roadmap:

  • Part 1 (This Post): Foundations, Content-Based & Collaborative Filtering
  • Part 2: Matrix Factorization and Factorization Machines
  • Part 3: Deep Learning Approaches and Production Systems

    What is a Recommender System?

    A recommender system predicts user preferences for items they haven't interacted with yet. The goal is to surface relevant content that maximizes user engagement and satisfaction.

    Real-World Examples

    Netflix: "Because you watched Stranger Things..."

  • Analyzes viewing history, ratings, and behavior patterns
  • Considers content attributes (genre, actors, directors)
  • Personalizes the homepage for each user

    Amazon: "Customers who bought this also bought..."

  • Tracks purchase history and browsing behavior
  • Uses collaborative patterns across millions of users
  • Reportedly drives around 35% of total revenue through recommendations

    Spotify: Discover Weekly playlist

  • Learns from listening patterns and skips
  • Analyzes audio features and playlist co-occurrence
  • Generates 40+ million personalized playlists weekly

    LinkedIn: "People You May Know"

  • Leverages network structure and shared connections
  • Considers profile similarity and interaction patterns
  • Drives network growth and platform engagement

    Types of Recommender Systems

    1. Content-Based Filtering

    Core Idea: Recommend items similar to what the user liked in the past.

    How it works:

  • Build profiles for users and items based on features
  • Calculate similarity between items
  • Recommend items similar to user's past preferences
  • Example: If you liked "The Matrix" (sci-fi, action, 1999), recommend "Inception" (sci-fi, action, 2010)
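    To make this concrete, here is a tiny self-contained sketch of the cosine-similarity computation that drives content-based matching. The one-hot genre vectors and movie encodings are hypothetical, chosen only to illustrate the idea:

```python
import numpy as np

# Hypothetical one-hot genre vectors over [Sci-Fi, Action, Comedy, Romance]
the_matrix = np.array([1.0, 1.0, 0.0, 0.0])  # Sci-Fi, Action
inception  = np.array([1.0, 1.0, 0.0, 0.0])  # Sci-Fi, Action
notebook   = np.array([0.0, 0.0, 0.0, 1.0])  # Romance

def cosine(a, b):
    """Cosine similarity: 1.0 = same direction, 0.0 = no overlap."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(round(cosine(the_matrix, inception), 3))  # 1.0 -> same genre profile
print(round(cosine(the_matrix, notebook), 3))   # 0.0 -> no genre overlap
```

    Movies whose feature vectors point in the same direction score near 1.0 and get recommended; orthogonal vectors score 0.0.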

    Advantages:

  • No cold-start problem for new items
  • Transparent recommendations (explainable)
  • User independence (no need for other users' data)

    Limitations:

  • Limited serendipity (stuck in a filter bubble)
  • Requires rich item features
  • Can't discover new interests

    2. Collaborative Filtering

    Core Idea: Recommend items that similar users liked.

    Collaborative Filtering Diagram

    How it works:

  • Find users with similar taste (user-based)
  • Or find items with similar rating patterns (item-based)
  • Recommend based on collective wisdom
  • Example: Users who liked movies A, B, and C also liked movie D → recommend D
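    The prediction step behind this idea is just a similarity-weighted average of the neighbors' ratings. A minimal sketch with made-up ratings and similarity weights:

```python
import numpy as np

# Hypothetical: three users similar to our target user have rated movie D
neighbor_ratings = np.array([5.0, 4.0, 4.5])  # their ratings of D
neighbor_sims    = np.array([0.9, 0.7, 0.5])  # cosine similarity to target

# Similarity-weighted average = predicted rating of D for the target user
predicted = np.average(neighbor_ratings, weights=neighbor_sims)
print(round(predicted, 2))  # 4.55
```

    More-similar users pull the prediction more strongly toward their own ratings, which is exactly what the full implementation below does over the whole rating matrix.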

    Advantages:

  • No need for item features
  • Discovers unexpected preferences
  • Leverages collective intelligence

    Limitations:

  • Cold-start problem for new users/items
  • Sparsity issues (most users rate few items)
  • Scalability challenges

    3. Hybrid Approaches

    Combine multiple techniques to get the best of both worlds. We'll cover these in Parts 2 and 3.
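    As a preview, the simplest hybrid is a weighted blend of the two models' scores. A toy sketch (the score arrays and the weight alpha are hypothetical):

```python
import numpy as np

# Hypothetical normalized scores for the same three candidate items
content_scores = np.array([0.9, 0.2, 0.5])  # from a content-based model
collab_scores  = np.array([0.3, 0.9, 0.1])  # from a collaborative model

# Weighted hybrid: alpha controls how much we trust the content model
alpha = 0.4
hybrid_scores = alpha * content_scores + (1 - alpha) * collab_scores
best_item = int(np.argmax(hybrid_scores))
print(best_item)  # item 1 wins once both signals are combined
```

    Neither model alone ranks item 1 first, but the blend does — a small example of why hybrids often beat either component.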

    Building a Content-Based Recommender

    Let's build a movie recommender using the MovieLens dataset. We'll start from genre features; Step 5 shows how to fold in richer metadata such as directors and cast.

    Step 1: Setup and Data Loading

    python
    import pandas as pd
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    
    # Load MovieLens data
    movies = pd.read_csv('movies.csv')
    ratings = pd.read_csv('ratings.csv')
    
    # Sample data structure
    # movies: movieId, title, genres
    # ratings: userId, movieId, rating, timestamp
    
    print(movies.head())
    #    movieId                    title                 genres
    # 0        1         Toy Story (1995)  Adventure|Animation|Children
    # 1        2           Jumanji (1995)  Adventure|Children|Fantasy
    # 2        3  Grumpier Old Men (1995)              Comedy|Romance

    Step 2: Create Item Profiles

    We'll use TF-IDF to convert genres into numerical vectors.

    python
    # Create a combined feature from genres (you can add more features).
    # regex=False: '|' is a regex metacharacter, so replace it literally.
    movies['features'] = movies['genres'].str.replace('|', ' ', regex=False)
    
    # Create TF-IDF vectors
    tfidf = TfidfVectorizer(stop_words='english')
    tfidf_matrix = tfidf.fit_transform(movies['features'])
    
    print(f"TF-IDF Matrix Shape: {tfidf_matrix.shape}")
    # e.g. Output: TF-IDF Matrix Shape: (9742, 23)
    # 9742 movies; one column per unique genre token (the exact count
    # depends on how hyphenated genres like Sci-Fi are tokenized)

    Step 3: Calculate Item Similarity

    python
    # Compute cosine similarity between all movie pairs
    cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
    
    # Create a mapping from movie titles to indices
    indices = pd.Series(movies.index, index=movies['title']).drop_duplicates()
    
    print(f"Similarity matrix shape: {cosine_sim.shape}")
    # Output: (9742, 9742)

    Step 4: Generate Recommendations

    python
    def get_content_recommendations(title, top_n=10):
        """
        Get top N movie recommendations based on content similarity
    
        Args:
            title: Movie title to find recommendations for
            top_n: Number of recommendations to return
    
        Returns:
            List of recommended movie titles
        """
        # Get the index of the movie
        idx = indices[title]
    
        # Get similarity scores for all movies with this movie
        sim_scores = list(enumerate(cosine_sim[idx]))
    
        # Sort movies by similarity score
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
        # Get top N most similar movies (excluding itself)
        sim_scores = sim_scores[1:top_n+1]
    
        # Get movie indices
        movie_indices = [i[0] for i in sim_scores]
    
        # Return movie titles and scores
        recommendations = movies.iloc[movie_indices][['title', 'genres']].copy()
        recommendations['similarity_score'] = [score[1] for score in sim_scores]
    
        return recommendations
    
    # Test the recommender
    recommendations = get_content_recommendations('Toy Story (1995)', top_n=5)
    print(recommendations)

    Example Output (values are illustrative):

    code
                                  title                          genres  similarity_score
    Toy Story 2 (1999)    Adventure|Animation|Children          0.92
    Antz (1998)           Adventure|Animation|Children          0.87
    Monsters, Inc. (2001) Adventure|Animation|Children          0.87
    Finding Nemo (2003)   Adventure|Animation|Children          0.87
    Shrek (2001)          Adventure|Animation|Children          0.87

    Step 5: Enhance with More Features

    python
    # If you have more metadata (cast, director, keywords)
    movies['combined_features'] = (
        movies['genres'].fillna('') + ' ' +
        movies['director'].fillna('') + ' ' +
        movies['cast'].fillna('') + ' ' +
        movies['keywords'].fillna('')
    )
    
    # Different weights for different features
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    class WeightedTfidfVectorizer:
        def __init__(self):
            self.genre_vec = TfidfVectorizer(stop_words='english')
            self.cast_vec = TfidfVectorizer(stop_words='english', max_features=50)
    
        def fit_transform(self, movies_df):
            genre_matrix = self.genre_vec.fit_transform(movies_df['genres'])
            cast_matrix = self.cast_vec.fit_transform(movies_df['cast'])
    
            # Combine with weights (genres: 0.6, cast: 0.4)
            from scipy.sparse import hstack
            combined = hstack([genre_matrix * 0.6, cast_matrix * 0.4])
    
            return combined
    
    weighted_vec = WeightedTfidfVectorizer()
    weighted_matrix = weighted_vec.fit_transform(movies)

    Building a Collaborative Filtering Recommender

    Now let's build a user-based collaborative filter that finds similar users and recommends based on their preferences.

    Step 1: Create User-Item Matrix

    python
    # Create user-item rating matrix
    user_item_matrix = ratings.pivot_table(
        index='userId',
        columns='movieId',
        values='rating'
    )
    
    print(f"Matrix shape: {user_item_matrix.shape}")
    # Output: Matrix shape: (610, 9724)
    # 610 users, 9724 movies
    
    # Check sparsity
    sparsity = 1 - (user_item_matrix.count().sum() / (user_item_matrix.shape[0] * user_item_matrix.shape[1]))
    print(f"Sparsity: {sparsity:.2%}")
    # Output: Sparsity: 98.30%
    # Most entries are missing - typical for real-world data!

    Step 2: Calculate User Similarity

    python
    from sklearn.metrics.pairwise import cosine_similarity
    
    # Fill NaN with 0 for similarity calculation
    user_item_filled = user_item_matrix.fillna(0)
    
    # Calculate user-user similarity
    user_similarity = cosine_similarity(user_item_filled)
    
    # Convert to DataFrame for easier manipulation
    user_similarity_df = pd.DataFrame(
        user_similarity,
        index=user_item_matrix.index,
        columns=user_item_matrix.index
    )
    
    print(user_similarity_df.head())

    Step 3: Generate User-Based Recommendations

    python
    def get_collaborative_recommendations(user_id, top_n=10, n_similar_users=5):
        """
        Get recommendations using user-based collaborative filtering
    
        Args:
            user_id: Target user ID
            top_n: Number of recommendations to return
            n_similar_users: Number of similar users to consider
    
        Returns:
            List of recommended movie IDs with predicted ratings
        """
        # Get similar users (excluding the user itself)
        similar_users = user_similarity_df[user_id].drop(user_id).sort_values(ascending=False).head(n_similar_users)
    
        # Get movies rated by similar users but not by target user
        user_ratings = user_item_matrix.loc[user_id]
        unrated_movies = user_ratings[user_ratings.isna()].index
    
        # Predict ratings for unrated movies
        predictions = {}
    
        for movie_id in unrated_movies:
            # Get ratings from similar users
            similar_user_ratings = user_item_matrix.loc[similar_users.index, movie_id]
    
            # Remove NaN values
            valid_ratings = similar_user_ratings.dropna()
    
            if len(valid_ratings) > 0:
                # Weighted average based on user similarity
                weights = similar_users[valid_ratings.index]
                predicted_rating = np.average(valid_ratings, weights=weights)
                predictions[movie_id] = predicted_rating
    
        # Sort by predicted rating
        sorted_predictions = sorted(predictions.items(), key=lambda x: x[1], reverse=True)
    
        # Get top N recommendations
        top_recommendations = sorted_predictions[:top_n]
    
        # Convert to DataFrame with movie titles
        rec_df = pd.DataFrame(top_recommendations, columns=['movieId', 'predicted_rating'])
        rec_df = rec_df.merge(movies[['movieId', 'title', 'genres']], on='movieId')
    
        return rec_df
    
    # Test the recommender
    user_recommendations = get_collaborative_recommendations(user_id=1, top_n=5)
    print(user_recommendations)

    Example Output (values are illustrative):

    code
       movieId  predicted_rating                         title                 genres
    0     2571              4.8  Matrix, The (1999)            Action|Sci-Fi|Thriller
    1     2959              4.7  Fight Club (1999)             Action|Crime|Drama
    2     1196              4.6  Star Wars: Episode V (1980)   Action|Adventure|Sci-Fi
    3     4993              4.5  Lord of the Rings (2001)      Adventure|Fantasy
    4      858              4.4  Godfather, The (1972)         Crime|Drama

    Step 4: Item-Based Collaborative Filtering

    Item-based filtering is often more scalable and stable than user-based.

    python
    # Transpose to get item-item matrix
    item_user_matrix = user_item_matrix.T
    
    # Calculate item-item similarity
    item_similarity = cosine_similarity(item_user_matrix.fillna(0))
    
    item_similarity_df = pd.DataFrame(
        item_similarity,
        index=user_item_matrix.columns,
        columns=user_item_matrix.columns
    )
    
    def get_item_based_recommendations(user_id, top_n=10):
        """
        Get recommendations using item-based collaborative filtering
        """
        # Get user's rated movies
        user_ratings = user_item_matrix.loc[user_id].dropna()
    
        # For each rated movie, find similar movies
        recommendations = {}
    
        for movie_id, rating in user_ratings.items():
            # Get similar items
            similar_items = item_similarity_df[movie_id].drop(movie_id).sort_values(ascending=False).head(10)
    
            for sim_movie_id, similarity in similar_items.items():
                if pd.isna(user_item_matrix.loc[user_id, sim_movie_id]):
                    # Weight by user's rating and item similarity
                    score = rating * similarity
    
                    if sim_movie_id in recommendations:
                        recommendations[sim_movie_id] += score
                    else:
                        recommendations[sim_movie_id] = score
    
        # Sort and get top N
        sorted_recs = sorted(recommendations.items(), key=lambda x: x[1], reverse=True)[:top_n]
    
        # Create result DataFrame
        rec_df = pd.DataFrame(sorted_recs, columns=['movieId', 'score'])
        rec_df = rec_df.merge(movies[['movieId', 'title', 'genres']], on='movieId')
    
        return rec_df
    
    # Test item-based recommender
    item_recs = get_item_based_recommendations(user_id=1, top_n=5)
    print(item_recs)

    Evaluation Metrics

    How do we know if our recommender is good?

    1. Accuracy Metrics

    python
    from sklearn.metrics import mean_squared_error, mean_absolute_error
    
    def evaluate_recommender(predictions, actuals):
        """
        Evaluate recommender system predictions
        """
        # RMSE (Root Mean Squared Error)
        rmse = np.sqrt(mean_squared_error(actuals, predictions))
    
        # MAE (Mean Absolute Error)
        mae = mean_absolute_error(actuals, predictions)
    
        return {'RMSE': rmse, 'MAE': mae}
    
    # Example evaluation (assumes predicted_ratings and actual_ratings come
    # from a held-out test split of the ratings data)
    print(evaluate_recommender(predicted_ratings, actual_ratings))
    # e.g. {'RMSE': 0.87, 'MAE': 0.68}

    2. Ranking Metrics

    python
    def precision_at_k(recommended, relevant, k=10):
        """
        Precision@K: Proportion of recommended items that are relevant
        """
        recommended_k = recommended[:k]
        relevant_recommended = len(set(recommended_k) & set(relevant))
        return relevant_recommended / k
    
    def recall_at_k(recommended, relevant, k=10):
        """
        Recall@K: Proportion of relevant items that are recommended
        """
        recommended_k = recommended[:k]
        relevant_recommended = len(set(recommended_k) & set(relevant))
        return relevant_recommended / len(relevant) if len(relevant) > 0 else 0
    
    def ndcg_at_k(recommended, relevant, k=10):
        """
        NDCG@K: Normalized Discounted Cumulative Gain
        Accounts for position of relevant items
        """
        dcg = 0
        for i, item in enumerate(recommended[:k]):
            if item in relevant:
                dcg += 1 / np.log2(i + 2)  # +2 because index starts at 0
    
        # Ideal DCG
        idcg = sum([1 / np.log2(i + 2) for i in range(min(len(relevant), k))])
    
        return dcg / idcg if idcg > 0 else 0
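    A quick sanity check of Precision@K and Recall@K on hypothetical lists, computed inline so the snippet stands on its own:

```python
# Hypothetical ranked recommendations and ground-truth relevant items
recommended = [10, 20, 30, 40, 50]
relevant = {20, 50, 99}
k = 5

hits = len(set(recommended[:k]) & relevant)
precision = hits / k            # 2 hits out of 5 recommended -> 0.4
recall = hits / len(relevant)   # 2 of 3 relevant items surfaced
print(precision, round(recall, 2))  # 0.4 0.67
```

    High precision with low recall (or vice versa) tells you whether your list is too narrow or too noisy, which is why both are reported together.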

    Key Takeaways

    Content-Based Filtering:

  • Pros: No cold-start for items, explainable, user-independent
  • Cons: Limited discovery, requires feature engineering
  • Use when: You have rich item metadata, want explainability

    Collaborative Filtering:

  • Pros: No feature engineering, discovers unexpected patterns
  • Cons: Cold-start problem, sparsity issues
  • Use when: You have interaction data, want serendipity

    Production Considerations:

  • Start simple: Item-based CF is often the best baseline
  • Handle sparsity: Use implicit feedback (clicks, views) not just ratings
  • Scale matters: Precompute similarities, use approximate nearest neighbors
  • Diversity: Don't just recommend similar items—add randomness
  • Freshness: Update recommendations regularly with new data
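    For the precompute point above, here is a sketch of building a neighbor index once offline, using scikit-learn's NearestNeighbors on synthetic stand-in features (the real input would be your TF-IDF or rating matrix). Note that this index is exact, not approximate; at production scale you would swap in an ANN library such as Faiss or Annoy:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic item features standing in for the real TF-IDF matrix
rng = np.random.default_rng(0)
item_features = rng.random((100, 20))  # 100 hypothetical items

# Build the index once, offline; query it cheaply at serving time
index = NearestNeighbors(n_neighbors=6, metric='cosine').fit(item_features)
distances, neighbors = index.kneighbors(item_features)

# neighbors[i][0] is item i itself; the rest are its top-5 similar items
top5 = neighbors[:, 1:]
print(top5.shape)  # (100, 5)
```

    Serving then reduces to a dictionary lookup per item, rather than an O(n) similarity scan per request.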

    What's Next?

    In Part 2, we'll dive into:

  • Matrix Factorization: SVD, ALS, and how they power Netflix-scale recommendations
  • Factorization Machines: Incorporating side features for better predictions
  • Handling implicit feedback: Clicks, views, and session data
  • Production optimization: Making recommendations fast and scalable

    These techniques form the foundation of modern recommender systems used at companies like Netflix, Spotify, and Amazon.


    Want to discuss recommender systems? Connect with me on [LinkedIn](https://www.linkedin.com/in/prashantjha-ds) or drop an email!