Building Recommender Systems - Part 1: Foundations

Published on December 16, 2025

Introduction

Recommender systems are everywhere—from Netflix suggesting your next binge-worthy show to Amazon recommending products you didn't know you needed. These systems power personalization at scale, driving engagement and revenue for businesses worldwide.

In this three-part series, we'll build recommender systems from the ground up, starting with fundamentals and progressing to advanced techniques used in production. By the end, you'll understand not just the theory, but how to implement these systems in Python with real datasets.

Series Roadmap:

  • Part 1 (This Post): Foundations, Content-Based & Collaborative Filtering
  • Part 2: Matrix Factorization and Factorization Machines
  • Part 3: Deep Learning Approaches and Production Systems

    What is a Recommender System?

    A recommender system predicts user preferences for items they haven't interacted with yet. The goal is to surface relevant content that maximizes user engagement and satisfaction.

    Real-World Examples

    Netflix: "Because you watched Stranger Things..."

  • Analyzes viewing history, ratings, and behavior patterns
  • Considers content attributes (genre, actors, directors)
  • Personalizes the homepage for each user

    Amazon: "Customers who bought this also bought..."

  • Tracks purchase history and browsing behavior
  • Uses collaborative patterns across millions of users
  • Reportedly drives around 35% of total revenue through recommendations

    Spotify: Discover Weekly playlist

  • Learns from listening patterns and skips
  • Analyzes audio features and playlist co-occurrence
  • Generates 40+ million personalized playlists weekly

    LinkedIn: "People You May Know"

  • Leverages network structure and shared connections
  • Considers profile similarity and interaction patterns
  • Drives network growth and platform engagement

    Types of Recommender Systems

    1. Content-Based Filtering

    Core Idea: Recommend items similar to what the user liked in the past.

    How it works:

  • Build profiles for users and items based on features
  • Calculate similarity between items
  • Recommend items similar to user's past preferences
  • Example: If you liked "The Matrix" (sci-fi, action, 1999), recommend "Inception" (sci-fi, action, 2010)
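    To make this concrete, here is a tiny self-contained sketch of the cosine-similarity computation that drives content-based matching. The one-hot genre vectors and movie encodings are hypothetical, chosen only to illustrate the idea:

```python
import numpy as np

# Hypothetical one-hot genre vectors over [Sci-Fi, Action, Comedy, Romance]
the_matrix = np.array([1.0, 1.0, 0.0, 0.0])  # Sci-Fi, Action
inception  = np.array([1.0, 1.0, 0.0, 0.0])  # Sci-Fi, Action
notebook   = np.array([0.0, 0.0, 0.0, 1.0])  # Romance

def cosine(a, b):
    """Cosine similarity: 1.0 = same direction, 0.0 = no overlap."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(round(cosine(the_matrix, inception), 3))  # 1.0 -> same genre profile
print(round(cosine(the_matrix, notebook), 3))   # 0.0 -> no genre overlap
```

    Movies whose feature vectors point in the same direction score near 1.0 and get recommended; orthogonal vectors score 0.0.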

    Advantages:

  • No cold-start problem for new items
  • Transparent recommendations (explainable)
  • User independence (no need for other users' data)

    Limitations:

  • Limited serendipity (stuck in a filter bubble)
  • Requires rich item features
  • Can't discover new interests

    2. Collaborative Filtering

    Core Idea: Recommend items that similar users liked.

    Collaborative Filtering Diagram

    How it works:

  • Find users with similar taste (user-based)
  • Or find items with similar rating patterns (item-based)
  • Recommend based on collective wisdom
  • Example: Users who liked movies A, B, and C also liked movie D → recommend D
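    The prediction step behind this idea is just a similarity-weighted average of the neighbors' ratings. A minimal sketch with made-up ratings and similarity weights:

```python
import numpy as np

# Hypothetical: three users similar to our target user have rated movie D
neighbor_ratings = np.array([5.0, 4.0, 4.5])  # their ratings of D
neighbor_sims    = np.array([0.9, 0.7, 0.5])  # cosine similarity to target

# Similarity-weighted average = predicted rating of D for the target user
predicted = np.average(neighbor_ratings, weights=neighbor_sims)
print(round(predicted, 2))  # 4.55
```

    More-similar users pull the prediction more strongly toward their own ratings, which is exactly what the full implementation below does over the whole rating matrix.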

    Advantages:

  • No need for item features
  • Discovers unexpected preferences
  • Leverages collective intelligence

    Limitations:

  • Cold-start problem for new users/items
  • Sparsity issues (most users rate few items)
  • Scalability challenges

    3. Hybrid Approaches

    Combine multiple techniques to get the best of both worlds. We'll cover these in Parts 2 and 3.
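    As a preview, the simplest hybrid is a weighted blend of the two models' scores. A toy sketch (the score arrays and the weight alpha are hypothetical):

```python
import numpy as np

# Hypothetical normalized scores for the same three candidate items
content_scores = np.array([0.9, 0.2, 0.5])  # from a content-based model
collab_scores  = np.array([0.3, 0.9, 0.1])  # from a collaborative model

# Weighted hybrid: alpha controls how much we trust the content model
alpha = 0.4
hybrid_scores = alpha * content_scores + (1 - alpha) * collab_scores
best_item = int(np.argmax(hybrid_scores))
print(best_item)  # item 1 wins once both signals are combined
```

    Neither model alone ranks item 1 first, but the blend does — a small example of why hybrids often beat either component.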

    Building a Content-Based Recommender

    Let's build a movie recommender using the MovieLens dataset. We'll start from genre features; Step 5 shows how to fold in richer metadata such as directors and cast.

    Step 1: Setup and Data Loading

    python
    import pandas as pd
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    
    # Load MovieLens data
    movies = pd.read_csv('movies.csv')
    ratings = pd.read_csv('ratings.csv')
    
    # Sample data structure
    # movies: movieId, title, genres
    # ratings: userId, movieId, rating, timestamp
    
    print(movies.head())
    #    movieId                    title                 genres
    # 0        1         Toy Story (1995)  Adventure|Animation|Children
    # 1        2           Jumanji (1995)  Adventure|Children|Fantasy
    # 2        3  Grumpier Old Men (1995)              Comedy|Romance

    Step 2: Create Item Profiles

    We'll use TF-IDF to convert genres into numerical vectors.

    python
    # Create a combined feature from genres (you can add more features).
    # regex=False: '|' is a regex metacharacter, so replace it literally.
    movies['features'] = movies['genres'].str.replace('|', ' ', regex=False)
    
    # Create TF-IDF vectors
    tfidf = TfidfVectorizer(stop_words='english')
    tfidf_matrix = tfidf.fit_transform(movies['features'])
    
    print(f"TF-IDF Matrix Shape: {tfidf_matrix.shape}")
    # e.g. Output: TF-IDF Matrix Shape: (9742, 23)
    # 9742 movies; one column per unique genre token (the exact count
    # depends on how hyphenated genres like Sci-Fi are tokenized)

    Step 3: Calculate Item Similarity

    python
    # Compute cosine similarity between all movie pairs
    cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
    
    # Create a mapping from movie titles to indices
    indices = pd.Series(movies.index, index=movies['title']).drop_duplicates()
    
    print(f"Similarity matrix shape: {cosine_sim.shape}")
    # Output: (9742, 9742)

    Step 4: Generate Recommendations

    python
    def get_content_recommendations(title, top_n=10):
        """
        Get top N movie recommendations based on content similarity
    
        Args:
            title: Movie title to find recommendations for
            top_n: Number of recommendations to return
    
        Returns:
            List of recommended movie titles
        """
        # Get the index of the movie
        idx = indices[title]
    
        # Get similarity scores for all movies with this movie
        sim_scores = list(enumerate(cosine_sim[idx]))
    
        # Sort movies by similarity score
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
        # Get top N most similar movies (excluding itself)
        sim_scores = sim_scores[1:top_n+1]
    
        # Get movie indices
        movie_indices = [i[0] for i in sim_scores]
    
        # Return movie titles and scores
        recommendations = movies.iloc[movie_indices][['title', 'genres']].copy()
        recommendations['similarity_score'] = [score[1] for score in sim_scores]
    
        return recommendations
    
    # Test the recommender
    recommendations = get_content_recommendations('Toy Story (1995)', top_n=5)
    print(recommendations)

    Example Output (values are illustrative):

    code
                                  title                          genres  similarity_score
    Toy Story 2 (1999)    Adventure|Animation|Children          0.92
    Antz (1998)           Adventure|Animation|Children          0.87
    Monsters, Inc. (2001) Adventure|Animation|Children          0.87
    Finding Nemo (2003)   Adventure|Animation|Children          0.87
    Shrek (2001)          Adventure|Animation|Children          0.87

    Step 5: Enhance with More Features

    python
    # If you have more metadata (cast, director, keywords)
    movies['combined_features'] = (
        movies['genres'].fillna('') + ' ' +
        movies['director'].fillna('') + ' ' +
        movies['cast'].fillna('') + ' ' +
        movies['keywords'].fillna('')
    )
    
    # Different weights for different features
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    class WeightedTfidfVectorizer:
        def __init__(self):
            self.genre_vec = TfidfVectorizer(stop_words='english')
            self.cast_vec = TfidfVectorizer(stop_words='english', max_features=50)
    
        def fit_transform(self, movies_df):
            genre_matrix = self.genre_vec.fit_transform(movies_df['genres'])
            cast_matrix = self.cast_vec.fit_transform(movies_df['cast'])
    
            # Combine with weights (genres: 0.6, cast: 0.4)
            from scipy.sparse import hstack
            combined = hstack([genre_matrix * 0.6, cast_matrix * 0.4])
    
            return combined
    
    weighted_vec = WeightedTfidfVectorizer()
    weighted_matrix = weighted_vec.fit_transform(movies)

    Building a Collaborative Filtering Recommender

    Now let's build a user-based collaborative filter that finds similar users and recommends based on their preferences.

    Step 1: Create User-Item Matrix

    python
    # Create user-item rating matrix
    user_item_matrix = ratings.pivot_table(
        index='userId',
        columns='movieId',
        values='rating'
    )
    
    print(f"Matrix shape: {user_item_matrix.shape}")
    # Output: Matrix shape: (610, 9724)
    # 610 users, 9724 movies
    
    # Check sparsity
    sparsity = 1 - (user_item_matrix.count().sum() / (user_item_matrix.shape[0] * user_item_matrix.shape[1]))
    print(f"Sparsity: {sparsity:.2%}")
    # Output: Sparsity: 98.30%
    # Most entries are missing - typical for real-world data!

    Step 2: Calculate User Similarity

    python
    from sklearn.metrics.pairwise import cosine_similarity
    
    # Fill NaN with 0 for similarity calculation
    user_item_filled = user_item_matrix.fillna(0)
    
    # Calculate user-user similarity
    user_similarity = cosine_similarity(user_item_filled)
    
    # Convert to DataFrame for easier manipulation
    user_similarity_df = pd.DataFrame(
        user_similarity,
        index=user_item_matrix.index,
        columns=user_item_matrix.index
    )
    
    print(user_similarity_df.head())

    Step 3: Generate User-Based Recommendations

    python
    def get_collaborative_recommendations(user_id, top_n=10, n_similar_users=5):
        """
        Get recommendations using user-based collaborative filtering
    
        Args:
            user_id: Target user ID
            top_n: Number of recommendations to return
            n_similar_users: Number of similar users to consider
    
        Returns:
            List of recommended movie IDs with predicted ratings
        """
        # Get similar users (excluding the user itself)
        similar_users = user_similarity_df[user_id].drop(user_id).sort_values(ascending=False).head(n_similar_users)
    
        # Get movies rated by similar users but not by target user
        user_ratings = user_item_matrix.loc[user_id]
        unrated_movies = user_ratings[user_ratings.isna()].index
    
        # Predict ratings for unrated movies
        predictions = {}
    
        for movie_id in unrated_movies:
            # Get ratings from similar users
            similar_user_ratings = user_item_matrix.loc[similar_users.index, movie_id]
    
            # Remove NaN values
            valid_ratings = similar_user_ratings.dropna()
    
            if len(valid_ratings) > 0:
                # Weighted average based on user similarity
                weights = similar_users[valid_ratings.index]
                predicted_rating = np.average(valid_ratings, weights=weights)
                predictions[movie_id] = predicted_rating
    
        # Sort by predicted rating
        sorted_predictions = sorted(predictions.items(), key=lambda x: x[1], reverse=True)
    
        # Get top N recommendations
        top_recommendations = sorted_predictions[:top_n]
    
        # Convert to DataFrame with movie titles
        rec_df = pd.DataFrame(top_recommendations, columns=['movieId', 'predicted_rating'])
        rec_df = rec_df.merge(movies[['movieId', 'title', 'genres']], on='movieId')
    
        return rec_df
    
    # Test the recommender
    user_recommendations = get_collaborative_recommendations(user_id=1, top_n=5)
    print(user_recommendations)

    Example Output (values are illustrative):

    code
       movieId  predicted_rating                         title                 genres
    0     2571              4.8  Matrix, The (1999)            Action|Sci-Fi|Thriller
    1     2959              4.7  Fight Club (1999)             Action|Crime|Drama
    2     1196              4.6  Star Wars: Episode V (1980)   Action|Adventure|Sci-Fi
    3     4993              4.5  Lord of the Rings (2001)      Adventure|Fantasy
    4      858              4.4  Godfather, The (1972)         Crime|Drama

    Step 4: Item-Based Collaborative Filtering

    Item-based filtering is often more scalable and stable than user-based.

    python
    # Transpose to get item-item matrix
    item_user_matrix = user_item_matrix.T
    
    # Calculate item-item similarity
    item_similarity = cosine_similarity(item_user_matrix.fillna(0))
    
    item_similarity_df = pd.DataFrame(
        item_similarity,
        index=user_item_matrix.columns,
        columns=user_item_matrix.columns
    )
    
    def get_item_based_recommendations(user_id, top_n=10):
        """
        Get recommendations using item-based collaborative filtering
        """
        # Get user's rated movies
        user_ratings = user_item_matrix.loc[user_id].dropna()
    
        # For each rated movie, find similar movies
        recommendations = {}
    
        for movie_id, rating in user_ratings.items():
            # Get similar items
            similar_items = item_similarity_df[movie_id].drop(movie_id).sort_values(ascending=False).head(10)
    
            for sim_movie_id, similarity in similar_items.items():
                if pd.isna(user_item_matrix.loc[user_id, sim_movie_id]):
                    # Weight by user's rating and item similarity
                    score = rating * similarity
    
                    if sim_movie_id in recommendations:
                        recommendations[sim_movie_id] += score
                    else:
                        recommendations[sim_movie_id] = score
    
        # Sort and get top N
        sorted_recs = sorted(recommendations.items(), key=lambda x: x[1], reverse=True)[:top_n]
    
        # Create result DataFrame
        rec_df = pd.DataFrame(sorted_recs, columns=['movieId', 'score'])
        rec_df = rec_df.merge(movies[['movieId', 'title', 'genres']], on='movieId')
    
        return rec_df
    
    # Test item-based recommender
    item_recs = get_item_based_recommendations(user_id=1, top_n=5)
    print(item_recs)

    Evaluation Metrics

    How do we know if our recommender is good?

    1. Accuracy Metrics

    python
    from sklearn.metrics import mean_squared_error, mean_absolute_error
    
    def evaluate_recommender(predictions, actuals):
        """
        Evaluate recommender system predictions
        """
        # RMSE (Root Mean Squared Error)
        rmse = np.sqrt(mean_squared_error(actuals, predictions))
    
        # MAE (Mean Absolute Error)
        mae = mean_absolute_error(actuals, predictions)
    
        return {'RMSE': rmse, 'MAE': mae}
    
    # Example evaluation (assumes predicted_ratings and actual_ratings come
    # from a held-out test split of the ratings data)
    print(evaluate_recommender(predicted_ratings, actual_ratings))
    # e.g. {'RMSE': 0.87, 'MAE': 0.68}

    2. Ranking Metrics

    python
    def precision_at_k(recommended, relevant, k=10):
        """
        Precision@K: Proportion of recommended items that are relevant
        """
        recommended_k = recommended[:k]
        relevant_recommended = len(set(recommended_k) & set(relevant))
        return relevant_recommended / k
    
    def recall_at_k(recommended, relevant, k=10):
        """
        Recall@K: Proportion of relevant items that are recommended
        """
        recommended_k = recommended[:k]
        relevant_recommended = len(set(recommended_k) & set(relevant))
        return relevant_recommended / len(relevant) if len(relevant) > 0 else 0
    
    def ndcg_at_k(recommended, relevant, k=10):
        """
        NDCG@K: Normalized Discounted Cumulative Gain
        Accounts for position of relevant items
        """
        dcg = 0
        for i, item in enumerate(recommended[:k]):
            if item in relevant:
                dcg += 1 / np.log2(i + 2)  # +2 because index starts at 0
    
        # Ideal DCG
        idcg = sum([1 / np.log2(i + 2) for i in range(min(len(relevant), k))])
    
        return dcg / idcg if idcg > 0 else 0
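    A quick sanity check of Precision@K and Recall@K on hypothetical lists, computed inline so the snippet stands on its own:

```python
# Hypothetical ranked recommendations and ground-truth relevant items
recommended = [10, 20, 30, 40, 50]
relevant = {20, 50, 99}
k = 5

hits = len(set(recommended[:k]) & relevant)
precision = hits / k            # 2 hits out of 5 recommended -> 0.4
recall = hits / len(relevant)   # 2 of 3 relevant items surfaced
print(precision, round(recall, 2))  # 0.4 0.67
```

    High precision with low recall (or vice versa) tells you whether your list is too narrow or too noisy, which is why both are reported together.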

    Key Takeaways

    Content-Based Filtering:

  • Pros: No cold-start for items, explainable, user-independent
  • Cons: Limited discovery, requires feature engineering
  • Use when: You have rich item metadata, want explainability

    Collaborative Filtering:

  • Pros: No feature engineering, discovers unexpected patterns
  • Cons: Cold-start problem, sparsity issues
  • Use when: You have interaction data, want serendipity

    Production Considerations:

  • Start simple: Item-based CF is often the best baseline
  • Handle sparsity: Use implicit feedback (clicks, views) not just ratings
  • Scale matters: Precompute similarities, use approximate nearest neighbors
  • Diversity: Don't just recommend similar items—add randomness
  • Freshness: Update recommendations regularly with new data
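    For the precompute point above, here is a sketch of building a neighbor index once offline, using scikit-learn's NearestNeighbors on synthetic stand-in features (the real input would be your TF-IDF or rating matrix). Note that this index is exact, not approximate; at production scale you would swap in an ANN library such as Faiss or Annoy:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic item features standing in for the real TF-IDF matrix
rng = np.random.default_rng(0)
item_features = rng.random((100, 20))  # 100 hypothetical items

# Build the index once, offline; query it cheaply at serving time
index = NearestNeighbors(n_neighbors=6, metric='cosine').fit(item_features)
distances, neighbors = index.kneighbors(item_features)

# neighbors[i][0] is item i itself; the rest are its top-5 similar items
top5 = neighbors[:, 1:]
print(top5.shape)  # (100, 5)
```

    Serving then reduces to a dictionary lookup per item, rather than an O(n) similarity scan per request.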

    What's Next?

    In Part 2, we'll dive into:

  • Matrix Factorization: SVD, ALS, and how they power Netflix-scale recommendations
  • Factorization Machines: Incorporating side features for better predictions
  • Handling implicit feedback: Clicks, views, and session data
  • Production optimization: Making recommendations fast and scalable

    These techniques form the foundation of modern recommender systems used at companies like Netflix, Spotify, and Amazon.


    Want to discuss recommender systems? Connect with me on [LinkedIn](https://www.linkedin.com/in/prashantjha-ds) or drop an email!