Recommendation system metrics are quantitative measures used to evaluate the performance and effectiveness of recommendation algorithms. These metrics help assess how well a system can predict user preferences, rank items, and provide valuable recommendations.
When to use which metric
| Metric | When to use |
|---|---|
| Precision@k | User sees only a few recs — those few must be highly relevant. |
| Recall@k | Retrieve as many relevant items as possible from a large catalog. |
| Hit Rate@k | Overall system effectiveness — does any rec hit? |
| MAP@k | Ranked recs where order matters and you want average precision across users. |
| NDCG@k | Ranked recs with graded relevance — higher positions matter more. |
| MRR@k | The first good rec matters most (search, top-of-feed). |
| ILD@k | Guard against filter-bubble — diversity inside a single list. |
| Novelty@k | Push users toward non-popular items. |
| Serendipity@k | Unexpected and relevant — delight discoveries. |
| Coverage | Long-tail health — what fraction of the catalog ever gets recommended. |
| CTR | Online click behaviour on recommended items. |
| Conversion Rate | Downstream action (purchase, signup) per recommendation. |
| User Satisfaction | Survey / explicit feedback. |
Precision@k
Proportion of relevant items among the top-k recommendations. Useful when the user sees only a few recommendations and we want those few to be highly relevant.
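The definition above can be sketched in a few lines of Python. The function name and the list/set input shapes are assumptions made for illustration; the sketch divides by k even when fewer than k items are returned, which is one common convention.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant.

    recommended: ranked list of item ids (best first), assumed free of duplicates.
    relevant: collection of item ids the user actually found relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant)
    hits = sum(1 for item in recommended[:k] if item in relevant)
    # Divide by k, so a short list is penalised for its missing slots.
    return hits / k


# Two of the five recommended items ("a" and "c") are relevant.
print(precision_at_k(["a", "b", "c", "d", "e"], {"a", "c", "f"}, k=5))
```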
Recall@k
Proportion of all relevant items for a user that appear in the top-k recommendations. Useful when we want to retrieve as many relevant items as possible from a large catalog.
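A minimal sketch of this ratio, assuming a ranked list of item ids and a set of relevant ids (both names are illustrative). Returning 0.0 for a user with no relevant items is a choice made here, not something the definition dictates.

```python
def recall_at_k(recommended, relevant, k):
    """Fraction of the user's relevant items that appear in the top-k list."""
    relevant = set(relevant)
    if not relevant:
        # Undefined in principle; 0.0 is an assumed convention for this sketch.
        return 0.0
    hits = len(set(recommended[:k]) & relevant)
    return hits / len(relevant)


# "a" and "c" are retrieved, "f" is missed: 2 of 3 relevant items found.
print(recall_at_k(["a", "b", "c", "d", "e"], {"a", "c", "f"}, k=5))
```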
Hit Rate@k
Proportion of users for whom at least one relevant item appears in their top-k recommendations. A coarse measure of overall system effectiveness: it does not differentiate between one and multiple relevant recommendations.
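As a rough sketch, hit rate can be computed over per-user dictionaries. The input layout (user id mapped to a ranked list, and user id mapped to a set of relevant items) is an assumption for the example.

```python
def hit_rate_at_k(recommendations, relevant_items, k):
    """Share of users with at least one relevant item in their top-k list.

    recommendations: dict of user id -> ranked list of item ids.
    relevant_items: dict of user id -> collection of relevant item ids.
    """
    users = list(recommendations)
    if not users:
        return 0.0
    hits = 0
    for user in users:
        top_k = set(recommendations[user][:k])
        if top_k & set(relevant_items.get(user, ())):
            hits += 1  # one hit per user, however many items matched
    return hits / len(users)


recs = {"u1": ["a", "b"], "u2": ["c", "d"], "u3": ["e", "f"]}
rels = {"u1": {"a", "b"}, "u2": {"d"}, "u3": {"z"}}
# u1 and u2 each count once; u3 has no hit.
print(hit_rate_at_k(recs, rels, k=2))
```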
Mean Average Precision (MAP@k)
Mean of Average Precision (AP) across all users, where AP is the average of precision values at each relevant position in the ranked recommendations. Useful for ranked recommendations where order matters.
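$$\mathrm{AP@}k = \frac{1}{m} \sum_{i=1}^{k} P(i) \cdot \mathrm{rel}(i), \qquad \mathrm{MAP@}k = \frac{1}{|U|} \sum_{u \in U} \mathrm{AP@}k(u)$$

Where:
- $m$ is the number of relevant items for the user
- $P(i)$ is the precision of the top $i$ recommendations
- $\mathrm{rel}(i)$ is an indicator function (1 if the item at position $i$ is relevant, 0 otherwise)
- $U$ is the set of users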
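The AP and MAP definitions above can be sketched as follows. The function names and dictionary inputs are illustrative assumptions, and AP is normalised by the number of relevant items for the user, as in the definition; some implementations normalise by min(m, k) instead.

```python
def average_precision_at_k(recommended, relevant, k):
    """Sum of precision values at each relevant position, divided by the
    number of relevant items for the user. Assumes no duplicate items."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    score = 0.0
    for position, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / position  # precision at this relevant position
    return score / len(relevant)


def map_at_k(recommendations, relevant_items, k):
    """Mean of AP@k over all users that received recommendations."""
    users = list(recommendations)
    if not users:
        return 0.0
    total = sum(
        average_precision_at_k(recommendations[u], relevant_items.get(u, ()), k)
        for u in users
    )
    return total / len(users)


recs = {"u1": ["a", "b", "c", "d"], "u2": ["x", "y"]}
rels = {"u1": {"a", "c"}, "u2": {"y"}}
# u1: (1/1 + 2/3) / 2 = 5/6; u2: (1/2) / 1 = 1/2; mean = 2/3.
print(map_at_k(recs, rels, k=4))
```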
Normalized Discounted Cumulative Gain (NDCG@k)
Measures the quality of ranking by assigning higher weights to relevant items appearing higher in the list and normalizing by the ideal ranking. Penalizes relevant items appearing lower in the list.
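A minimal sketch of NDCG@k, under two assumptions that are common but not the only choice: the gain is the raw graded relevance (some variants use 2^rel - 1), and the discount at position i is log2(i + 1). The `relevance` dictionary of item id to grade is a hypothetical input format.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain: each gain is divided by log2(position + 1)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))


def ndcg_at_k(recommended, relevance, k):
    """DCG of the actual ranking divided by the DCG of the ideal ranking.

    relevance: dict of item id -> graded relevance; missing items count as 0.
    """
    gains = [relevance.get(item, 0.0) for item in recommended]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(gains, k) / ideal


grades = {"a": 3, "b": 2, "c": 1}
print(ndcg_at_k(["a", "b", "c"], grades, k=3))  # ideal order
print(ndcg_at_k(["c", "b", "a"], grades, k=3))  # reversed order scores lower
```

With linear gains the reversed list still scores fairly high, because every item carries some relevance; an exponential gain would penalise the misplacement of the grade-3 item more heavily.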
Mean Reciprocal Rank (MRR@k)
Average of the reciprocal rank of the first relevant item across all users; a user with no relevant item in the top k contributes 0. Useful when the first good recommendation is most important (e.g., search engines).
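As an illustration, the sketch below computes MRR@k over per-user dictionaries (an assumed input layout). A user with no relevant item in the top k contributes 0 to the mean.

```python
def mrr_at_k(recommendations, relevant_items, k):
    """Mean over users of 1 / rank of the first relevant item in the top k."""
    users = list(recommendations)
    if not users:
        return 0.0
    total = 0.0
    for user in users:
        relevant = set(relevant_items.get(user, ()))
        for rank, item in enumerate(recommendations[user][:k], start=1):
            if item in relevant:
                total += 1.0 / rank
                break  # only the first relevant item counts
    return total / len(users)


recs = {"u1": ["a", "b", "c"], "u2": ["d", "e", "f"], "u3": ["g", "h", "i"]}
rels = {"u1": {"a"}, "u2": {"f"}, "u3": {"z"}}
# Reciprocal ranks: 1, 1/3 and 0, so the mean is 4/9.
print(mrr_at_k(recs, rels, k=3))
```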
Diversity
Measures how diverse the recommended items are across various dimensions. Helps prevent the “filter bubble” phenomenon.
Intra-List Diversity (ILD@k) — the average pairwise dissimilarity between items in a recommendation list.
$$\mathrm{ILD@}k = \frac{2}{k(k-1)} \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} d(i, j)$$

Where $d(i, j)$ is the distance or dissimilarity between items $i$ and $j$.
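A sketch of intra-list diversity as the mean distance over all unordered item pairs in the top k. Cosine distance over item feature vectors is an assumption chosen for the example; any dissimilarity function can be passed in, and the `vectors` mapping is hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return (1.0 - dot / norm) if norm else 1.0


def ild_at_k(recommended, vectors, k, distance=cosine_distance):
    """Average pairwise dissimilarity between the top-k items.

    vectors: dict of item id -> feature vector (e.g. genre or embedding).
    """
    pairs = list(combinations(recommended[:k], 2))
    if not pairs:
        return 0.0  # fewer than two items: no pairs to compare
    return sum(distance(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


vectors = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
# Pair distances: (a, b) = 0, (a, c) = 1, (b, c) = 1, so the mean is 2/3.
print(ild_at_k(["a", "b", "c"], vectors, k=3))
```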
Novelty
Measures how unusual or unfamiliar the recommended items are to users. Helps users discover new content beyond popular items.
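The text does not fix a formula for novelty. One common formulation, used here purely as an assumption, is the mean self-information of the recommended items: the negative log2 of each item's popularity, where popularity is the fraction of users who have interacted with the item. The `item_popularity` mapping is a hypothetical input.

```python
import math


def novelty_at_k(recommended, item_popularity, k):
    """Mean self-information (-log2 popularity) of the top-k items.

    item_popularity: dict of item id -> fraction of users who interacted
    with the item, assumed to lie in (0, 1]. Rarer items score higher.
    """
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    return sum(-math.log2(item_popularity[item]) for item in top_k) / len(top_k)


popularity = {"blockbuster": 0.5, "niche": 0.125}
# -log2(0.5) = 1 bit and -log2(0.125) = 3 bits, so the mean is 2 bits.
print(novelty_at_k(["blockbuster", "niche"], popularity, k=2))
```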
Serendipity
Measures how unexpected yet relevant the recommendations are. Aims to delight users with discoveries they wouldn’t have found on their own.
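$$\mathrm{Serendipity@}k = \frac{1}{k} \sum_{i=1}^{k} \mathrm{unexp}(i) \cdot \mathrm{rel}(i)$$

Where:
- $\mathrm{unexp}(i)$ is the unexpectedness of item $i$ (often calculated as dissimilarity from the user’s profile)
- $\mathrm{rel}(i)$ is the relevance of item $i$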
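A minimal sketch that combines the two quantities described above as an average of unexpectedness times relevance over the top-k items. The product form and the precomputed score dictionaries are assumptions for illustration; how unexpectedness is derived from the user profile is left outside the sketch.

```python
def serendipity_at_k(recommended, unexpectedness, relevance, k):
    """Mean of unexpectedness * relevance over the top-k items.

    unexpectedness: dict of item id -> score in [0, 1] (assumed precomputed,
    e.g. dissimilarity from the user's profile).
    relevance: dict of item id -> relevance (1/0 or graded); missing = 0.
    """
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    total = sum(
        unexpectedness.get(item, 0.0) * relevance.get(item, 0.0) for item in top_k
    )
    return total / len(top_k)


unexp = {"a": 0.9, "b": 0.1, "c": 0.8}
rel = {"a": 1, "b": 1, "c": 0}
# "c" is surprising but irrelevant, so it adds nothing: (0.9 + 0.1 + 0) / 3.
print(serendipity_at_k(["a", "b", "c"], unexp, rel, k=3))
```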
Coverage
Item Coverage — the proportion of all available items that are recommended to at least one user. Helps prevent the “long-tail” problem where many items are never recommended.
User Coverage — the proportion of users who receive at least one recommendation.
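Both coverage ratios can be sketched from the same per-user recommendation dictionary. The function names and the `catalog` / `all_users` inputs are assumptions for the example.

```python
def item_coverage(recommendations, catalog):
    """Share of catalog items recommended to at least one user."""
    catalog = set(catalog)
    if not catalog:
        return 0.0
    recommended = set()
    for items in recommendations.values():
        recommended.update(items)
    return len(recommended & catalog) / len(catalog)


def user_coverage(recommendations, all_users):
    """Share of users who receive at least one recommendation."""
    all_users = set(all_users)
    if not all_users:
        return 0.0
    served = {user for user, items in recommendations.items() if items}
    return len(served & all_users) / len(all_users)


recs = {"u1": ["a", "b"], "u2": ["b"], "u3": []}
# Items "a" and "b" out of four are ever recommended.
print(item_coverage(recs, catalog=["a", "b", "c", "d"]))
# u1 and u2 out of four users receive a non-empty list.
print(user_coverage(recs, all_users=["u1", "u2", "u3", "u4"]))
```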
Conversion Rate
Percentage of recommendations that lead to a desired downstream action (e.g., purchase, signup).
Click-Through Rate (CTR)
Ratio of clicks to impressions for recommended items.
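Both online ratios, conversion rate and CTR, reduce to a count of events divided by a count of opportunities. The sketch below assumes the counts have already been aggregated from logs; the function names are illustrative.

```python
def click_through_rate(clicks, impressions):
    """Clicks divided by impressions; 0.0 when nothing was shown."""
    return clicks / impressions if impressions else 0.0


def conversion_rate(conversions, recommendations_shown):
    """Desired actions (e.g. purchases) divided by recommendations shown."""
    return conversions / recommendations_shown if recommendations_shown else 0.0


# 30 clicks and 5 purchases out of 1000 recommended-item impressions.
print(click_through_rate(clicks=30, impressions=1000))
print(conversion_rate(conversions=5, recommendations_shown=1000))
```

Because both values are ratios over logged traffic, they are usually compared between variants of an experiment rather than read in isolation.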
User Satisfaction
Direct measurement of user satisfaction with recommendations, often collected through surveys or feedback mechanisms.