RecSys23 Tutorial

On Challenges of Evaluating Recommender Systems in an Offline Setting

In the past 20 years, the area of Recommender Systems (RecSys) has gained significant attention from both academia and industry. There is no shortage of research papers on various RecSys models or of online systems from industry players. However, when it comes to model evaluation in offline settings, many researchers simply follow the commonly adopted experimental setup without examining the unique characteristics of the RecSys problem. In this tutorial, I will briefly review the commonly adopted evaluations in RecSys and then discuss the challenges of evaluating recommender systems in an offline setting. The main emphasis is on considering the global timeline in the evaluation, particularly when a dataset covers user-item interactions collected over a long time period.

Thanks to all for your kind support at RecSys2023. Here are the slides.

Tutorial Outline

This tutorial is prepared for 90 minutes and targets students who are familiar with recommender systems in general. Hence, it can be considered an intermediate to advanced level tutorial, with a specific focus on RecSys evaluation in an offline setting from an accuracy perspective.

The tutorial will be organized in three parts. The first part reviews commonly used RecSys evaluations. The content for this part will be mainly based on two recent survey papers on RecSys evaluation [1, 16]. Different evaluation objectives and measures will be covered, including measures used in industry such as Click-through Rate (CTR), Conversion Rate (CVR), and Gross Merchandise Value (GMV).
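
As a rough illustration of the industry measures named above, the following minimal sketch computes CTR, CVR, and GMV from a toy impression log. The log, its field names, and the function `online_measures` are hypothetical and not taken from the tutorial. CVR is computed here as purchases over clicks, which is only one of several definitions in use.

```python
# Hypothetical impression log; field names and values are illustrative only.
log = [
    {"clicked": True,  "purchased": True,  "order_value": 30.0},
    {"clicked": True,  "purchased": False, "order_value": 0.0},
    {"clicked": False, "purchased": False, "order_value": 0.0},
    {"clicked": False, "purchased": False, "order_value": 0.0},
]

def online_measures(rows):
    """Compute CTR, CVR, and GMV from a list of impression records."""
    impressions = len(rows)
    clicks = sum(r["clicked"] for r in rows)
    purchases = sum(r["purchased"] for r in rows)
    # CTR: share of impressions that were clicked.
    ctr = clicks / impressions if impressions else 0.0
    # CVR (assumed definition): share of clicks that led to a purchase.
    cvr = purchases / clicks if clicks else 0.0
    # GMV: total value of the purchased orders.
    gmv = sum(r["order_value"] for r in rows if r["purchased"])
    return {"ctr": ctr, "cvr": cvr, "gmv": gmv}

measures = online_measures(log)
print(measures)
```

These measures depend on live user feedback, so they cannot be computed directly from a static offline dataset.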

The second part revisits evaluation in an offline setting, particularly the observation of the global timeline. The key issue here is not which measures or metrics to use, but how these measures are computed from a dataset. We will start with the ill-defined popularity model. Popularity is often considered the simplest recommendation baseline and is widely used for comparison purposes in evaluation. In reality, popularity has a strong temporal aspect. However, in many evaluations reported in academic research, this temporal aspect is overlooked due to various challenges, such as data sparsity. We will use real examples to illustrate how popularity works in reality and how popularity is defined and evaluated in research papers. From the popularity evaluation, we extend the discussion to data leakage and its impact on RecSys evaluation results [10, 13, 17]. Because models are often developed to score better on these measures, an incorrectly conducted evaluation may undermine their effectiveness in practice. After data leakage, we will discuss another potential issue caused by ignoring the timeline in evaluation: the simplification of user preference modeling.
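
To make the popularity and data leakage discussion concrete, here is a minimal sketch that contrasts two ways of computing a popularity baseline. It is not the tutorial's own code. The toy interactions, the cutoff, and the helper names (`split_by_global_time`, `top_k_popular`, `hit_rate`) are all assumptions made for illustration. One baseline counts interactions over the whole dataset, including those that happen after the test point. The other counts only interactions before a global cutoff time.

```python
from collections import Counter

# Hypothetical toy interactions: (user, item, timestamp). Not from the tutorial.
interactions = [
    ("u1", "a", 1), ("u2", "a", 2), ("u3", "a", 3),
    ("u1", "b", 4), ("u2", "b", 5),
    ("u3", "c", 6), ("u4", "c", 7), ("u5", "c", 8), ("u6", "c", 9),
]

def split_by_global_time(rows, cutoff):
    """Train on interactions strictly before the cutoff; test on the rest."""
    train = [r for r in rows if r[2] < cutoff]
    test = [r for r in rows if r[2] >= cutoff]
    return train, test

def top_k_popular(rows, k):
    """Rank items by interaction count; ties are broken by item id."""
    counts = Counter(item for _, item, _ in rows)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]

def hit_rate(recommended, test_rows):
    """Fraction of test interactions whose item is in the recommended list."""
    if not test_rows:
        return 0.0
    hits = sum(1 for _, item, _ in test_rows if item in recommended)
    return hits / len(test_rows)

train, test = split_by_global_time(interactions, cutoff=6)

# Leaky baseline: popularity counted over all data, including the future.
leaky_top = top_k_popular(interactions, k=1)
# Timeline-aware baseline: popularity counted only from the past.
timeline_top = top_k_popular(train, k=1)

print(leaky_top, hit_rate(leaky_top, test))
print(timeline_top, hit_rate(timeline_top, test))
```

On this toy data the leaky baseline recommends item "c", which only becomes popular after the cutoff, and scores a hit rate of 1.0. The timeline-aware baseline recommends "a" and scores 0.0. The gap is exaggerated by construction, but it shows how counting future interactions can inflate an offline result.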

In the last part of the tutorial, there will be a summary of the criticism of RecSys, with the key focus on the evaluation perspective. Although there are many large-scale empirical evaluations [5, 12, 13, 17, 18], questions remain about reproducibility and about technical and theoretical flaws [4, 6]. We will also briefly cover the challenges of evaluating RecSys from different perspectives in offline settings [2, 3, 14]. The tutorial concludes with a fresh look at RecSys evaluation and at how to conduct more meaningful evaluations by considering the global timeline [11].


  1. Bushra Alhijawi, Arafat Awajan, and Salam Fraihat. 2022. Survey on the Objectives of Recommender Systems: Measures, Solutions, Evaluation Methodology, and New Perspectives. ACM Comput. Surv. 55, 5, Article 93 (2022).
  2. Pablo Castells and Alistair Moffat. 2022. Offline recommender system evaluation: Challenges and new directions. AI Magazine 43, 2 (2022), 225–238.
  3. Hung-Hsuan Chen, Chu-An Chung, Hsin-Chien Huang, and Wen Tsui. 2017. Common Pitfalls in Training and Evaluating Recommender Systems. SIGKDD Explorations 19, 1 (2017), 37–45.
  4. Paolo Cremonesi and Dietmar Jannach. 2021. Progress in Recommender Systems Research: Crisis? What Crisis? AI Magazine 42, 3 (Nov. 2021), 43–54.
  5. Maurizio Ferrari Dacrema, Paolo Cremonesi, and Dietmar Jannach. 2019. Are we really making much progress? A worrying analysis of recent neural recommendation approaches. In RecSys. ACM, 101–109.
  6. Maurizio Ferrari Dacrema, Simone Boglio, Paolo Cremonesi, and Dietmar Jannach. 2021. A Troubling Analysis of Reproducibility and Progress in Recommender Systems Research. ACM Trans. Inf. Syst. 39, 2, Article 20 (2021).
  7. Yitong Ji, Aixin Sun, Jie Zhang, and Chenliang Li. 2020. A Re-visit of the Popularity Baseline in Recommender Systems. In SIGIR. ACM, 1749–1752.
  8. Yitong Ji, Aixin Sun, Jie Zhang, and Chenliang Li. 2022. Do Loyal Users Enjoy Better Recommendations? Understanding Recommender Accuracy from a Time Perspective. In ACM SIGIR International Conference on Theory of Information Retrieval (ICTIR) (Madrid, Spain). ACM, 92–97.
  9. Yitong Ji, Aixin Sun, Jie Zhang, and Chenliang Li. 2023. A Critical Study on Data Leakage in Recommender System Offline Evaluation. ACM Trans. Inf. Syst. 41, 3 (2023), 75:1–75:27.
  10. Zaiqiao Meng, Richard McCreadie, Craig Macdonald, and Iadh Ounis. 2020. Exploring Data Splitting Strategies for the Evaluation of Recommendation Models. In RecSys. ACM, 681–686.
  11. Aixin Sun. 2023. Take a Fresh Look at Recommender Systems from an Evaluation Standpoint. In SIGIR. ACM, 2629–2638.
  12. Zhu Sun, Hui Fang, Jie Yang, Xinghua Qu, Hongyang Liu, Di Yu, Yew-Soon Ong, and Jie Zhang. 2023. DaisyRec 2.0: Benchmarking Recommendation for Rigorous Evaluation. IEEE Trans. Pattern Anal. Mach. Intell. 45, 7 (2023), 8206–8226.
  13. Zhu Sun, Di Yu, Hui Fang, Jie Yang, Xinghua Qu, Jie Zhang, and Cong Geng. 2020. Are We Evaluating Rigorously? Benchmarking Recommendation for Reproducible Evaluation and Fair Comparison. In RecSys. ACM, 23–32.
  14. Yan-Martin Tamm, Rinchin Damdinov, and Alexey Vasilev. 2021. Quality Metrics in Recommender Systems: Do We Calculate Metrics Consistently? In RecSys (Amsterdam, Netherlands). ACM, 708–713.
  15. Robin Verachtert, Lien Michiels, and Bart Goethals. 2022. Are We Forgetting Something? Correctly Evaluate a Recommender System With an Optimal Training Window. In Perspectives on the Evaluation of Recommender Systems Workshop (PERSPECTIVES) at RecSys22. Seattle, WA, USA.
  16. Eva Zangerle and Christine Bauer. 2022. Evaluating Recommender Systems: Survey and Framework. ACM Comput. Surv. 55, 8, Article 170 (Dec. 2022), 38 pages.
  17. Wayne Xin Zhao, Zihan Lin, Zhichao Feng, Pengfei Wang, and Ji-Rong Wen. 2022. A Revisiting Study of Appropriate Offline Evaluation for Top-N Recommendation Algorithms. ACM Trans. Inf. Syst. 41, 2, Article 32 (Dec. 2022), 41 pages.
  18. Jieming Zhu, Quanyu Dai, Liangcai Su, Rong Ma, Jinyang Liu, Guohao Cai, Xi Xiao, and Rui Zhang. 2022. BARS: Towards Open Benchmarking for Recommender Systems. In SIGIR. ACM, 2912–2923.

Further readings

An Unusual List of Recommended Reading on RecSys