Analyzing AI Evaluation Benchmarks Through Information Retrieval and Network Science

Abstract

Many analyses have been performed on Information Retrieval (IR) evaluation benchmarks. Benchmarking also plays a central role in evaluating the capabilities of Large Language Models (LLMs). In this paper, we apply an IR approach to LLM evaluation. Adapting a method developed for TREC test collections, we analyze LLM benchmark results through the lens of network science. We construct a bipartite graph between models and benchmark questions and apply Kleinberg’s HITS algorithm to uncover latent structure in the evaluation data. In this framework, model hubness quantifies a model’s tendency to perform well on easy questions, while question hubness captures a question’s ability to discriminate between more and less effective models. We conduct experiments on seven multiple-choice QA benchmarks with a pool of 34 LLMs. Through this IR-inspired approach, we show that the ranking of models on leaderboards is strongly influenced by subsets of easy questions.
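To make the bipartite construction concrete, here is a minimal sketch (not the paper's code) of HITS-style power iteration on a toy model-question graph. The models, questions, and correctness data are all invented for illustration; an edge connects a model to each question it answers correctly, so models play the hub role and questions the authority role in this direction.

```python
# Illustrative sketch: HITS power iteration on a toy bipartite graph
# between models and benchmark questions (all data is hypothetical).

# Which questions each model answers correctly (invented toy data).
correct = {
    "model_A": {"q1", "q2", "q3"},
    "model_B": {"q1", "q2"},
    "model_C": {"q1"},
}
questions = sorted({q for qs in correct.values() for q in qs})

# Initialize all scores uniformly, then iterate the mutual update:
# question authority = sum of hub scores of models answering it;
# model hub score = sum of authority scores of its correct questions.
hubs = {m: 1.0 for m in correct}
auths = {q: 1.0 for q in questions}
for _ in range(100):
    auths = {q: sum(hubs[m] for m in correct if q in correct[m])
             for q in questions}
    norm = sum(auths.values())
    auths = {q: v / norm for q, v in auths.items()}
    hubs = {m: sum(auths[q] for q in correct[m]) for m in correct}
    norm = sum(hubs.values())
    hubs = {m: v / norm for m, v in hubs.items()}

# A high model hub score means the model's correct answers concentrate
# on questions that many other models also solve, i.e. easy questions.
ranking = sorted(hubs, key=hubs.get, reverse=True)
print(ranking)
```

In this toy example, model_A ranks first on hub score because its correct answers include q1, the question every model solves; reversing the edge direction would instead surface question hubness, the discriminative-power signal described in the abstract.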

Publication
Advances in Information Retrieval: 48th European Conference on Information Retrieval (ECIR 2026), Lecture Notes in Computer Science, vol. 16484, Springer, Cham. Conference Rank: CORE A; GGS A-.