Many analyses have been performed on Information Retrieval (IR) evaluation benchmarks. Benchmarking also plays a central role in evaluating the capabilities of Large Language Models (LLMs). In this paper, we apply an IR approach to LLM evaluation. Adapting a method originally developed for TREC test collections, we analyze LLM benchmark results through the lens of network science. We construct a bipartite graph between models and benchmark questions and apply Kleinberg’s HITS algorithm to uncover latent structure in the evaluation data. In this framework, model hubness quantifies a model’s tendency to perform well on easy questions, while question hubness captures a question’s ability to discriminate between more and less effective models. We conduct experiments on seven multiple-choice QA benchmarks with a pool of 34 LLMs. Through this IR-inspired approach, we show that the ranking of models on leaderboards is strongly influenced by subsets of easy questions.
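To make the setup concrete, the following is a minimal sketch of the kind of analysis described above, not the paper’s exact pipeline: it builds a directed bipartite graph with an edge from a model to each question it answers correctly (the toy `results` matrix is hypothetical) and runs Kleinberg’s HITS via networkx, so that models receive hub scores and questions receive authority scores. The question-hubness analysis mentioned in the abstract would use the analogous computation with edge directions reversed.

```python
import networkx as nx

# Hypothetical correctness matrix: results[model][question] = True if answered correctly.
results = {
    "model_a": {"q1": True,  "q2": True,  "q3": False},
    "model_b": {"q1": True,  "q2": False, "q3": False},
    "model_c": {"q1": True,  "q2": True,  "q3": True},
}

# Directed bipartite graph: an edge model -> question for each correct answer.
G = nx.DiGraph()
for model, answers in results.items():
    for question, correct in answers.items():
        if correct:
            G.add_edge(model, question)

# Kleinberg's HITS: hubs point to many high-authority nodes, authorities are
# pointed to by many high-hub nodes. Here models act as hubs, questions as authorities.
hubs, authorities = nx.hits(G, max_iter=1000, normalized=True)

model_hub_scores = {n: s for n, s in hubs.items() if n in results}
question_auth_scores = {n: s for n, s in authorities.items() if n not in results}

print(sorted(model_hub_scores.items(), key=lambda kv: -kv[1]))
print(sorted(question_auth_scores.items(), key=lambda kv: -kv[1]))
```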