Traditionally, relevance judgments have relied on human annotators, but recent advances in Large Language Models (LLMs) have prompted growing interest in using them as a proxy for human judges. In this setting, a key yet underexplored factor is the choice of relevance scale. Scales range from binary to fine-grained, and it remains unclear how this choice affects the effectiveness of LLM-based judgments, what effects scale conversions have, and what role scales play in the presence of potential data contamination. We systematically investigate how different scales, and conversions between them, affect LLMs’ ability to provide reliable relevance judgments across multiple prompting strategies and model sizes. Using a popular TREC collection, we compare model outputs with both crowd and expert annotations, analyzing alignment, stability, and signs of potential data contamination.