Traditionally, relevance judgments have relied on human annotators, but recent advances in Large Language Models (LLMs) have prompted growing interest in using them as a proxy for human judges. In this setting, a key yet underexplored factor is the choice of relevance scale. Scales range from binary to fine-grained, and it remains unclear how this choice affects the effectiveness of LLM-based judgments, what effects scale conversions have, and what role scales play in the presence of potential data contamination. We systematically investigate how different scales, and conversions between them, affect LLMs’ ability to provide reliable relevance judgments across multiple prompting strategies and model sizes. Using a popular TREC collection, we compare model outputs with both crowd and expert annotations, analyzing alignment, stability, and signs of potential data contamination.