synthetic datasets

Large Language Models as Assessors: On the Impact of Relevance Scales

Traditionally, relevance judgments have relied on human annotators, but recent advances in Large Language Models (LLMs) have prompted growing interest in their use as a proxy for relevance judgments. In this setting, a key yet underexplored factor is …

PILs of Knowledge: A Synthetic Benchmark for Evaluating Question Answering Systems in Healthcare

Patient Information Leaflets (PILs) provide essential information about medication usage, side effects, precautions, and interactions, making them a valuable resource for Question Answering (QA) systems in healthcare. However, no dedicated benchmark …