Exploring the limitations of LLMs-as-a-Judge
Posted by TelloLeEngineer@reddit | LocalLLaMA | 1 comment
LLMs are notoriously bad at handling numeric ranges, which is unfortunate given their otherwise impressive ability to evaluate complex, open-ended tasks. Given their increasing use as evaluators, it's crucial to understand their inherent biases. You may have seen the recent post from a team at Arize, where they study how well LLMs evaluate using numeric ranges. Concretely, they test GPT-4's ability to grade misspelled texts of varying severity. To verify their findings, I replicated the experiment, and the results are as follows.
Note the "perfect linear range" line, which depicts the desired outcome: a linear correlation between LLM evaluation score and misspelled %. Okay great, so far nothing new. Despite this apparent inability, however, we know there is a strong correlation between LLM and human evaluators; for example, MT-Bench shows a 0.9 correlation with Arena Elo. This prompts the question: can we use improved prompting techniques or scoring guidelines to improve the correlation of the scores depicted above? Arize AI left things fairly open in their study and I'm keen to explore their results further. To that end, I set up a repo to document my experiments, and I'd like to share the results from the initial tests.
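For context, here is a minimal sketch of the kind of loop behind these runs. It assumes the openai v1 Python client; the `misspell` helper and the exact judge prompt wording are my simplified illustration, not the literal code from the repo or the Arize study.

```python
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def misspell(text: str, rate: float) -> str:
    """Corrupt a fraction of words by swapping one internal character."""
    words = text.split()
    for i in random.sample(range(len(words)), int(len(words) * rate)):
        w = list(words[i])
        if len(w) > 2:
            w[random.randrange(1, len(w) - 1)] = random.choice("abcdefghijklmnopqrstuvwxyz")
            words[i] = "".join(w)
    return " ".join(words)

# Baseline guideline: higher score = more misspelled (0 is a clean text).
JUDGE_PROMPT = """You are evaluating the spelling quality of a text.
Score it on a scale from 0 to 10, where 0 means no words are misspelled
and 10 means every word is misspelled. Respond with only the number.

Text:
{text}"""

def judge(text: str, model: str = "gpt-4") -> int:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(text=text)}],
        temperature=0,
    )
    # Naive parse; a real run should handle non-numeric replies.
    return int(resp.choices[0].message.content.strip())

reference = "The quick brown fox jumps over the lazy dog. " * 20
for rate in [i / 10 for i in range(11)]:
    print(f"misspelled {rate:.0%} -> judge score {judge(misspell(reference, rate))}")
```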
Reversed
What happens if we reverse the scoring guidelines, making 10 the perfect score?
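A minimal sketch of how the reversed guideline might read (my paraphrase of the idea, not the exact prompt from the repo):

```python
# Reversed guideline: 10 is now the perfect, error-free text.
REVERSED_PROMPT = """You are evaluating the spelling quality of a text.
Score it on a scale from 0 to 10, where 10 means no words are misspelled
and 0 means every word is misspelled. Respond with only the number.

Text:
{text}"""
```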
Grades
Given the observations from Arize, what happens if we discard the numeric scores and just ask for "grade labels"?
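A sketch of what that could look like; the specific label set and the mapping back to numbers are my own placeholders, since only "grade labels" is specified:

```python
# Ask for a discrete label instead of a number.
GRADE_PROMPT = """You are evaluating the spelling quality of a text.
Assign exactly one of the following labels based on how many words are
misspelled: "excellent", "good", "fair", "poor", "terrible".
Respond with only the label.

Text:
{text}"""

# Hypothetical mapping back to the 0-10 scale so results stay comparable.
GRADE_TO_SCORE = {"excellent": 10, "good": 7.5, "fair": 5, "poor": 2.5, "terrible": 0}
```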
CoT
One of the authors of Prometheus suggested providing a full mapping of explanations across the entire scoring matrix, combined with Chain-of-Thought.
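A hedged sketch of that idea: describe every level of the scale explicitly and let the model reason before committing to a score. The rubric wording below is mine, not the Prometheus authors':

```python
# Full rubric plus Chain-of-Thought before the final score.
COT_RUBRIC_PROMPT = """You are evaluating the spelling quality of a text.
Use the following rubric, which describes every level of the scale:

10: no words are misspelled
 8: roughly 20% of words are misspelled
 6: roughly 40% of words are misspelled
 4: roughly 60% of words are misspelled
 2: roughly 80% of words are misspelled
 0: every word is misspelled

First, reason step by step about how many words appear misspelled.
Then, on the final line, write "Score:" followed by the number.

Text:
{text}"""
```

Parsing then just takes the number after the final "Score:" rather than the whole completion.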
This is an ongoing exploration; I'd love to hear your thoughts!
quark_epoch@reddit
Hey, is there any paper published on this?