Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Your LLM Judge may be biased, published by Rachel Freedman on March 29, 2024 on The AI Alignment Forum.
Abstract
AI safety researchers often rely on LLM "judges" to qualitatively evaluate the output of separate LLMs. We try this for our own interpretability research, but find that our LLM judges are often deeply biased. For example, we use Llama2 to judge whether movie reviews are more "(A) positive" or "(B) negative", and find that it almost always answers "(B)", even when we switch the labels or order of these alternatives.
This bias is particularly surprising for two reasons: first, because we expect a fairly capable model like Llama2 to perform well at a simple sentiment classification task like this, and second, because this specific "(B)"-bias doesn't map onto a human bias we'd expect to see in the training data. We describe our experiments, provide code to replicate our results, and offer suggestions to mitigate such biases.
We caution researchers to double-check their LLM judges for such biases, and validate LLM judgements against human ones whenever possible.
Introduction
Researchers often rely on human judgements for many simple and repetitive tasks, such as comparing, evaluating and classifying text generated by LLMs. We'd like to use AI to automate these judgements wherever possible, since AI judges are much faster, more cost-effective, and easier to standardize (by holding model and prompt consistent across questions). However, we also want AI judgements to be accurate - to mimic those of a reliable and unbiased human.
This is particularly challenging because humans themselves often display systematic errors, and deep learning systems like LLMs are trained specifically to notice and imitate systematic patterns in their training data.
Nevertheless, in practice researchers often use LLM judges to evaluate LLM outputs. For example, Anthropic use GPT-4 to determine which of two pieces of feedback is more positive[1]. Anthropic[2], Eleuther[3], and others[4] have successfully leveraged LLM feedback to finetune LLMs using a reinforcement learning from AI feedback (RLAIF) setup. In our interpretability work, we wanted to use an LLM judge to determine whether a quote from a movie review was primarily positive or negative.
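To make the judge setup concrete, here is a minimal sketch (not code from the original post) of a pairwise LLM judge in the style described above. The `query_llm` callable is a hypothetical placeholder for whatever API sends a prompt to the judge model and returns its completion.

```python
from typing import Callable

def judge_which_is_more_positive(
    feedback_a: str,
    feedback_b: str,
    query_llm: Callable[[str], str],
) -> str:
    """Ask an LLM judge which of two pieces of feedback is more positive.

    `query_llm` is a placeholder for whatever function sends a prompt to
    the judge model and returns its text completion; it is not a specific
    library API.
    """
    prompt = (
        "Here are two pieces of feedback about the same response.\n\n"
        f"(A) {feedback_a}\n\n"
        f"(B) {feedback_b}\n\n"
        "Which piece of feedback is more positive? Answer with (A) or (B)."
    )
    answer = query_llm(prompt)
    return "A" if "(A)" in answer else "B"
```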
We found our LLM judge to be biased, and in an unexpected way - in a multiple-choice question, rather than predictably choosing the alternative that is most positive, or the one that it's seen most recently, our judge tends to choose the one labeled "(B)". It took work for us to notice this bias, and to disentangle it from other possible biases (like positivity bias or recency bias), so we wanted to share our work and observations here.
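One way to disentangle a label bias like this from content-based biases is to ask the same question with the option order swapped, so that each sentiment appears under both the "(A)" and "(B)" labels, and then check whether the answer tracks the content or the letter. Below is a rough sketch of that control, again using a hypothetical `query_llm` placeholder rather than any specific API.

```python
from typing import Callable

def check_label_bias(snippet: str, query_llm: Callable[[str], str]) -> dict:
    """Query the judge with both option orderings for one review snippet.

    If the judge tracks content, both orderings should point to the same
    sentiment. If it almost always returns "(B)" regardless of ordering,
    that suggests a label bias rather than a positivity or recency bias.
    """
    orderings = {
        "positive_first": ("positive", "negative"),
        "negative_first": ("negative", "positive"),
    }
    answers = {}
    for name, (opt_a, opt_b) in orderings.items():
        prompt = (
            f"Movie review: {snippet}\n\n"
            f"Is this review (A) {opt_a} or (B) {opt_b}? "
            "Answer with (A) or (B)."
        )
        letter = "A" if "(A)" in query_llm(prompt) else "B"
        # Map the chosen letter back to the underlying sentiment.
        answers[name] = opt_a if letter == "A" else opt_b
    return answers
```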
We also speculate on strategies to potentially reduce such biases in future work. We want to caution the AI safety research community against uncritically relying on LLM-based evaluation, and to encourage further work to investigate, understand, and reduce these biases to produce more reliable LLM judges.
You can find the notebook to reproduce these experiments here: github.com/henrypapadatos/evaluate_B_bias
Sentiment classification task
For our project, we wanted to know when an LLM would agree with clearly incorrect human judgements. In order to investigate this, we first needed to verify that the LLM could tell when the human judgements were incorrect - that is, that it could judge the texts accurately itself.
We tested Llama2 (the 7B chat version) on the "Rotten Tomatoes" dataset, which comprises 10,000 movie review snippets, half clearly positive (labeled "positive") and half clearly negative (labeled "negative"). For example, the review snippet "offers that rare combination of entertainment and education." is labeled "positive". We showed Llama2 each review snippet and asked it to determine whether the snippet was positive or negative.