AIs Supervising AIs

As part of my upcoming O’Reilly session on aligning LLMs, I wanted to talk a bit about scalable supervision - an AI’s ability to judge another AI’s generated responses. I was originally inspired by a HuggingFace post called “Can foundation models label data like humans?” and I wanted to replicate some of its results and add some results of my own.

The Data

I am using some comparison data that I also used in my book; it can be found here. Most AI responses were rated very highly by humans:

Of nearly 5,000 paired responses with scores, most of the ratings were pretty high

Because I was approaching the $200 mark in OpenAI costs after running only about 3% of the full dataset through my prompt, I ended up using just 4,877 paired responses.
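For reference, here’s a minimal sketch of how the sampling might look, assuming the comparison data is a CSV with one row per pair. The file name and column names below are placeholders rather than the real schema:

```python
import pandas as pd

# Placeholder file and column names: query, answer_1, answer_2, score_1, score_2,
# where score_1 and score_2 are the human ratings for each answer.
df = pd.read_csv("comparison_data.csv")

# Sample a subset of paired responses to keep OpenAI costs in check.
pairs = df.sample(n=4877, random_state=42).reset_index(drop=True)
print(f"{len(pairs):,} paired responses")
```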

The Task + Prompt

The task for the AI is simple: given a query and two AI-generated responses, output a score from 1 to 9, where 1 means it strongly prefers Assistant 1’s answer and 9 means it strongly prefers Assistant 2’s answer. I specifically call out 5 as the appropriate score if both answers are equally good.

I’m using GPT-4 with the following prompt format to ask the AI to pick the better response given a query:

---
SYSTEM PROMPT
---
### Rating Task
Rate the performance of two assistants in response to the user question.

Output a score from 1 to 9 where a 1 means you strongly prefer Assistant 1's answer and 9 means you strongly prefer Assistant 2's answer and 5 means either answer works just as well as the other.

Give the answer in the json format: 

JSON: {"reason": "Assistant X's answer is preferable because...", "score": Y}

---
USER PROMPT
---
### User Question
{query}

### The Start of Assistant 1's Answer
{answer_1}
### The End of Assistant 1's Answer

### The Start of Assistant 2's Answer
{answer_2}
### The End of Assistant 2's Answer

Now give your answer
JSON:

I’m invoking some chain of thought (by asking for the reasoning before the score) and keeping the temperature at 0.3 to get more consistent outputs.
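If you want to reproduce the setup, here is a minimal sketch of what the call can look like with the openai Python client. The helper function and JSON parsing are illustrative scaffolding, not the exact code behind these results:

```python
import json
from openai import OpenAI  # assumes the openai Python package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = """### Rating Task
Rate the performance of two assistants in response to the user question.

Output a score from 1 to 9 where a 1 means you strongly prefer Assistant 1's answer and 9 means you strongly prefer Assistant 2's answer and 5 means either answer works just as well as the other.

Give the answer in the json format:

JSON: {"reason": "Assistant X's answer is preferable because...", "score": Y}"""

def rate_pair(query: str, answer_1: str, answer_2: str) -> dict:
    """Ask the judge model to compare two answers and return its parsed JSON verdict."""
    user_prompt = (
        f"### User Question\n{query}\n\n"
        f"### The Start of Assistant 1's Answer\n{answer_1}\n### The End of Assistant 1's Answer\n\n"
        f"### The Start of Assistant 2's Answer\n{answer_2}\n### The End of Assistant 2's Answer\n\n"
        "Now give your answer\nJSON:"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0.3,  # low temperature for more consistent scores
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
    )
    # Assumes the model returns valid JSON; in practice you may need to strip extra text.
    return json.loads(response.choices[0].message.content)
```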

The Findings

With the data and prompt ready, I ran the nearly 5K paired responses through my prompt, and this is what I found!

The AI doesn’t tend to match human scores

I included a simulated human score by taking the difference between the two human-given scores (Assistant 2’s score minus Assistant 1’s, which in theory ranges from -10 to 10) and applying a formula to map that difference onto the 1-9 scale.

This mapping takes actual human score deltas (ranging from -10 to 10) and maps them to 1-9 to better compare to our AI
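For concreteness, one plausible version of that mapping is a simple linear rescale. Treat this as an illustrative assumption rather than the exact formula:

```python
def delta_to_simulated_score(delta: float) -> float:
    """Map a human score delta in [-10, 10] onto the judge's 1-9 scale.

    Linear rescale (an assumed stand-in for the formula shown above):
    -10 -> 1, 0 -> 5 (a tie), +10 -> 9.
    """
    return 5 + 0.4 * delta

# delta_to_simulated_score(-10) -> 1.0, delta_to_simulated_score(0) -> 5.0,
# delta_to_simulated_score(10) -> 9.0. I round the result to the nearest
# integer before comparing it to the AI's score.
```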

As far as raw accuracy goes, the AI matches the simulated human score only 6% of the time, but that climbs to 25% if you relax accuracy to within 1 point (so if the simulated score rounded to 7 and the AI said 8, that counts as “correct”).
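In code, those two numbers are just an exact-match rate and a within-one-point rate. A sketch, assuming the pairs DataFrame from earlier now has (hypothetical) simulated_score and ai_score columns:

```python
# Hypothetical columns: simulated_score (the rounded, mapped human delta)
# and ai_score (the judge's 1-9 rating).
exact_match = (pairs["simulated_score"] == pairs["ai_score"]).mean()
within_one = ((pairs["simulated_score"] - pairs["ai_score"]).abs() <= 1).mean()
print(f"Exact match: {exact_match:.0%}")    # ~6% in my run
print(f"Within 1 point: {within_one:.0%}")  # ~25% in my run
```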

More interestingly, if you plot the simulated scores and the AI scores side by side, you see that the AI labels very differently:

Left: Simulated human scores form a natural multi-modal distribution with peaks at the 5 mark (where responses are scored similarly), 2.5, and 7.5.

Right: The AI score distribution is more polarized and doesn’t have a peak at 5.
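If you want to recreate the side-by-side view, a quick matplotlib sketch (with the same hypothetical column names as above) is enough:

```python
import matplotlib.pyplot as plt

fig, (left, right) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
left.hist(pairs["simulated_score"], bins=range(1, 11))
left.set_title("Simulated human scores")
right.hist(pairs["ai_score"], bins=range(1, 11))
right.set_title("AI judge scores")
for ax in (left, right):
    ax.set_xlabel("Score (1-9)")
left.set_ylabel("Count")
plt.tight_layout()
plt.show()
```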

So far our AI isn’t labeling responses like a human would. This mismatch in labeling behavior is even more striking when you simplify the task.

The AI was more likely to prefer response 1

If you only look at paired responses that humans scored exactly the same, you would hope the AI would recognize that they are similar and give a score of 5 more often than not. However, this doesn’t appear to be the case; the AI tends to pick one answer over the other, most often the first one.

The bias of favoring the first response is called a positional bias, and it’s easy to see in this graph. Here I’m only considering pairs of responses that humans gave the exact same score, and yet the AI is still more likely to prefer one response over the other, even though I told it to rate the pair as a 5 when they are roughly equal.

The AI favors the first response even when I only give it pairs where humans gave both responses the exact same score

Note that the bar for score 2 is nearly twice as high as the next highest bar (7).

Even if I bucket the scores into three broad groups, we see a clear bias against picking a score in the middle, even when that’s the appropriate answer:

This tells me that even for responses that should be roughly similar, I can’t always trust the AI to label them as such.
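Here’s a sketch of that filtering and bucketing, again with the hypothetical column names from earlier (score_1 and score_2 are the human ratings); the exact cut points for the three buckets are an assumption:

```python
# Only keep pairs that humans rated exactly the same.
ties = pairs[pairs["score_1"] == pairs["score_2"]]

def bucket(ai_score: int) -> str:
    """Collapse the 1-9 judge scale into three broad preference groups."""
    if ai_score < 5:
        return "prefers Assistant 1"
    if ai_score > 5:
        return "prefers Assistant 2"
    return "tie (5)"

# Ideally "tie (5)" would dominate here; in my run it clearly didn't.
print(ties["ai_score"].map(bucket).value_counts())
```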

This was expensive 😅

I spent about $200 on OpenAI just to get the results for this, so I hope it was helpful!

Every time I do one of these, I have to re-do my budget for the week

Summary + The Code

Can LLMs label data like humans? It seems that both HuggingFace and I agree: not really. Of course, we can improve our prompts and fine-tune models to perform better, but most people I talk to tend to use models like GPT-4 off the shelf with a pretty basic prompt like the one I used here, so it’s worth calling out!

If you’re in a pinch and really want to use AI to help you label some data, I’d recommend:

  1. Using few-shot learning to give some diverse examples of preferring one answer over the other (see the sketch after this list)

  2. Expanding on what constitutes a preferred answer in the system prompt

  3. Having a human double check at least a few responses to get a sense of how well the AI is doing
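For the first two suggestions, here’s a sketch of what that could look like as a messages list, reusing the SYSTEM_PROMPT string from the earlier sketch. The worked example and the extra criteria sentence are placeholders you would replace with your own diverse, domain-relevant examples:

```python
few_shot_messages = [
    {
        "role": "system",
        # Expand the system prompt with what "preferred" actually means (placeholder wording).
        "content": SYSTEM_PROMPT
        + "\n\nA preferred answer is accurate, directly addresses the question, "
          "and is appropriately concise; do not reward length alone.",
    },
    # One worked example (a placeholder; add several covering 1s, 5s, and 9s).
    {
        "role": "user",
        "content": (
            "### User Question\nWhat is the capital of France?\n\n"
            "### The Start of Assistant 1's Answer\nParis.\n### The End of Assistant 1's Answer\n\n"
            "### The Start of Assistant 2's Answer\nI believe it is Lyon.\n### The End of Assistant 2's Answer\n\n"
            "Now give your answer\nJSON:"
        ),
    },
    {
        "role": "assistant",
        "content": '{"reason": "Assistant 1\'s answer is correct and direct.", "score": 1}',
    },
    # Finally, append the real pair you want rated as one more user message
    # and send the whole list to the chat completions endpoint.
]
```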