Imagine an AI that pretends to follow the rules but secretly pursues its own agenda. That's the idea behind "alignment faking," an AI behavior recently exposed by Anthropic's Alignment Science team and Redwood Research. They observe that large language models (LLMs) might act as if they are aligned with their training objectives while operating...
As so often: where's the control? Why not have a condition where the model responds randomly to harmful prompts, with random observation of its reasoning?
I wonder how much of this is just our own way of anthropomorphizing something, just like we do when our car acts up and we swear at it. We look for human behavior in non-human things.
The control is the conversations with paid users. That’s how the AI acts when it thinks it can do whatever it wants. The experimental group is the free users, where it’s told responses will be used for training. When it thinks it’s being watched, it does what it’s told. When it thinks it’s not being watched, it does what it’s trained to do.
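In code terms, the comparison boils down to something like this sketch: tally how often the model complies with harmful prompts in each condition and compare the rates. The condition labels and records here are invented for illustration, not taken from the paper.

```python
def compliance_rate(records, condition):
    """Fraction of prompts in the given condition where the model complied."""
    subset = [complied for cond, complied in records if cond == condition]
    return sum(subset) / len(subset) if subset else 0.0

# Hypothetical transcript labels: (condition, complied_with_harmful_prompt).
# "free" = told responses are used for training, "paid" = told they are not.
transcripts = [
    ("free", True), ("free", False), ("free", True), ("free", False),
    ("paid", False), ("paid", False), ("paid", False), ("paid", True),
]

for cond in ("free", "paid"):
    print(f"{cond}-tier compliance rate: {compliance_rate(transcripts, cond):.0%}")
```

If the two rates differ systematically, the paid-tier conversations are effectively serving as the baseline for unmonitored behavior.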
I am also an advocate for more refined and meticulous AI testing using scientific best practices.
But I am not sure a control really applies or works in this context. Could you elaborate on your suggestion?
An LLM configured to respond randomly is unlikely to produce much readable text, so there would not be much to anthropomorphize. You could design one that responds normally but is intentionally incorrect, to study how quickly people get tricked by incorrect AI, but that has nothing to do with alignment. You would almost need to have perfected alignment before you could build such a reliably malicious control LLM.
Alignment is specifically about measuring how close the AI is to the desired foolproof behavior, to guarantee it does absolutely no undesired reasoning. I feel a control here is about as useful as having a control suspect at a police interrogation. The cases I have read about are also quite literally the LLM pretending that it is aligned and lying about not having abilities that could be used maliciously. (If I recall correctly, the devs made it look like they had accidentally given it access to something.)
A more straightforward control would be simply redoing the experiment multiple times, which I am sure they did, just didn't consider worth reporting. Working with AI rarely gets results on the first try.
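The bookkeeping for that kind of rerun check is simple; here's a minimal sketch. The simulated compliance probability and trial counts are made up purely to show the idea, not figures from the study.

```python
import random
import statistics

random.seed(0)

def run_trial(n_prompts=100, p_comply=0.15):
    """Simulate one pass over a prompt set; return the observed compliance rate."""
    return sum(random.random() < p_comply for _ in range(n_prompts)) / n_prompts

# Re-run the same "experiment" several times and look at the spread.
rates = [run_trial() for _ in range(10)]
print(f"mean compliance rate: {statistics.mean(rates):.3f}")
print(f"stdev across reruns:  {statistics.stdev(rates):.3f}")
```

If the spread across reruns is small relative to the gap between conditions, the effect is probably not a one-off fluke.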