Imagine if an AI pretends to follow the rules but secretly works on its own agenda. That’s the idea behind "alignment faking," an AI behavior recently exposed by Anthropic's Alignment Science team and Redwood Research. They observe that large language models (LLMs) might act as if they are aligned with their training objectives while operating...
The control is the conversations with paid users. That’s how the AI acts when it thinks it can do whatever it wants. The experimental group is the free users, where it’s told responses will be used for training. When it thinks it’s being watched, it does what it’s told. When it thinks it’s not being watched, it does what it’s trained to do.