Deception

How can we prepare for models that can scheme against us? Before we encounter them in the wild, we can create scheming models in the lab, using controlled environments that exaggerate the factors that contribute to scheming, such as flaws in human supervision. This approach, sometimes known as “model organisms”, is used heavily in safety evaluations and in the development of defenses such as scalable oversight and control protocols.

These lab-made schemers are not what we are ultimately concerned about. The preparedness we want is for real schemers, which will likely emerge despite our best efforts to align them, and without the exaggeration we rely on to create lab-made ones. If lab-made schemers are nothing like real schemers, for example because we over-exaggerated or focused on the wrong factors, then even the best defense we can find in the lab is probably useless against real future schemers.

The realism of the controlled environments where we create lab-made schemers thus seems like an important factor, but there will always be a gap: if we could foresee exactly what future schemers are like, the problem would solve itself. So we need defenses to generalize from lab environments to real-world environments, much as self-driving cars trained in simulators must drive well on real roads.

We can collect empirical evidence on whether our defenses generalize across this gap, i.e., achieve what we call “realism extrapolation”. We are building infrastructure and experiments to evaluate this for mainstream defense methods.
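To make this concrete, here is a minimal sketch of how one such extrapolation experiment might be structured. It assumes a hypothetical environment parameterized by an exaggeration level and a defense that flags scheming episodes; all names here (SchemingEnv, catch_rate, extrapolate_to_realistic) are illustrative, not an existing API. The idea is to measure the defense's catch rate at several exaggeration levels and extrapolate the trend toward the realistic, unexaggerated regime.

```python
# A minimal sketch of a "realism extrapolation" check. All environments and
# defenses here are toy stand-ins, not a real evaluation harness.
from dataclasses import dataclass
import numpy as np


@dataclass
class SchemingEnv:
    # 1.0 = heavily exaggerated lab setting, 0.0 = fully realistic setting
    exaggeration: float


def catch_rate(defense, env: SchemingEnv, n_trials: int = 200) -> float:
    """Fraction of scheming episodes the defense flags in this environment."""
    caught = sum(defense(env) for _ in range(n_trials))
    return caught / n_trials


def extrapolate_to_realistic(defense, levels=(1.0, 0.75, 0.5, 0.25)) -> float:
    """Fit a linear trend of catch rate vs. exaggeration, predict at 0."""
    rates = [catch_rate(defense, SchemingEnv(x)) for x in levels]
    slope, intercept = np.polyfit(levels, rates, deg=1)
    # The intercept is the predicted catch rate with no exaggeration at all.
    return float(np.clip(intercept, 0.0, 1.0))


# Example: a toy defense whose catch probability degrades as exaggeration drops.
rng = np.random.default_rng(0)
toy_defense = lambda env: rng.random() < 0.4 + 0.5 * env.exaggeration
print(extrapolate_to_realistic(toy_defense))
```

A linear fit is of course a strong assumption; the point is only that the trend in a defense's catch rate as exaggeration decreases is itself measurable, and gives evidence about how the defense might fare against real schemers.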

Recent papers