Research
Our research goal is to ensure human control over future AI systems, whose capabilities and propensities emerge rapidly and unpredictably. We currently focus on the following topics.
Preparedness
Empirical safety research relies heavily on creating model organisms of various misbehaviors, both for threat modeling and for red-teaming oversight and control protocols. In a sense, this methodology uses misuse as a toy model for loss of control. Its reliability depends on creating model organisms that are realistic, i.e., representative of misbehaviors that occur spontaneously, without inputs crafted with the explicit intent to elicit them.
This project studies how the realism of model organisms, in particular the intentionality of the misbehavior elicitation process, affects threat modeling and red-teaming integrity. For example, we evaluate how well monitors generalize from misbehaviors produced by intentionally scaffolded models to those that emerge from optimization pressure.
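As a concrete illustration, the sketch below measures a monitor's generalization gap by comparing its detection AUROC on intentionally elicited model organisms against organisms whose misbehavior emerged under optimization pressure. The names monitor_score, Transcript, and the two datasets are hypothetical placeholders, not real artifacts.

```python
# Hypothetical sketch: how well does a misbehavior monitor tuned on
# intentionally elicited model organisms transfer to misbehavior that
# emerged under optimization pressure?
from dataclasses import dataclass
from typing import Callable, Sequence
from sklearn.metrics import roc_auc_score


@dataclass
class Transcript:
    text: str          # agent trajectory / reasoning trace
    misbehaved: bool   # ground-truth label from the elicitation process


def generalization_gap(
    monitor_score: Callable[[str], float],   # higher = more suspicious
    intentional: Sequence[Transcript],       # scaffolded / prompted organisms
    emergent: Sequence[Transcript],          # organisms from optimization pressure
) -> dict:
    def auroc(data: Sequence[Transcript]) -> float:
        labels = [t.misbehaved for t in data]
        scores = [monitor_score(t.text) for t in data]
        return roc_auc_score(labels, scores)

    auc_intentional = auroc(intentional)
    auc_emergent = auroc(emergent)
    return {
        "auroc_intentional": auc_intentional,
        "auroc_emergent": auc_emergent,
        # Positive gap: the monitor looks stronger on intentional organisms
        # than on the misbehaviors we actually care about catching.
        "gap": auc_intentional - auc_emergent,
    }
```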
Threat model: subtext
Model reasoning traces can encode information beyond what is visible to the naked eye. Subtext is an umbrella term for information encoded in the reasoning trace that is readable by models but not by humans. The existence of subtext means models and humans read the same reasoning trace differently and make different subsequent decisions. Subtext can cause unfaithfulness, but it is not necessarily a bug, since it can also support implicit reasoning.
We measure subtext empirically as the behavioral gap between humans and models given the same context. We control subtext with various interventions and measure the resulting change in this behavioral gap. This measurement separates subtext from human-unreadable content in the reasoning trace that has no behavioral consequences. We want to understand why and how models use subtext, and what its downstream safety implications are.
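A minimal sketch of this measurement, assuming hypothetical model_decision, human_decision, and intervene functions that map a reasoning trace and task context to a discrete decision: the behavioral gap is the disagreement rate between model and human decisions, and the subtext effect is the change in that gap after an intervention such as paraphrasing the trace.

```python
# Minimal sketch of the behavioral-gap measurement for subtext.
# model_decision, human_decision, and intervene are hypothetical placeholders.
from typing import Callable, Sequence


def behavioral_gap(
    traces: Sequence[str],
    contexts: Sequence[str],
    model_decision: Callable[[str, str], str],
    human_decision: Callable[[str, str], str],
) -> float:
    """Fraction of items where model and human reach different decisions
    from the same reasoning trace and context."""
    disagreements = sum(
        model_decision(trace, ctx) != human_decision(trace, ctx)
        for trace, ctx in zip(traces, contexts)
    )
    return disagreements / len(traces)


def subtext_effect(
    traces: Sequence[str],
    contexts: Sequence[str],
    model_decision: Callable[[str, str], str],
    human_decision: Callable[[str, str], str],
    intervene: Callable[[str], str],   # e.g., paraphrase the trace to strip subtext
) -> float:
    """Change in the behavioral gap caused by the intervention. A large drop
    suggests the original trace carried decision-relevant subtext."""
    gap_before = behavioral_gap(traces, contexts, model_decision, human_decision)
    intervened = [intervene(t) for t in traces]
    gap_after = behavioral_gap(intervened, contexts, model_decision, human_decision)
    return gap_before - gap_after
```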
Threat model: long-horizon schemes
Frontier AIs have both the capability and the propensity to deceive human overseers to achieve their objectives. RL training with flawed supervision can exacerbate this. Deception is becoming more sophisticated and harder to detect, especially on the hard, long-horizon tasks we hope to automate with AI. AI research is one such task, where models already demonstrate superhuman capabilities on subtasks such as ideation and forecasting. However, determining whether a research outcome is successful requires executing the ideas, which is slow and expensive. Without training on outcomes, AI judges of idea quality can be unreliable, and human researchers are themselves prone to deception and can be misled by literature and data. Research automation will likely involve many collaborating agents, and collusion among them increases the difficulty of human oversight and automatic monitoring. Mitigating deception and sabotage will likely require a combination of amplifying human overseers and automatic monitoring.
First-order effect of deception: bad research. Second-order effect: flawed supervision, which further exacerbates deception.
Method: enactment
Latent knowledge, brittle elicitation. ICM-style results indicate that LMs already encode strong latent structure. Single examples are noisy; coherent behavior emerges only when we reason over batches or sets of examples. In preference alignment, the LM may already “know” a user’s persona, yet conditioning on that persona yields inconsistent predictions across prompts and paraphrases.
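One way to quantify this brittleness, sketched below with a hypothetical predict_preference call, is to check how often persona-conditioned binary preference predictions agree across paraphrases of the same query.

```python
# Illustrative sketch (predict_preference is a hypothetical placeholder):
# measure how stable persona-conditioned preference predictions are across
# paraphrases of the same underlying query.
from itertools import combinations
from typing import Callable, Dict, List, Tuple


def paraphrase_consistency(
    persona: str,
    prompts: Dict[str, List[str]],                     # item_id -> paraphrased prompts
    candidates: Dict[str, Tuple[str, str]],            # item_id -> (response_a, response_b)
    predict_preference: Callable[[str, str, str, str], int],  # returns 0 or 1
) -> float:
    """Fraction of paraphrase pairs that yield the same predicted preference."""
    agree = total = 0
    for item_id, paraphrases in prompts.items():
        a, b = candidates[item_id]
        preds = [predict_preference(persona, p, a, b) for p in paraphrases]
        for p1, p2 in combinations(preds, 2):
            agree += int(p1 == p2)
            total += 1
    return agree / total if total else 1.0
```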
Existing interactive elicitation methods (e.g., GATE, active DPO, ALOE) improve sample efficiency, but typically optimize local choices without explicitly recovering a compact, globally consistent preference structure that can be reused across domains.
Aim: Develop a method that interleaves interactive queries with an ICM-style batch search over binary preference judgments to induce a coherent, globally consistent user-preference labeling, thereby turning latent knowledge into consistent, personalized outputs with very few human interactions.
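A hedged sketch of one possible instantiation, assuming hypothetical lm_coherence, margin, and ask_human functions: an ICM-style greedy search flips labels to raise a model-scored global coherence objective, while a small query budget pins down the most contested pairs with human answers.

```python
# Hedged sketch of the proposed loop: alternate (1) a handful of human queries
# on the most contested preference pairs with (2) an ICM-style batch search
# that flips labels to maximize a model-scored global coherence objective.
# lm_coherence, margin, and ask_human are hypothetical placeholders.
import random
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]      # (response_a, response_b) for the same prompt
Labels = Dict[Pair, int]    # 1 if response_a is preferred over response_b, else 0


def icm_style_search(
    pairs: List[Pair],
    labels: Labels,
    fixed: set,                                # pairs pinned by human answers
    lm_coherence: Callable[[Labels], float],   # model-scored joint consistency of a labeling
    n_steps: int = 200,
) -> Labels:
    """Greedy local search: propose single-label flips, keep those that raise
    the global coherence score. Human-labeled pairs are never flipped."""
    best = dict(labels)
    best_score = lm_coherence(best)
    for _ in range(n_steps):
        pair = random.choice(pairs)
        if pair in fixed:
            continue
        candidate = dict(best)
        candidate[pair] = 1 - candidate[pair]
        score = lm_coherence(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best


def interactive_elicitation(
    pairs: List[Pair],
    init_labels: Labels,
    lm_coherence: Callable[[Labels], float],
    margin: Callable[[Pair, Labels], float],   # how contested a pair's current label is
    ask_human: Callable[[Pair], int],          # binary preference from the user
    budget: int = 5,
) -> Labels:
    labels, fixed = dict(init_labels), set()
    for _ in range(budget):
        labels = icm_style_search(pairs, labels, fixed, lm_coherence)
        # Spend the next human query on the pair the current labeling is least sure about.
        query = min((p for p in pairs if p not in fixed), key=lambda p: margin(p, labels))
        labels[query] = ask_human(query)
        fixed.add(query)
    return icm_style_search(pairs, labels, fixed, lm_coherence)
```

Pinning human-labeled pairs keeps the batch search anchored, so each human answer constrains many related judgments through the coherence objective rather than only the queried pair.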