We conduct research to ensure human oversight and control over future AIs.
- The motivation for our work is explained in this talk.
- The best way to get in touch is by using our contact form.
- We invite all researchers to collaborate on open problems through Sprints.
Deception
Language Models Learn to Mislead Humans via RLHF
Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats
Collusion
LLM Evaluators Recognize and Favor Their Own Generations
Spontaneous Reward Hacking in Iterative Self-Refinement
Coherence
Unsupervised Elicitation of Language Models