Praxis Research

We do alignment research to ensure human oversight and control over future AIs. The motivation for our work is explained in this talk. The best way to get in touch is through our contact form. We invite all researchers to collaborate on open problems through our Sprints.

Our current focus areas are:

  • Deception: "Language Models Learn to Mislead Humans via RLHF"; "Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats"
  • Collusion: "LLM Evaluators Recognize and Favor Their Own Generations"; "Spontaneous Reward Hacking in Iterative Self-Refinement"
  • Honesty: "Unsupervised Elicitation of Language Models"; "Self-Improvement as Coherence Optimization"