Praxis Research

We do alignment research to ensure human oversight and control over future AIs. The motivation for our work is explained in this talk. The best way to get in touch is through our contact form. We invite all researchers to collaborate on open problems through our Sprints.

Our current focus areas are:

  • Deception: "Language Models Learn to Mislead Humans via RLHF"; "Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats"
  • Collusion: "LLM Evaluators Recognize and Favor Their Own Generations"; "Spontaneous Reward Hacking in Iterative Self-Refinement"
  • Honesty: "Unsupervised Elicitation of Language Models"; "Self-Improvement as Coherence Optimization"