26.
Truthfulness Despite Weak Supervision: Evaluating and Training LLMs Using Peer Prediction
Tianyi Alex Qiu, Micah Carroll, and Cameron Allen
2026
25.
Imperfect Recall and AI Delegation
Eric Olav Chen, Alexis Ghersengorin, and Sami Petersen
2024
24.
Jackpot! Alignment as a Maximal Lottery
Roberto-Rafael Maura-Rivero, Marc Lanctot, Francesco Visin, and Kate Larson
2025
23.
Conservative Agency via Attainable Utility Preservation
Alex Turner, Dylan Hadfield-Menell, and Prasad Tadepalli
2019
22.
The Shutdown Problem: Incomplete Preferences as a Solution
Elliott Thornley
2024
21.
Natural Selection of Artificial Intelligence
Jeffrey Ely and Balazs Szentes
2023
20.
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Stephen Casper et al.
2023
19.
A Theory of Rule Development
Glenn Ellison and Richard Holden
2014
18.
Evolution of Preferences
Eddie Dekel, Jeffrey Ely, and Okan Yilankaya
2007
17.
Hidden Incentives for Auto-Induced Distributional Shift
David Krueger, Tegan Maharaj, and Jan Leike
2020
16.
Misspecification in Inverse Reinforcement Learning
Joar Skalse and Alessandro Abate
2022
15.
A Robust Bayesian Truth Serum for Small Populations
Jens Witkowski and David C. Parkes
2012
14.
Quantilizers: A Safer Alternative to Maximizers for Limited Optimization
Jessica Taylor
2016
Safety Considerations for Online Generative Modeling
Sam Marks
2022
13.
Safe Pareto Improvements for Delegated Game Playing
Caspar Oesterheld and Vincent Conitzer
2021
12.
Functional Decision Theory: A New Theory of Instrumental Rationality
Eliezer Yudkowsky and Nate Soares
2017
11.
Learning to Communicate with Deep Multi-Agent Reinforcement Learning
Jakob N. Foerster, Yannis M. Assael, Nando de Freitas, and Shimon Whiteson
2016
Emergent Covert Signaling in Adversarial Reference Games
Dhara Yu, Jesse Mu, and Noah Goodman
2022
10.
Getting Dynamic Implementation to Work
Yi-Chun Chen, Richard Holden, Takashi Kunimoto, Yifei Sun, and Tom Wilkening
2018
9.
Cooperation, Conflict, and Transformative Artificial Intelligence: A Research Agenda
Jesse Clifton
2020
8.
Investment Incentives in Truthful Approximation Mechanisms
Mohammad Akbarpour, Scott Kominers, Kevin Li, Shengwu Li, and Paul Milgrom
2020
7.
Corrigibility
Nate Soares, Benja Fallenstein, Eliezer Yudkowsky, and Stuart Armstrong
2015
The Off-Switch Game
Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell
2016
6.
Fully General Online Imitation Learning
Michael Cohen, Marcus Hutter, and Neel Nanda
2021
5.
Model-Free Opponent Shaping
Chris Lu, Timon Willi, Christian Schroeder de Witt, and Jakob Foerster
2022
The Good Shepherd: An Oracle Agent for Mechanism Design
Jan Balaguer, Raphael Koster, Christopher Summerfield, and Andrea Tacchetti
2022
4.
Discovering Agents
Zachary Kenton, Ramana Kumar, Sebastian Farquhar, Jonathan Richens, Matt MacDermott, and Tom Everitt
2022
3.
Decision Scoring Rules (Extended Version)
Caspar Oesterheld and Vincent Conitzer
2020
2.
Risks from Learned Optimization in Advanced Machine Learning Systems
Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant
2019
1.
The Principal-Agent Alignment Problem in Artificial Intelligence
Dylan Hadfield-Menell
2021
Incomplete Contracting and AI Alignment
Dylan Hadfield-Menell and Gillian Hadfield
2018