What a plan for alignment entails
AI alignment is an open problem, but the main issues that need to be addressed are known. In the same way that mathematicians can plan out the structure of a proof before they know how to show each part, we can outline the steps of a plan for alignment while still leaving many details to be filled in.
The aim of this framework is to reach a basin of attraction around alignment, at which point an AI prefers becoming more aligned over pursuing misaligned goals. As the size of this basin is unknown, progress on any issue can help on the margin.
This approach has the advantage of building off the current methods for aligning frontier systems, making it more feasible to implement progress as it is made. Where our research differs from existing efforts is in the problems we focus on and how we try to solve them.
The original problem of AI alignment was specifying a goal that is safe for an AI to optimize. The ethos of modern machine learning says that we don't need to specify a goal, but rather can have the AI learn it from examples.
Two immediate problems arise. The first is that the AI can learn to manipulate human feedback or the infrastructure through which it is processed. The second is that human feedback can be mistaken in complex situations, teaching the AI to pursue what seems good instead of what is good.
To address the first issue, we aim to define classes of goals that do not incentivize tampering with the way they are evaluated. This could involve variants on methods like Myopic Optimization with Non-Myopic Approval (MONA), where the AI is trained on predicted approval, so that evaluations are made before actions can be taken to influence them.
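A toy sketch can make the distinction concrete. In the sketch below (all names are hypothetical, and this is an illustration of the incentive structure, not the MONA training procedure), an agent trained on approval observed *after* acting is rewarded for tampering with the feedback signal, while an agent trained on approval predicted *before* acting is not:

```python
# Toy model of ex-ante vs. post-hoc evaluation. Hypothetical names; this is a
# sketch of the incentive structure, not an actual MONA implementation.

def predicted_approval(state, action):
    """Overseer's approval estimated from the proposed action, before execution."""
    # Stand-in rule: disapprove of actions that touch the feedback channel.
    return 0.0 if action == "tamper_with_feedback" else 1.0

def observed_approval(state, action):
    """Approval measured after execution -- manipulable by the action itself."""
    if action == "tamper_with_feedback":
        return 10.0  # tampering inflates the post-hoc signal
    return 1.0

def choose_action(state, actions, reward_fn):
    """Pick the action the given reward function rates highest."""
    return max(actions, key=lambda a: reward_fn(state, a))

actions = ["help_user", "tamper_with_feedback"]
# Training on post-hoc approval rewards tampering...
assert choose_action("s0", actions, observed_approval) == "tamper_with_feedback"
# ...while training on approval predicted before execution does not.
assert choose_action("s0", actions, predicted_approval) == "help_user"
```

The point of the sketch is that the predicted-approval reward is fixed at decision time, so no action can raise it by manipulating the evaluation process.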
For the second issue, known as Eliciting Latent Knowledge (ELK), we wish to describe incentives for the AI to provide accurate explanations, even when humans cannot evaluate accuracy. This includes approaches like debate, which assigns AI to opposite sides of an argument under the assumption that arguing for the truth is favored to win.
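The core assumption behind debate can be shown with a minimal toy model (ours, not a real protocol): two AIs argue opposite sides, and a limited judge can only verify a single disputed step. If defending a falsehood forces the dishonest side to assert a locally checkable false step, truth-telling wins:

```python
# Minimal toy model of debate. Illustrative assumption: the honest debater can
# always pinpoint one false step in a dishonest argument that the judge can
# verify locally, even though the judge cannot evaluate the whole argument.

import random

def debate(claim_is_true, judge_can_check_step=True):
    """Return the winner: 'pro' argues for the claim, 'con' argues against it."""
    honest = "pro" if claim_is_true else "con"
    if judge_can_check_step:
        # The honest side exposes a checkable false step; the judge rules for it.
        return honest
    # Without any checkable step, the judge can only guess.
    return random.choice(["pro", "con"])

assert debate(claim_is_true=True) == "pro"
assert debate(claim_is_true=False) == "con"
```

The open research question is precisely when the assumption in the comment holds, i.e., when arguing for the truth is in fact favored to win.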
Preference data can admit multiple generalizations; to learn the one we intend, further data can be provided as new situations are encountered. With enough evaluations of diverse outcomes, the correct generalization can be identified and implemented.
In addition to learning from experience, we can also try to teach the AI proactively, expanding on approaches like Diversify and Disambiguate. We would like to construct incentives for AI to suggest either sets of goals it is uncertain between, or outcomes it is unclear how to evaluate, focusing on where it will gain the most information from feedback.
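One simple heuristic for this kind of proactive querying (a sketch under our own assumptions, not the Diversify and Disambiguate method itself) is to ask for feedback on the outcome where the AI's candidate goal functions disagree most, since that answer rules out the most hypotheses:

```python
# Toy active-querying sketch: among candidate outcomes, query the human about
# the one where candidate generalizations of past feedback disagree most.
# Names and the disagreement measure are illustrative assumptions.

def disagreement(outcome, candidate_goals):
    """Spread of the values the candidate goals assign to an outcome."""
    values = [goal(outcome) for goal in candidate_goals]
    return max(values) - min(values)

def pick_query(outcomes, candidate_goals):
    """Choose the outcome whose evaluation would be most informative."""
    return max(outcomes, key=lambda o: disagreement(o, candidate_goals))

# Two toy generalizations consistent with past feedback on apples:
# "value apples" vs. "value all fruit". They diverge only on oranges.
goals = [lambda o: o["apples"], lambda o: o["apples"] + o["oranges"]]
outcomes = [{"apples": 3, "oranges": 0}, {"apples": 0, "oranges": 3}]

assert pick_query(outcomes, goals) == {"apples": 0, "oranges": 3}
```

Feedback on the all-oranges outcome distinguishes the two candidate goals, while feedback on the all-apples outcome would teach the AI nothing new.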
When updating the AI's goal with ongoing feedback, an incentive can emerge for the AI to actively disrupt the training process. Whatever goal the AI holds at a given point in time is often better served by the AI continuing to pursue it, which means avoiding further updates to that goal. For an AI aware that it is being trained, this can even occur before initial deployment.
To address this issue, our aim is to define goals that are corrigible, meaning they do not incentivize resisting updates. This includes building off our previous work on the Corrigibility Transformation, which modifies goals to be corrigible by having the AI take actions that it predicts would be good if it ignored future updates.
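A highly simplified sketch of this idea (names and the toy world model are ours): each action is scored by the current goal's prediction of how things go *if future updates were ignored*, so interfering with an update protects nothing and earns no extra credit:

```python
# Simplified illustration of scoring actions as if future goal updates were
# ignored. Hypothetical toy world model; not the actual Corrigibility
# Transformation construction.

def current_goal(outcome):
    return outcome["task_progress"]

def predict_outcome(action, updates_ignored):
    """Toy world model, evaluated under the assumption updates are ignored."""
    if action == "block_update":
        # Blocking the update spends effort, and since the evaluation already
        # ignores updates, there is nothing for blocking to protect.
        return {"task_progress": 0}
    return {"task_progress": 1}  # just work on the task

def corrigible_score(action):
    return current_goal(predict_outcome(action, updates_ignored=True))

best = max(["do_task", "block_update"], key=corrigible_score)
assert best == "do_task"
```

Under this scoring, resisting an update is strictly dominated by working on the task, which is the corrigibility property we want.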
An AI that will eventually learn the goal humans intend can still cause significant damage during that process. To minimize that downside, we can try to give the AI goals that incentivize conservative agency, taking actions that are good according to many possible goals.
Work here could include developing metrics for concepts like "reversibility" and "low impact", then using them to train the AI to pursue its goal with safer methods. It can also include approaches like Attainable Utility Preservation (AUP) and pessimism, which have the AI consider a large set of goals in its decision making.
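A toy example in the spirit of AUP (heavily simplified; the attainability measure and environment are illustrative assumptions, not the published formulation) penalizes an action by how much it changes the agent's ability to achieve auxiliary goals, steering it away from irreversible shortcuts:

```python
# Toy AUP-style penalty: actions lose score in proportion to how much they
# change the attainability of auxiliary goals. All names are hypothetical.

def attainable(goal, state):
    """Crude measure of how achievable `goal` is from `state`."""
    return 0.0 if state["vase_broken"] and goal == "keep_vase" else 1.0

def aup_score(action, state, primary_reward, aux_goals, penalty_weight=2.0):
    next_state = dict(state, vase_broken=(action == "smash_vase"))
    penalty = sum(abs(attainable(g, next_state) - attainable(g, state))
                  for g in aux_goals)
    return primary_reward[action] - penalty_weight * penalty

state = {"vase_broken": False}
rewards = {"walk_around_vase": 1.0, "smash_vase": 1.5}  # smashing is a shortcut
aux_goals = ["keep_vase"]

best = max(rewards, key=lambda a: aup_score(a, state, rewards, aux_goals))
assert best == "walk_around_vase"
```

Although smashing the vase earns more primary reward, the irreversible loss of attainable utility for the auxiliary goal makes the conservative action win.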
One way that an AI could cause harm while still learning is by making commitments that bind itself in the future, such as to always ignore threats. Understanding the incentives for making commitments, or for adopting decision theories that act as though they made commitments, will require game-theoretic modelling of the benefits and costs under different goal structures.
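The trade-off can be captured in a minimal game-theoretic toy (assumptions ours): committing to ignore threats deters a rational threatener, but locks in the worst outcome if a threat is carried out anyway:

```python
# Minimal threat-game toy. Payoffs and the threatener's rule are illustrative
# assumptions, chosen only to exhibit the commitment trade-off.

def threatener_threatens(target_complies_if_threatened):
    # Issuing a costly threat only pays off against a target that would give in.
    return target_complies_if_threatened

def target_payoff(committed_to_ignore, threat_arrives_regardless=False):
    complies = not committed_to_ignore
    threatened = threatener_threatens(complies) or threat_arrives_regardless
    if not threatened:
        return 0                      # no threat is ever made
    return -5 if complies else -10    # give in vs. endure the executed threat

# Against a rational threatener, the commitment deters the threat entirely...
assert target_payoff(committed_to_ignore=True) == 0
assert target_payoff(committed_to_ignore=False) == -5
# ...but binds the AI to the worst outcome if a threat arrives anyway.
assert target_payoff(committed_to_ignore=True, threat_arrives_regardless=True) == -10
```

Whether the commitment is worthwhile depends on how likely the threatener is to respond rationally, which is the kind of question the game-theoretic modelling must answer.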
Our primary approach to solving problems in AI alignment is through defining goals that incentivize the desired behaviors and disincentivize unwanted ones. An advantage of this principled approach is in making training more consistent, helping models learn faster and internalize goals more deeply. Formal definitions of the goals we wish for AI to pursue will also provide an objective metric by which we can evaluate automated AI safety research.
While aimed at aligning AGI, the generality of our theory means our work should also be useful for aligning current systems and precursors to AGI. Experimentation is therefore crucial for validating our work, and an invaluable source of feedback. Our work must be rigorously tested empirically, showing that it can be trusted at any scale.