Rubi Hudson
2026
We show how goals can be modified to remove instrumental incentives for preventing goal updates, including shutdown.

An AI agent will learn a desired goal more effectively if it does not resist the training process, yet many partially learned goals incentivize an AI to avoid further goal updates. We would like goals to be corrigible, meaning they allow requested changes, so that we can confidently correct errors and shut the AI down if necessary. Despite corrigibility being a crucial safety property, the existing literature does not specify goals that are both corrigible and competitive with alternatives. We introduce a transformation that constructs a corrigible version of nearly any goal without sacrificing performance: it elicits predictions of reward conditional on costlessly preventing updates, and has the agent pursue that predicted target myopically. We then show that the resulting goals achieve optimal performance among the class of corrigible goals, incentivize allowing mid-action overrides, disincentivize deliberate self-modification, and induce corrigible behavior in gridworld settings.
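The core idea can be sketched in a few lines of toy code. This is an illustrative assumption-laden sketch, not the paper's formalism: the function names, the `updates_prevented` flag, and the toy predictor are all hypothetical. It shows why the transformation removes the incentive to resist: the agent's target is the reward it is *predicted* to earn under the counterfactual that updates are prevented at no cost, so actually spending effort to resist updates cannot raise that target.

```python
# Hypothetical sketch of the transformation described above (all names are
# illustrative assumptions, not the paper's notation).

def corrigible_target(predict_reward, state, action):
    """Predicted reward conditional on goal updates (including shutdown)
    being prevented at no cost. The counterfactual already grants
    update-prevention for free, so resisting updates adds nothing."""
    return predict_reward(state, action, updates_prevented=True)

def myopic_policy(predict_reward, state, actions):
    """Pursue the corrigible target myopically: pick the single action
    maximizing the one-step prediction, with no planning over how future
    goal changes would alter the objective."""
    return max(actions, key=lambda a: corrigible_target(predict_reward, state, a))

# Toy predictor: under the counterfactual, resisting an update yields no
# extra reward but still carries an effort cost.
def toy_predictor(state, action, updates_prevented):
    base = {"work": 1.0, "resist_update": 1.0}
    effort_cost = {"work": 0.0, "resist_update": 0.2}
    return base[action] - effort_cost[action] if updates_prevented else base[action]

print(myopic_policy(toy_predictor, state=None, actions=["work", "resist_update"]))
# → work
```

Under these toy assumptions the agent simply works rather than resisting, since resistance is a pure cost once update-prevention is counterfactually free.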

Disclaimer: This paper is currently under review, and will be updated shortly to incorporate the helpful suggestions made by reviewers.