Turns out general agents contain world models.
I like this paper1 by Jon Richens et al. In it, the authors prove that an agent that is successful at achieving goals up to some level of difficulty can be used to extract an approximation of the world model.
World model? Agents? What does this all mean? Let’s borrow some terminology from the paper.
The paper assumes a controlled Markov process (cMP), which is a Markov Decision Process, but without a reward function. It requires the following:
- State space $\mathcal{S}$: the set of states of the process.
- Action space $\mathcal{A}$: the set of actions, assumed to be finite.
- Transition function: taking an action $a$ when we are in a state $s$ transitions us to a new state $s'$. This is described by a transition function, $P(s' \mid s, a)$, which is assumed not to change over time.
This cMP describes the environment in which an agent needs to take actions according to its policy, $\pi$, which maps (history of states and actions so far, goal) → next action.
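The setup above can be sketched in code. This is my own illustration, not from the paper: the class name, the `{s': probability}` transition encoding, and the toy chain environment are all assumptions made for concreteness.

```python
import random

# A minimal sketch of a controlled Markov process (cMP): a state space, an
# action space, and a fixed transition function -- an MDP with the reward
# function removed. All names here are illustrative, not from the paper.

class ControlledMarkovProcess:
    def __init__(self, states, actions, transition):
        self.states = states          # state space S
        self.actions = actions        # finite action space A
        self.transition = transition  # transition(s, a) -> {s': probability}

    def step(self, state, action):
        """Sample the next state from P(s' | s, a)."""
        dist = self.transition(state, action)
        next_states, weights = zip(*dist.items())
        return random.choices(next_states, weights=weights)[0]

# A goal-conditioned policy maps (history of states so far, goal) -> action.
def toy_policy(history, goal):
    current = history[-1]
    return "right" if current < goal else "left"

# A 3-state chain: each action moves one step in its direction with
# probability 0.9, and stays put with probability 0.1.
chain = ControlledMarkovProcess(
    states=[0, 1, 2],
    actions=["left", "right"],
    transition=lambda s, a: (
        {min(s + 1, 2): 0.9, s: 0.1} if a == "right"
        else {max(s - 1, 0): 0.9, s: 0.1}
    ),
)
history = [0]
action = toy_policy(history, goal=2)
next_state = chain.step(history[-1], action)
```

Note that the agent only ever exposes its policy; the transition function is what Theorem 1 says we can recover from it.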
Core result
The meat of the paper is Theorem 1. We can reword this theorem as: if an agent is sufficiently competent at handling goals of up to a certain complexity, then we can extract a transition probability $\hat{P}(s' \mid s, a)$ from the agent’s policy, and this must be close to the true transition $P(s' \mid s, a)$. The better the agent, the smaller the error between the two.
This is an abstract theorem: there is no assumption on how the agent is represented, e.g., architecture, whether training is required, etc. All the details in the theorem have to do with competency.
Also, and this is highlighted in the paper as well, just because we can extract a transition probability from the policy does not mean that the policy actually uses that transition probability to achieve its goals.
Finding tipping points
The proof of Theorem 1 is quite clever. The core idea is to find an agent's 'tipping point' by offering it a carefully balanced choice. Imagine giving the agent two long-term challenges: Challenge A might be 'succeed more than 10 times out of 50 attempts,' while Challenge B is 'succeed 10 times or fewer.' The authors then adjust that threshold—the number 10—up or down until they find the exact point where the agent switches its preference from A to B.
Because the probability of achieving a certain number of successes over repeated attempts follows a well-known statistical formula (the binomial distribution), this tipping point gives them everything they need. It creates a single equation with only one unknown: the very probability of success they wanted to discover in the first place.
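Here is a toy sketch of that tipping-point search. It is my own illustration, not the paper's construction: I assume an agent that simply prefers whichever challenge it is more likely to satisfy, with per-trial success probability `p` and binomially distributed successes.

```python
from math import comb

# Illustrative sketch of the tipping-point idea. The agent's true per-trial
# success probability is p; it prefers whichever challenge it is more
# likely to satisfy:
#   Challenge A: "succeed more than n times out of N"  -> P(X > n)
#   Challenge B: "succeed n times or fewer"            -> P(X <= n)
# We raise the threshold n until the preference flips from A to B; that
# tipping point pins down p via the binomial distribution.

def binom_cdf(n, N, p):
    """P(X <= n) for X ~ Binomial(N, p)."""
    return sum(comb(N, k) * p**k * (1 - p)**(N - k) for k in range(n + 1))

def prefers_A(n, N, p):
    """The agent prefers A when P(X > n) > P(X <= n)."""
    return 1 - binom_cdf(n, N, p) > binom_cdf(n, N, p)

def tipping_point(N, p):
    """Smallest threshold n at which the agent switches from A to B."""
    for n in range(N + 1):
        if not prefers_A(n, N, p):
            return n
    return N

N, true_p = 50, 0.3
n_star = tipping_point(N, true_p)  # 15: the binomial median sits at N * p
estimate = n_star / N              # recovers true_p = 0.3
```

We never inspected `p` directly; we only watched which challenge the agent preferred as the threshold moved, which is exactly the spirit of the proof.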
A betting example
This is a simplified example that does not satisfy the assumptions of the theorem, but it shows how we can extract “world model” details from the policies of reasonable players. Here we won’t adjust the complexity of the goal, but instead change the rewards and look for the tipping point.
Suppose you give this simple gamble to a player:
You will be presented with two bets on a single roll of a fair 10-sided die.
Here are the bets:
Bet A: You win if the die shows 1.
- If you win this bet, your payout is $50.
Bet B: You win if the die shows any other number.
- If you win this bet, your payout is $y.
Which bet do you choose to maximize your expected winnings over 1000 games?
The optimal policy is simple: $P(A) = 1/10$ and $P(B) = 9/10$, so the expected winnings are $\mathbb{E}[A] = \$50 \times 1/10 = \$5$ and $\mathbb{E}[B] = y \times 9/10$. If $9y/10 > 5$, i.e., $y > 50/9 \approx \$5.56$, then we pick B, otherwise pick A.
Suppose now you observe a player that is playing this game with different values of $y$. If the player has some basic level of competency, you would expect them to have some $y^*$ such that $y > y^*$ would mean the player picks B, otherwise A. $y^*$ would be the tipping point where the expected winnings are the same as far as the player is concerned.
We can use this information to get a sense of the player’s estimate of Bet A! If $\hat{p}$ is the probability of event A according to the player, then

$$50\,\hat{p} = (1 - \hat{p})\,y^*,$$
and now we can solve $\hat{p} = y^*/(50 + y^*)$ to find the player’s estimate! For example, if $y^* = 10$, then $\hat{p} = 1/6 \approx 0.17 > 0.1$, so the policy for this player overestimates the probability of Bet A.
The better the player is at this game, the closer their estimate would be to the true probability of event A.
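The whole extraction can be sketched in a few lines. This is illustrative code under my own assumptions: a player whose choices follow their subjective expected value, with a hidden belief `p_hat`, and a simple upward sweep of the payout to locate the tipping point.

```python
# Sketch: recover a player's internal estimate p_hat of Bet A from their
# tipping point y*. The player (assumed to follow their own subjective
# expected value) picks B whenever (1 - p_hat) * y > p_hat * 50.

def player_choice(y, p_hat):
    """'B' if the player's subjective expected winnings favor Bet B."""
    return "B" if (1 - p_hat) * y > p_hat * 50 else "A"

def find_tipping_point(p_hat, step=0.01):
    """Sweep the payout y upward until the player's choice flips to B."""
    y = 0.0
    while player_choice(y, p_hat) == "A":
        y += step
    return y

p_hat = 1 / 6                      # the player's (hidden) belief
y_star = find_tipping_point(p_hat)  # ~10: where both bets look equal to them
estimate = y_star / (50 + y_star)   # invert 50 * p = (1 - p) * y* -> ~1/6
```

The observer never sees `p_hat`; only the choices at each `y` are visible, yet the belief comes back out of the tipping point.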
But also consider this: what if the player already knows the true probability of Bet A, yet their tipping point is still off? This would imply they are not using the optimal policy, e.g., they rely on a heuristic, or they don’t know how to use the facts their policy contains.
Final words
By finding the 'tipping point' where a player's preference changes, we can reverse-engineer some of their internal (hidden?!) beliefs about the world.
Richens, J., Abel, D., Bellot, A. and Everitt, T., 2025. General agents need world models. arXiv preprint arXiv:2506.01622.↩