Paper notes: The implicit dynamics of in-context learning

28 May, 2026

I liked this paper; I think it gives a nice way to think about how self-attention (and equivalent constructions) operate. Unless otherwise stated, all ideas below are from the authors and I'm just simplifying a bit.

The authors come up with an abstraction called a "contextual layer", i.e., a function $A$ that takes either a single vector $x$, or a context $C$ together with $x$ (the collection of vectors $[C, x] = [c_{1}, \dots, c_{n}, x]$), and maps it to a vector. We write these two cases as $A (x)$ and $A (C, x)$. Both output vectors have the same dimensionality.

Then we define a contextual block $T_{W}$, i.e., the composition of a function $M_{W} (z) = f_{θ} (W z + b)$ with $A$: $T_{W} = M_{W} \circ A$, and ask: what effect does a part of the context $Y \subseteq C$ have on the block's output?

To answer this question, I will spell out the proof from the paper and simplify $T_{W}$ a bit.

Proof

Let us say $M_{W} (z) = W z$ (that is, we drop $f_{θ}$ and the bias $b$) and we want to see the effect of $Y$ on $T_{W}$. To do this, we will search for a matrix $Γ$ such that

$T_{W} (C, x) = T_{W + Γ} (C ⧵ Y, x)$

that is, we can move the effect of $Y$ to the weight matrix and $Γ$ acts like a weight update! Of course we don't know that such a $Γ$ exists yet.

From the definition of $T_{W}$ :

$T_{W + Γ} (C ⧵ Y, x) = (W + Γ) A (C ⧵ Y, x) = W A (C ⧵ Y, x) + Γ A (C ⧵ Y, x) .$

We would want this to be equal to $T_{W} (C, x)$ , which means

$W A (C ⧵ Y, x) + Γ A (C ⧵ Y, x) = W A (C, x)$

or equivalently

$Γ A (C ⧵ Y, x) = W (A (C, x) - A (C ⧵ Y, x)) .$

It seems natural to define $Δ A (Y) := A (C, x) - A (C ⧵ Y, x)$, which is a discrete derivative of sorts. This quantity depends only on things we already know.

Next we would want to solve for $Γ$, except what we have is a single equation of the form (unknown matrix) × (known vector) = (known vector), and that is not enough information to pin down the matrix.

So instead, we simplify further: suppose $Γ$ is rank 1, so there exist vectors $u, v$ such that $Γ = u v^{T}$. For typesetting reasons, I also set $m := A (C ⧵ Y, x)$ and will write $Δ A$ instead of $Δ A (Y)$. Then

$u v^{T} m = W Δ A$

This looks much better. In fact $v^{T} m$ is a scalar, so we may as well require $v^{T} m = 1$, which is satisfied by $v = m / ‖ m ‖^{2}$ (any other $v$ with $v^{T} m = 1$ would work too). All we need is that $‖ m ‖ \neq 0$.

With that selection of $v$, we get $u = W Δ A$ and we are done: we have found a rank-1 matrix $Γ = (W Δ A) m^{T} / ‖ m ‖^{2}$ that updates $W$ as we wanted. $◻$
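The construction in the proof is easy to check numerically. The sketch below is only an illustration under stated assumptions: `contextual_layer` is a hypothetical toy stand-in for $A$ (a single softmax-attention read-out with $x$ as the query, not the paper's layer), the shapes are arbitrary, and $Y$ is taken to be the last two context tokens. It builds $Γ = u v^{T}$ with $u = W Δ A$ and $v = m / ‖ m ‖^{2}$ and compares both sides of $T_{W} (C, x) = T_{W + Γ} (C ⧵ Y, x)$.

```python
import numpy as np

def contextual_layer(context, x):
    """Toy stand-in for A (an assumption): softmax attention of query x over [context, x]."""
    tokens = np.vstack([context, x[None, :]])   # rows: c_1, ..., c_n, x
    scores = tokens @ x                          # dot-product scores against the query
    weights = np.exp(scores - scores.max())      # numerically stable softmax
    weights /= weights.sum()
    return weights @ tokens                      # weighted average of the tokens

rng = np.random.default_rng(0)
d_in, d_out, n = 4, 3, 5
W = rng.normal(size=(d_out, d_in))
C = rng.normal(size=(n, d_in))
x = rng.normal(size=d_in)

C_minus_Y = C[:3]                                # Y = the last two context tokens

a_full = contextual_layer(C, x)                  # A(C, x)
m = contextual_layer(C_minus_Y, x)               # A(C \ Y, x)
delta_A = a_full - m                             # Delta A

u = W @ delta_A
v = m / (m @ m)                                  # chosen so that v^T m = 1
gamma = np.outer(u, v)                           # rank-1 implicit weight update

lhs = W @ a_full                                 # T_W(C, x)
rhs = (W + gamma) @ m                            # T_{W + Gamma}(C \ Y, x)
print(np.allclose(lhs, rhs), np.linalg.matrix_rank(gamma))
```

Nothing here depends on the particular choice of layer: any function returning a nonzero vector for $C ⧵ Y$ would pass the same check, which is the point of the abstraction.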

How many such $Γ$ are there? Well, for every matrix $B$ such that

$B A (C ⧵ Y, x) = 0$

the matrix $Γ + B$ would also work for our purposes!
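To see this non-uniqueness concretely, here is a small sketch. As an assumption, `m` and `delta_A` are just random vectors standing in for $A (C ⧵ Y, x)$ and $Δ A$, and $B$ is built by multiplying a random matrix with the projector onto the orthogonal complement of $m$, which is one convenient way (of many) to get $B m = 0$.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out = 4, 3
W = rng.normal(size=(d_out, d_in))
m = rng.normal(size=d_in)          # stand-in for A(C \ Y, x)
delta_A = rng.normal(size=d_in)    # stand-in for Delta A = A(C, x) - A(C \ Y, x)

# Rank-1 update from the proof: Gamma = (W Delta A) m^T / ||m||^2
gamma = np.outer(W @ delta_A, m) / (m @ m)

# Projector onto the orthogonal complement of m, so that P m = 0
P = np.eye(d_in) - np.outer(m, m) / (m @ m)
B = rng.normal(size=(d_out, d_in)) @ P   # any matrix of this form satisfies B m = 0

target = W @ (m + delta_A)               # W A(C, x)
print(np.allclose((W + gamma) @ m, target))       # the rank-1 update works
print(np.allclose((W + gamma + B) @ m, target))   # and so does Gamma + B
print(np.linalg.matrix_rank(gamma + B))           # generally no longer rank 1
```

So the rank-1 $Γ$ is one convenient representative of a whole family of valid updates, not the only one.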

With this result, we can view the action of the contextual block in two ways:

1. We supply an input $z$, then we get $h = A (C, z)$ from the contextual layer and $W h$ from the contextual block, or
2. With some $Y \subseteq C$, we update the weight $W \to W^{'} = W + Γ$, then we compute $h^{'} = A (C ⧵ Y, z)$ to get the representation of $z$, and finally $W^{'} h^{'}$ gives us the output.

We have the freedom to pick $Y$ however we like, even $Y = C$ , as long as $A (C ⧵ Y, z) \neq 0$ .
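The extreme case $Y = C$ can be sketched as follows: the whole context is absorbed into the weights and the layer only ever sees $z$. As before, `contextual_layer` is a hypothetical toy attention layer chosen for illustration, not the paper's; for this particular toy layer, $A (z)$ with no context is just $z$ itself.

```python
import numpy as np

def contextual_layer(context, x):
    """Toy stand-in for A (an assumption): softmax attention of query x over [context, x]."""
    tokens = np.vstack([context, x[None, :]])
    scores = tokens @ x
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ tokens

rng = np.random.default_rng(2)
d_in, d_out = 4, 3
W = rng.normal(size=(d_out, d_in))
C = rng.normal(size=(6, d_in))
z = rng.normal(size=d_in)

# View 1: keep W fixed and feed the full context.
h = contextual_layer(C, z)
out_with_context = W @ h

# View 2: take Y = C, fold the context into W, and feed z alone.
h_prime = contextual_layer(C[:0], z)             # empty context, i.e. A(z)
gamma = np.outer(W @ (h - h_prime), h_prime) / (h_prime @ h_prime)
out_with_update = (W + gamma) @ h_prime

print(np.allclose(out_with_context, out_with_update))
```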

1D picture

In 1D it all becomes a bit trivial. Say $A$ maps sequences of scalars to a scalar. Setting up the problem as before (but now $W$ and $Γ$ are scalars as well) with $Y = \{x_{n}\}$, we have:

$W A (x_{1}, \dots, x_{n}) = (W + Γ) A (x_{1}, \dots, x_{n - 1})$

and so, assuming $A (x_{1}, \dots, x_{n - 1}) \neq 0$ ,

$Γ = W \frac{A (x_{1}, \dots, x_{n}) - A (x_{1}, \dots, x_{n - 1})}{A (x_{1}, \dots, x_{n - 1})} .$

So in 1D, the implicit update is just the relative change in the contextual representation, scaled by $W$.
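A tiny worked instance, assuming (purely for illustration) that $A$ is the arithmetic mean of the sequence and picking $W = 2$ with the sequence $1, 2, 3, 6$:

```python
def A(xs):
    """Toy scalar contextual layer (an assumption): the mean of the sequence."""
    return sum(xs) / len(xs)

W = 2.0
xs = [1.0, 2.0, 3.0, 6.0]

a_full = A(xs)         # A(x_1, ..., x_n)     = 3.0
a_prev = A(xs[:-1])    # A(x_1, ..., x_{n-1}) = 2.0

# Implicit update: W times the relative change in the representation
gamma = W * (a_full - a_prev) / a_prev

print(gamma)                    # 1.0
print(W * a_full)               # 6.0
print((W + gamma) * a_prev)     # 6.0
```

Here the representation grows by 50% when $x_{n}$ is added, so the weight has to grow by 50% (from 2 to 3) to reproduce the same output without it.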