Paper notes: The implicit dynamics of in-context learning
https://arxiv.org/html/2507.16003v1
I liked this paper; I think it gives a nice way to think about how self-attention (and equivalent constructions) operate. Unless otherwise stated, all ideas below are from the authors and I'm just simplifying a bit.
The authors come up with an abstraction called a "contextual layer", i.e., a function $A$ that can map either a vector $x$ to another vector, or a context $C$ (a collection of vectors $c_1, \dots, c_n$) together with $x$ to a vector, i.e., $A(x)$ and $A(C, x)$. Both output vectors have the same dimensionality.
Then we define a contextual block $T_W$, i.e., a composition of a function $M_W$ with $A$: $T_W = M_W \circ A$, and ask: what effect does $C$ have on $T_W$?
To answer this question, I will spell out the proof from the paper and simplify a bit.
Proof
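Let us say $M_W(z) = f(Wz + b)$ and we want to see the effect of $C$ on $T_W$. To do this, we will search for a matrix $\Delta W$ such that
$$T_W(C, x) = T_{W + \Delta W}(x),$$
that is, we can move the effect of $C$ to the weight matrix, and $C$ acts like a weight update! Of course we don't know that such a $\Delta W$ exists yet.
From the definition of $T_W$:
$$T_W(C, x) = M_W(A(C, x)) = f(W A(C, x) + b).$$
We would want this to be equal to $T_{W + \Delta W}(x) = f((W + \Delta W) A(x) + b)$, which holds if
$$W A(C, x) = (W + \Delta W) A(x),$$
or equivalently
$$\Delta W \, A(x) = W \left( A(C, x) - A(x) \right).$$
It seems natural to define $\Delta A(C, x) = A(C, x) - A(x)$, which is a discrete derivative of sorts. This quantity depends only on things we already know.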
Next we would want to solve for $\Delta W$, except we basically have matrix * vector = vector, and the matrix is unknown -- there's not enough information to pin down the matrix.
So instead, we simplify further -- suppose $\Delta W$ is rank 1, so there exist vectors $u, v$ such that $\Delta W = u v^\top$. For typesetting reasons, I also set $a = A(x)$ and will write $\Delta A$ instead of $\Delta A(C, x)$. Then
$$u \, (v^\top a) = W \Delta A.$$
This looks much better: $v^\top a$ is a scalar, so (assuming $a \neq 0$) we may as well pick $v = a / \lVert a \rVert^2$, which means $v^\top a = 1$ (but we could have picked any other choice). All we need is that $v^\top a \neq 0$.
With that selection of $v$, $u = W \Delta A$, and we are done: we have now discovered a rank-1 $\Delta W = (W \Delta A) \, a^\top / \lVert a \rVert^2$ that can update $W$ as we would like.
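The construction above can be checked numerically. The following is a minimal sketch, not the paper's code: the toy `contextual_layer` (a single dot-product attention step with $x$ as the query) and all dimensions are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_ctx = 4, 3, 5  # arbitrary sizes (assumption)


def contextual_layer(x, context=None):
    """Toy stand-in for A: dot-product attention with x as the query.

    Without context the softmax runs over a single token, so A(x) = x.
    """
    tokens = x[None, :] if context is None else np.vstack([context, x])
    scores = tokens @ x / np.sqrt(x.size)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ tokens


W = rng.normal(size=(d_out, d_in))
x = rng.normal(size=d_in)
C = rng.normal(size=(n_ctx, d_in))

a = contextual_layer(x)                 # A(x)
delta_A = contextual_layer(x, C) - a    # A(C, x) - A(x)

v = a / (a @ a)                         # chosen so that v^T a = 1
u = W @ delta_A
delta_W = np.outer(u, v)                # rank-1 update u v^T

lhs = (W + delta_W) @ a                 # updated weights, no context
rhs = W @ contextual_layer(x, C)        # original weights, with context
print(np.allclose(lhs, rhs))
print(np.linalg.matrix_rank(delta_W))
```

The two sides agree up to floating-point error for any choice of toy layer, since the argument only uses $a = A(x)$ and $\Delta A$ as given vectors.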
How many of those are there? Well, for every matrix $N$ such that
$$N a = 0,$$
the matrix $\Delta W + N$ would also work for our purposes!
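A small sketch of this non-uniqueness, under the assumption that random vectors stand in for $A(x)$ and $A(C, x)$: one way to build a matrix that annihilates $a$ is to multiply any matrix `B` by the projector onto the orthogonal complement of $a$. The names `B`, `P`, and `N` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out = 4, 3  # arbitrary sizes (assumption)

W = rng.normal(size=(d_out, d_in))
a = rng.normal(size=d_in)        # stands in for A(x)
a_ctx = rng.normal(size=d_in)    # stands in for A(C, x)

# Rank-1 solution from the derivation: (W dA) a^T / ||a||^2
delta_W = np.outer(W @ (a_ctx - a), a) / (a @ a)

# Any N with N a = 0 can be added. Here N = B P, where P projects
# onto the orthogonal complement of a, so P a = 0.
B = rng.normal(size=(d_out, d_in))
P = np.eye(d_in) - np.outer(a, a) / (a @ a)
N = B @ P

target = W @ a_ctx
ok_rank1 = np.allclose((W + delta_W) @ a, target)
ok_shifted = np.allclose((W + delta_W + N) @ a, target)
print(ok_rank1, ok_shifted)
print(np.linalg.matrix_rank(delta_W + N))  # generally no longer rank 1
```

So the rank-1 update is one convenient representative of a whole affine family of valid updates.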
With this result, we can view the action of the contextual block in two ways:
- We supply an input $(C, x)$, then we get $A(C, x)$ from the contextual layer and $T_W(C, x)$ from the contextual block, or
- With some $v$, we update the weight $W \to W + \Delta W$, then we compute $A(x)$ to get the representation of $x$, and finally $M_{W + \Delta W}$ gives us the output.
We have the freedom to pick $v$ however we like, even $v = a$, as long as $v^\top a \neq 0$; the matching $u$ is then $W \Delta A / (v^\top a)$.
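The two views can be compared end to end for several choices of $v$. This is a sketch under stated assumptions: the toy layer averages tokens, $f$ is taken to be `tanh`, and `implicit_update` is a hypothetical helper implementing $\Delta W = (W \Delta A)\, v^\top / (v^\top a)$.

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_out, n_ctx = 4, 3, 6  # arbitrary sizes (assumption)


def contextual_layer(x, context=None):
    # Toy A: average x with whatever context tokens are supplied (assumption).
    tokens = x[None, :] if context is None else np.vstack([context, x])
    return tokens.mean(axis=0)


def block(W, b, z):
    # M_W(z) = f(W z + b), with f = tanh chosen for illustration.
    return np.tanh(W @ z + b)


W = rng.normal(size=(d_out, d_in))
b = rng.normal(size=d_out)
x = rng.normal(size=d_in)
C = rng.normal(size=(n_ctx, d_in))

a = contextual_layer(x)          # A(x)
a_ctx = contextual_layer(x, C)   # A(C, x)


def implicit_update(v):
    # General rank-1 update; requires v^T a != 0.
    return np.outer(W @ (a_ctx - a), v) / (v @ a)


# View 1: context goes through the layer, weights untouched.
view_1 = block(W, b, a_ctx)

# View 2: no context, but weights updated, for several choices of v.
candidates = [a, a / (a @ a), rng.normal(size=d_in)]
matches = [
    np.allclose(block(W + implicit_update(v), b, a), view_1)
    for v in candidates
]
print(matches)
```

Different choices of $v$ give different matrices $\Delta W$, yet the block output is the same in every case.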
1D picture
In 1D it all becomes a bit trivial. Say $A$ maps sequences of scalars to a scalar. Then we set up the problem as before (but now $W$ and $\Delta W$ are scalars as well) and we have:
$$\Delta W \, A(x) = W \left( A(C, x) - A(x) \right),$$
and so, assuming $A(x) \neq 0$,
$$\Delta W = W \, \frac{A(C, x) - A(x)}{A(x)}.$$
So in 1D, the implicit update is just the relative change in the contextual representation, scaled by $W$.
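A worked scalar example, assuming (purely for illustration) that the contextual layer is the mean of the context values and $x$:

```python
def contextual_layer(x, context=()):
    # Toy scalar A: mean of the context values together with x (assumption).
    values = list(context) + [x]
    return sum(values) / len(values)


W = 2.0
x = 4.0
C = [1.0, 7.0, 8.0]

a = contextual_layer(x)          # A(x) = 4.0
a_ctx = contextual_layer(x, C)   # A(C, x) = (1 + 7 + 8 + 4) / 4 = 5.0

# Relative change in the representation, scaled by W.
delta_W = W * (a_ctx - a) / a    # 2.0 * 1.0 / 4.0

print(delta_W)                   # 0.5
print((W + delta_W) * a)         # 10.0
print(W * a_ctx)                 # 10.0
```

Here the context raises the representation by 25%, so the implicit update raises the weight by 25% as well.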