(づ•ᴥ•)づ┬─┬

Notes on math and machine learning.

Paper notes: The implicit dynamics of in-context learning

https://arxiv.org/html/2507.16003v1

I liked this paper; I think it gives a nice way to think about how self-attention (and equivalent constructions) operates. Unless otherwise stated, all ideas below are from the authors and I'm just simplifying a bit.

The authors come up with an abstraction called a "contextual layer", i.e., a function $A$ that can map either a single vector $x$ to a vector, $A(x)$, or a context $C$ together with $x$, i.e., a collection of vectors $[C, x] = [c_1, \dots, c_n, x]$, to a vector, $A(C, x)$. Both output vectors have the same dimensionality.
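
As a minimal sketch of what such a layer could look like (my own toy choice, not a construction from the paper), here is single-query softmax attention with $x$ as the query and no learned projections. The name `contextual_layer` and the example vectors are hypothetical.

```python
import math

def contextual_layer(x, context=()):
    """Toy contextual layer A: softmax attention with x as the query.

    A(x) is the call with an empty context; A(C, x) passes the context vectors.
    The output always has the same dimensionality as x.
    """
    tokens = list(context) + [x]           # [c_1, ..., c_n, x]
    scores = [sum(t_i * x_i for t_i, x_i in zip(t, x)) for t in tokens]
    top = max(scores)                      # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    # Attention-weighted average of the tokens, one output coordinate at a time
    return [sum(w * t[j] for w, t in zip(weights, tokens)) / total
            for j in range(len(x))]

x = [1.0, 0.0]
C = [[0.0, 1.0], [1.0, 1.0]]
print(contextual_layer(x))      # A(x): attention over the single token x returns x itself
print(contextual_layer(x, C))   # A(C, x): a context-dependent vector of the same size
```

Any other map with the same two call signatures would do; attention is just the motivating case.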

Then we define a contextual block $T_W$, i.e., a composition of a function $M_W(z) = f_\theta(Wz + b)$ with $A$: $T_W = M_W \circ A$, and ask: what effect does a part of the context $Y \subset C$ have on $M_W$?

To answer this question, I will spell out the proof from the paper and simplify $T_W$ a bit.

Proof

Let us say $M_W(z) = Wz$ and we want to see the effect of $Y$ on $T_W$. To do this, we will search for a matrix $\Gamma$ such that

$$T_W(C, x) = T_{W+\Gamma}(C \setminus Y, x)$$

that is, we can move the effect of Y to the weight matrix and Γ acts like a weight update! Of course we don't know that such a Γ exists yet.

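From the definition of $T_W$:

$$T_{W+\Gamma}(C \setminus Y, x) = (W + \Gamma) A(C \setminus Y, x) = W A(C \setminus Y, x) + \Gamma A(C \setminus Y, x).$$

We would want this to be equal to $T_W(C, x)$, which means

$$W A(C \setminus Y, x) + \Gamma A(C \setminus Y, x) = W A(C, x)$$

or equivalently

$$\Gamma A(C \setminus Y, x) = W \big( A(C, x) - A(C \setminus Y, x) \big).$$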

It seems natural to define $\Delta A(Y) := A(C, x) - A(C \setminus Y, x)$, which is a discrete derivative of sorts. This quantity depends only on things we already know.

Next we would want to solve for Γ, except we basically have matrix * vector = vector, and the matrix is unknown -- there's not enough information to pin down the matrix.

So instead, we simplify further: suppose $\Gamma$ is rank 1, so there exist vectors $u, v$ such that $\Gamma = u v^T$. For typesetting reasons, I also set $m := A(C \setminus Y, x)$ and will write $\Delta A$ instead of $\Delta A(Y)$. Then

$$u v^T m = W \Delta A$$

This looks much better. In fact $v^T m$ is a scalar, so we may as well pick $v^T m = 1$, which is satisfied by $v = m / \|m\|^2$ (any other $v$ with $v^T m = 1$ would do as well). All we need is that $m \neq 0$.

With that selection of $v$, we get $u = W \Delta A$, and we are done: we have found a rank-1 matrix $\Gamma = (W \Delta A)\, m^T / \|m\|^2$ that updates $W$ as we would like.
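
The construction is easy to check numerically. The sketch below assumes a toy contextual layer (single-query softmax attention with no learned projections, my choice rather than the paper's) and made-up values for $W$, $C$, $Y$ and $x$; the helper names `A` and `matvec` are hypothetical.

```python
import math

def A(x, context=()):
    # Toy contextual layer (an assumption): softmax attention with x as the query.
    tokens = list(context) + [x]
    scores = [sum(a * b for a, b in zip(t, x)) for t in tokens]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * t[j] for w, t in zip(weights, tokens)) / total
            for j in range(len(x))]

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

W = [[1.0, 2.0], [-1.0, 0.5]]
x = [1.0, 0.0]
C = [[0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
Y = [C[2]]                                        # part of the context to absorb into W
C_minus_Y = [c for c in C if c not in Y]          # C \ Y

m = A(x, C_minus_Y)                               # m = A(C \ Y, x), must be nonzero
delta_A = [a - b for a, b in zip(A(x, C), m)]     # ΔA = A(C, x) - A(C \ Y, x)
u = matvec(W, delta_A)                            # u = W ΔA
norm_sq = sum(m_i * m_i for m_i in m)
v = [m_i / norm_sq for m_i in m]                  # v = m / ||m||^2, so v^T m = 1
Gamma = [[u_i * v_j for v_j in v] for u_i in u]   # Γ = u v^T (rank 1)

W_new = [[w + g for w, g in zip(rw, rg)] for rw, rg in zip(W, Gamma)]
lhs = matvec(W, A(x, C))                          # T_W(C, x)
rhs = matvec(W_new, m)                            # T_{W+Γ}(C \ Y, x)
print(max(abs(a - b) for a, b in zip(lhs, rhs)))  # zero up to floating-point error
```

Nothing here is specific to attention: the identity only uses the two vectors $A(C, x)$ and $A(C \setminus Y, x)$, so swapping in another contextual layer leaves the check unchanged.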

How many of those are there? Well, for every matrix B such that

$$B A(C \setminus Y, x) = 0$$

the matrix Γ+B would also work for our purposes!
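
One way to produce such a $B$ (an illustration of mine, not from the paper) is to take any matrix $R$ and project out its action along $m = A(C \setminus Y, x)$, i.e., $B = R\,(I - m m^T / \|m\|^2)$. The values of `m`, `u` and `R` below are made up; `m` and `u` stand in for $A(C \setminus Y, x)$ and $W \Delta A$.

```python
def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

m = [0.8, 0.6]                     # stands in for A(C \ Y, x); any nonzero vector
u = [1.0, -2.0]                    # stands in for W ΔA
norm_sq = sum(a * a for a in m)
Gamma = [[u_i * m_j / norm_sq for m_j in m] for u_i in u]   # Γ = u m^T / ||m||^2

R = [[3.0, 1.0], [0.0, -4.0]]      # arbitrary matrix (hypothetical choice)
Rm = matvec(R, m)
# B = R (I - m m^T / ||m||^2): R with its action along m removed, so B m = 0
B = [[R[i][j] - Rm[i] * m[j] / norm_sq for j in range(2)] for i in range(2)]

Gamma_alt = [[g + b for g, b in zip(rg, rb)] for rg, rb in zip(Gamma, B)]
print(matvec(B, m))           # approximately the zero vector
print(matvec(Gamma_alt, m))   # approximately u, the same as Γ m
```

Note that $\Gamma + B$ need not be rank 1 anymore, so the rank-1 solution is a convenient choice rather than the only one.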

With this result, we can view the action of the contextual block in two ways:

  1. We supply an input $z$, then we get $h = A(C, z)$ from the contextual layer and $Wh$ from the contextual block, or
  2. With some $Y \subset C$, we update the weight $W \to W' = W + \Gamma$, then we compute $h = A(C \setminus Y, z)$ to get the representation of $z$, and finally $W'h$ gives us the output.

We have the freedom to pick $Y$ however we like, even $Y = C$, as long as $A(C \setminus Y, z) \neq 0$.

1D picture

In 1D it all becomes a bit trivial. Say $A$ maps sequences of scalars to a scalar. Setting up the problem as before (but now $W$ and $\Gamma$ are scalars as well), we have:

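$$W A(x_1, \dots, x_n) = (W + \Gamma) A(x_1, \dots, x_{n-1})$$

and so, assuming $A(x_1, \dots, x_{n-1}) \neq 0$,

$$\Gamma = W \, \frac{A(x_1, \dots, x_n) - A(x_1, \dots, x_{n-1})}{A(x_1, \dots, x_{n-1})}.$$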

So in 1D, the implicit update is just the relative change in the contextual representation, scaled by $W$.
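
A quick sanity check of the 1D formula, assuming (purely for illustration) that $A$ is the mean of the sequence; the numbers are made up.

```python
def A(xs):
    # Toy 1D contextual layer (an assumption): the mean of the sequence.
    return sum(xs) / len(xs)

W = 2.0
xs = [1.0, 3.0, 5.0, 7.0]           # x_1, ..., x_n

full = A(xs)                        # A(x_1, ..., x_n)
short = A(xs[:-1])                  # A(x_1, ..., x_{n-1}), must be nonzero
Gamma = W * (full - short) / short  # relative change in A, scaled by W

# The two sides of the defining equation, which should agree up to rounding
print(W * full, (W + Gamma) * short)
```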