

Model Merging and the Geometry of Optimization

TL;DR: Here I’m revisiting older notes on model merging. Gradient descent steps can be viewed as model merges, and model merges can often be interpreted as gradient steps. Making this connection explicit shows that smooth merging methods recover pre-conditioned gradient descent, while linear merging implicitly assumes a Euclidean geometry. When models are trained with different optimizers or pre-conditioners, this assumption can fail, helping explain when naive averaging works and when geometry-aware merging methods are needed.


Gradient steps as model merging

Consider a model θ0 that we wish to optimize with respect to a loss L over a dataset B. We will use batch gradient descent (GD) to carry out the optimization; the first step, for some step size η > 0, is:

θ1 = θ0 − η∇LB(θ0).

We can massage this equation to get

θ1 = θ0 − η∇LB(θ0) = (1 − η)θ0 + η(θ0 − ∇LB(θ0)).

Thus we have written the GD step as a linear model merge (often called LERP) between the original model θ0 and a unit-step local model θL := θ0 − ∇LB(θ0) (“local” to the dataset/batch we use, which is different from the local models of federated learning¹). We expect θL to have a smaller loss than θ0.²

Local-model substitution: using θA := θ and θB := θ − ∇LB(θ) (or θB(η) := θ − η∇LB(θ)) as the models in a merging method.
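The LERP identity above is easy to check numerically. Here's a minimal NumPy sketch, using a toy quadratic loss L_B(θ) = ½‖θ‖² (so ∇LB(θ) = θ) as a stand-in for the real loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy quadratic loss L_B(θ) = ½‖θ‖², so ∇L_B(θ) = θ.
grad = lambda theta: theta

theta0 = rng.normal(size=5)
eta = 0.1

# Plain gradient descent step.
gd_step = theta0 - eta * grad(theta0)

# The same step written as a LERP between θ0 and the unit-step
# local model θ_L := θ0 − ∇L_B(θ0).
theta_L = theta0 - grad(theta0)
lerp = (1 - eta) * theta0 + eta * theta_L

print(np.allclose(gd_step, lerp))  # → True
```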

It’s reasonable to try to work backwards from a model merging method to an optimization method: apply the local-model substitution above and see what optimization method pops out. For example:

For smooth model-merging methods, the θA, θB substitution never gives surprising results, in some sense:

Lemma 1 (Smooth merging functions recover GD steps with small step-size)

Let f : ℝⁿ × ℝⁿ → ℝⁿ be a C¹ “merging” function, mapping (θa, θb) ↦ θ, and assume:

  1. Consistency: for all θ, f(θ,θ)=θ.

  2. Jacobian at the diagonal: the partial derivative w.r.t. the second argument exists and

    ∂f/∂θb |θb=θ = A

    for some matrix A.

Define the update

θn+1 := f(θn, θn − η∇L(θn)).

Then, as η → 0,

θn+1 = f(θn, θn − η∇L(θn)) = θn − ηA∇L(θn) + O(η²).

So a smooth merging function recovers a pre-conditioned GD method in the small η limit.

Proof

Since f is C¹, we can Taylor expand in the second argument around η = 0 (i.e., around θb = θn):

f(θn, θn − η∇L(θn)) = f(θn, θn) + ∂f/∂θb |θb=θn · (−η∇L(θn)) + O(η²).

Using the assumptions f(θn, θn) = θn and ∂f/∂θb |θb=θn = A, we obtain

f(θn, θn − η∇L(θn)) = θn − ηA∇L(θn) + O(η²),

which concludes the proof.

When η → 0, we have θn − η∇L(θn) ≈ θn, so the two “models” being merged are extremely close in parameter space. This regime is not representative of most practical model merging settings, where models are typically separated by many optimization steps.
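To see the lemma in action, here's a NumPy sketch with a made-up smooth merging function: it is consistent (f(θ, θ) = θ), has Jacobian A at the diagonal, and carries an extra term quadratic in θb − θa. The gap between the merge and the predicted step θn − ηA∇L(θn) should then shrink like η²:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.normal(size=(n, n))  # Jacobian of f in its second argument at the diagonal
v = rng.normal(size=n)

# A smooth merging function with f(θ, θ) = θ and ∂f/∂θb |_{θb=θa} = A.
# The extra term is quadratic in θb − θa, so it vanishes to first order.
def merge(ta, tb):
    d = tb - ta
    return ta + A @ d + np.dot(d, d) * v

theta = rng.normal(size=n)
g = rng.normal(size=n)  # stand-in for ∇L(θ)

# Error of the first-order prediction θ − ηA∇L(θ) for shrinking η.
errs = []
for eta in (1e-2, 1e-3):
    err = np.linalg.norm(merge(theta, theta - eta * g) - (theta - eta * A @ g))
    errs.append(err)

# Shrinking η by 10× should shrink the error by ~100×, i.e. O(η²).
print(errs[0] / errs[1])  # ≈ 100
```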

Next we look at the opposite direction: from model merging to gradient steps.

Linear model merging as gradient steps

Consider two models θ1 and θ2, derived from a common initialization θ0 and optimized with gradient descent on different loss functions L1 and L2 respectively. Then, for θ2, there exist a point θ̄2 and a learning rate η such that:

θ2 = θ̄2 − η∇L2(θ̄2).

These⁴ must exist, as they represent the last optimization step that produced θ2.

Now, for a linear model merge with interpolation parameter a ∈ [0, 1]:

θ* = aθ1 + (1 − a)θ2.

By substituting the gradient descent representation of θ2:

θ* = aθ1 + (1 − a)(θ̄2 − η∇L2(θ̄2)) = [aθ1 + (1 − a)θ̄2] − (1 − a)η∇L2(θ̄2).

Thus the merged model θ* can be seen as a gradient descent step where:

  - the starting point is the interpolated model aθ1 + (1 − a)θ̄2,
  - the loss is L2, with the gradient evaluated at θ̄2, and
  - the effective step size is (1 − a)η.

So linear model merging is taking a partial gradient step from an interpolated position in parameter space, with the size of the step modulated by the merge coefficient a.
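A quick numerical check of this rewriting, as a NumPy sketch (L2 is a hypothetical quadratic loss and θ̄2 is drawn at random):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
a, eta = 0.3, 0.05

# Toy loss L2 with minimum at the all-ones vector.
grad_L2 = lambda theta: theta - 1.0

theta1 = rng.normal(size=n)
theta2_bar = rng.normal(size=n)
theta2 = theta2_bar - eta * grad_L2(theta2_bar)  # last GD step producing θ2

# Naive linear merge ...
merged = a * theta1 + (1 - a) * theta2

# ... equals a gradient step of size (1 − a)η, taken from the
# interpolated point aθ1 + (1 − a)θ̄2, with ∇L2 evaluated at θ̄2.
start = a * theta1 + (1 - a) * theta2_bar
as_gd_step = start - (1 - a) * eta * grad_L2(theta2_bar)

print(np.allclose(merged, as_gd_step))  # → True
```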


Riemannian geometry and model merging

Lemma 1 says that smooth model merging methods, when used with the local-model substitution we discussed earlier, behave like pre-conditioned gradient descent. In light of all this, I think we can sense-check linear model merging when applied to models trained with different optimizers.

Suppose we have θ(1),θ(2) representing the model parameters of two models trained with the losses L1,L2 and different optimization schemes, which we will assume are:

θn+1(k) = θn(k) − ηk Pk ∇Lk(θn(k)),

where k = 1, 2, the Pk are symmetric positive-definite pre-conditioners with P1 ≠ P2, ηk > 0, and we assume the two iterations converge to θ*(1) and θ*(2) respectively.

Without getting too deeply into Riemannian geometry, the presence of the pre-conditioner Pk in gradient descent (GD) means that the corresponding GD takes steps in a geometry with metric Gk = Pk⁻¹. We can then measure the distance of a θ from θ*(k) using

Jk(θ) := (θ − θ*(k))ᵀ Gk (θ − θ*(k)) = ‖θ − θ*(k)‖²_Gk,

which uses the metric that each GD is optimizing in. For example, if Gk = diag(1, 100), the corresponding Jk would harshly penalize deviations in the second dimension.

Given this, we would want a successful merge to be close to

θmerge ≈ θopt = argminθ Jmerge(θ) = argminθ {J1(θ) + J2(θ)}.

It turns out that we can find θopt exactly, as the first-order condition is linear:

θopt = (G1 + G2)⁻¹ (G1 θ*(1) + G2 θ*(2)).
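We can sanity-check this closed form numerically: since ∇(J1 + J2)(θ) = 2G1(θ − θ*(1)) + 2G2(θ − θ*(2)), the gradient must vanish at θopt. A sketch with random symmetric positive-definite metrics:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4

def random_spd(n):
    # Symmetric positive-definite matrix: M Mᵀ plus a diagonal shift.
    M = rng.normal(size=(n, n))
    return M @ M.T + n * np.eye(n)

G1, G2 = random_spd(n), random_spd(n)
t1, t2 = rng.normal(size=n), rng.normal(size=n)  # θ*(1), θ*(2)

# Closed-form minimizer of J1 + J2.
theta_opt = np.linalg.solve(G1 + G2, G1 @ t1 + G2 @ t2)

# First-order condition: ∇(J1 + J2)(θopt) should be (numerically) zero.
grad_at_opt = 2 * G1 @ (theta_opt - t1) + 2 * G2 @ (theta_opt - t2)
print(np.allclose(grad_at_opt, 0))  # → True
```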

Toy example

Now assume G1 = diag(100, 1) and G2 = diag(1, 100), with θ*(1) = (0, 0) and θ*(2) = (1, 1). The naive 50/50 linear merge gives θlin = ½(θ*(1) + θ*(2)) = (0.5, 0.5), while the formula above gives θopt = (1/101, 100/101).

We can then compute the errors of both according to Jmerge:

Jmerge(θlin) = 50.5.

and

Jmerge(θopt) = 1.98.

As expected, the optimal merge has much smaller error compared to the naive merge because the naive merge does not account for the geometry.
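The numbers above can be reproduced in a few lines of NumPy (θlin denotes the naive 50/50 merge):

```python
import numpy as np

G1, G2 = np.diag([100.0, 1.0]), np.diag([1.0, 100.0])
t1, t2 = np.array([0.0, 0.0]), np.array([1.0, 1.0])

def J(theta, G, target):
    # Squared distance under metric G: (θ − target)ᵀ G (θ − target).
    d = theta - target
    return d @ G @ d

J_merge = lambda theta: J(theta, G1, t1) + J(theta, G2, t2)

theta_lin = 0.5 * (t1 + t2)                              # naive 50/50 merge
theta_opt = np.linalg.solve(G1 + G2, G1 @ t1 + G2 @ t2)  # geometry-aware merge

print(round(J_merge(theta_lin), 2))  # → 50.5
print(round(J_merge(theta_opt), 2))  # → 1.98
```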

This toy example illustrates a general failure mode of naive linear merging. When models are trained with different pre-conditioners (or even different optimizers!), they implicitly optimize under different geometries. Linear interpolation assumes a shared Euclidean geometry and therefore computes the wrong midpoint. The resulting merged model can be arbitrarily far from optimal.⁵

Notably, several existing merging methods can be understood precisely as attempts to respect this geometry, e.g., Fisher merging⁶, TIES-merging³, etc.

Footnotes

  1. Sebastian U. Stich. Local SGD converges fast and communicates little, 2019. URL https://arxiv.org/abs/1805.09767.

  2. From this POV, the optimization can be seen as an incremental merge of multiple local models with the initial (randomly initialized) θ0, that is: θn+1 = (1 − η)·θn + η·θL(θn, Bn), which in turn is like an exponential moving average (EMA) of n+1 models, where Bn is the n-th batch (it's not exactly an EMA because the term θL(θn, Bn) depends on the current state).

  3. Prateek Yadav, Derek Tam, Leshem Choshen, Colin A. Raffel, and Mohit Bansal. TIES-Merging: Resolving interference when merging models. Advances in Neural Information Processing Systems, 36, 2024.

  4. There's nothing special about using θ2; we can make the same argument w.r.t. θ1.

  5. To be clear, a failure to decrease Jmerge does not in general imply a failure to decrease the combined loss L1+L2. Such an implication only holds in a local regime, when we are sufficiently close to θ*, the losses are smooth, and the metric G approximates the local Hessian.

  6. Michael S. Matena and Colin A. Raffel. Merging models with Fisher-weighted averaging. Advances in Neural Information Processing Systems, 35, 17703–17716, 2022.