

Model Merging and the Geometry of Optimization

TL;DR: Here I’m revisiting older notes on model merging. Gradient descent steps can be viewed as model merges, and model merges can often be interpreted as gradient steps. Making this connection explicit shows that smooth merging methods recover pre-conditioned gradient descent, while linear merging implicitly assumes a Euclidean geometry. When models are trained with different optimizers or pre-conditioners, this assumption can fail, helping explain when naive averaging works and when geometry-aware merging methods are needed.


Gradient steps as model merging

Consider a model θ0 that we wish to optimize with respect to a loss L over a dataset B. We will use batch gradient descent (GD) to carry out the optimization; the first step, for some step size η > 0, is:

θ1 = θ0 − η∇LB(θ0).

We can massage this equation to get

θ1 = θ0 − η∇LB(θ0) = (1 − η)θ0 + η(θ0 − ∇LB(θ0)).

Thus we have written the GD step as a linear model merge (often called LERP) between the original model θ0 and a unit-step local model θL := θ0 − ∇LB(θ0) (“local” to the dataset/batch we use, which is different from the local models of federated learning¹). We expect θL to have a smaller loss than θ0.²

Local-model substitution: using θA := θ and θB := θ − ∇LB(θ) (or θB(η) := θ − η∇LB(θ)) as the models in a merging method.
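The LERP identity above is easy to check numerically. Here's a minimal NumPy sketch, using a toy quadratic loss L_B(θ) = ½‖θ‖² (so ∇LB(θ) = θ) as a stand-in for the real loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy quadratic loss L_B(θ) = ½‖θ‖², so ∇L_B(θ) = θ.
grad = lambda theta: theta

theta0 = rng.normal(size=5)
eta = 0.1

# Plain gradient descent step.
gd_step = theta0 - eta * grad(theta0)

# The same step written as a LERP between θ0 and the unit-step
# local model θ_L := θ0 − ∇L_B(θ0).
theta_L = theta0 - grad(theta0)
lerp = (1 - eta) * theta0 + eta * theta_L

print(np.allclose(gd_step, lerp))  # → True
```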

It’s reasonable to try to work backwards from a model merging method to an optimization method: apply the local-model substitution above and see what optimization method pops out. For example:

For smooth model-merging methods, the θA, θB substitution never gives surprising results, in some sense:

Lemma 1 (Smooth merging functions recover GD steps with small step-size)

Let f : ℝⁿ × ℝⁿ → ℝⁿ be a C¹ “merging” function, mapping (θa, θb) ↦ θ, and assume:

  1. Consistency: for all θ, f(θ,θ)=θ.

  2. Jacobian at the diagonal: the partial derivative w.r.t. the second argument exists and

    ∂f/∂θb |θb=θ = A

    for some matrix A.

Define the update

θn+1 := f(θn, θn − η∇L(θn)).

Then, as η → 0,

θn+1 = f(θn, θn − η∇L(θn)) = θn − ηA∇L(θn) + O(η²).

So a smooth merging function recovers a pre-conditioned GD method in the small η limit.

Proof

Since f is C¹, we can Taylor expand in the second argument around η = 0 (i.e., around θb = θn):

f(θn, θn − η∇L(θn)) = f(θn, θn) + ∂f/∂θb |θb=θn · (−η∇L(θn)) + O(η²).

Using the assumptions f(θn, θn) = θn and ∂f/∂θb |θb=θn = A, we obtain

f(θn, θn − η∇L(θn)) = θn − ηA∇L(θn) + O(η²),

which concludes the proof.

When η → 0, we have θn − η∇L(θn) ≈ θn, so the two “models” being merged are extremely close in parameter space. This regime is not representative of most practical model merging settings, where models are typically separated by many optimization steps.
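To see the lemma in action, here's a NumPy sketch with a made-up smooth merging function: it is consistent (f(θ, θ) = θ), has Jacobian A at the diagonal, and carries an extra term quadratic in θb − θa. The gap between the merge and the predicted step θn − ηA∇L(θn) should then shrink like η²:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.normal(size=(n, n))  # Jacobian of f in its second argument at the diagonal
v = rng.normal(size=n)

# A smooth merging function with f(θ, θ) = θ and ∂f/∂θb |_{θb=θa} = A.
# The extra term is quadratic in θb − θa, so it vanishes to first order.
def merge(ta, tb):
    d = tb - ta
    return ta + A @ d + np.dot(d, d) * v

theta = rng.normal(size=n)
g = rng.normal(size=n)  # stand-in for ∇L(θ)

# Error of the first-order prediction θ − ηA∇L(θ) for shrinking η.
errs = []
for eta in (1e-2, 1e-3):
    err = np.linalg.norm(merge(theta, theta - eta * g) - (theta - eta * A @ g))
    errs.append(err)

# Shrinking η by 10× should shrink the error by ~100×, i.e. O(η²).
print(errs[0] / errs[1])  # ≈ 100
```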

Next we look at the opposite direction: from model merging to gradient steps.

Linear model merging as gradient steps

Consider two models θ1 and θ2, derived from a common initialization θ0 and optimized with gradient descent on different loss functions L1 and L2 respectively. Then, for θ2, there exist a point θ̄2 and a learning rate η such that:

θ2 = θ̄2 − η∇L2(θ̄2).

These⁴ must exist, as they represent the last optimization step that produced θ2.

Now, for a linear model merge with interpolation parameter a ∈ [0, 1]:

θ* = aθ1 + (1 − a)θ2.

By substituting the gradient descent representation of θ2:

θ* = aθ1 + (1 − a)(θ̄2 − η∇L2(θ̄2)) = [aθ1 + (1 − a)θ̄2] − (1 − a)η∇L2(θ̄2).

Thus the merged model θ* can be seen as a gradient descent step where:

  - the starting point is the interpolated model aθ1 + (1 − a)θ̄2,
  - the loss is L2, with the gradient evaluated at θ̄2, and
  - the effective step size is (1 − a)η.

So linear model merging is taking a partial gradient step from an interpolated position in parameter space, with the size of the step modulated by the merge coefficient a.
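A quick numerical check of this rewriting, as a NumPy sketch (L2 is a hypothetical quadratic loss and θ̄2 is drawn at random):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
a, eta = 0.3, 0.05

# Toy loss L2 with minimum at the all-ones vector.
grad_L2 = lambda theta: theta - 1.0

theta1 = rng.normal(size=n)
theta2_bar = rng.normal(size=n)
theta2 = theta2_bar - eta * grad_L2(theta2_bar)  # last GD step producing θ2

# Naive linear merge ...
merged = a * theta1 + (1 - a) * theta2

# ... equals a gradient step of size (1 − a)η, taken from the
# interpolated point aθ1 + (1 − a)θ̄2, with ∇L2 evaluated at θ̄2.
start = a * theta1 + (1 - a) * theta2_bar
as_gd_step = start - (1 - a) * eta * grad_L2(theta2_bar)

print(np.allclose(merged, as_gd_step))  # → True
```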


Riemannian geometry and model merging

Lemma 1 says that smooth model merging methods, when used with the local-model substitution we discussed earlier, behave like pre-conditioned gradient descent. In light of all this, I think we can sense-check linear model merging when applied to models trained with different optimizers.

Suppose we have θ(1),θ(2) representing the model parameters of two models trained with the losses L1,L2 and different optimization schemes, which we will assume are:

θn+1(k) = θn(k) − ηk Pk ∇Lk(θn(k)),

where k = 1, 2, the Pk are symmetric positive-definite pre-conditioners with P1 ≠ P2, ηk > 0, and we assume the two iterations converge to θ*(1) and θ*(2) respectively.

Without getting too deeply into Riemannian geometry, the presence of the pre-conditioner Pk in gradient descent (GD) means that the corresponding GD takes steps in a geometry with metric Gk = Pk⁻¹. We can then measure the distance of a θ from θ*(k) using

Jk(θ) := (θ − θ*(k))ᵀ Gk (θ − θ*(k)) = ‖θ − θ*(k)‖²_Gk,

which uses the metric that each GD is optimizing in. For example, if Gk = diag(1, 100), the corresponding Jk would harshly penalize deviations in the second dimension.

Given this, we would want a successful merge to be close to

θmerge ≈ θopt = argminθ Jmerge(θ) = argminθ {J1(θ) + J2(θ)}.

It turns out that we can find θopt exactly, as the first-order condition is linear:

θopt = (G1 + G2)⁻¹ (G1 θ*(1) + G2 θ*(2)).
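We can sanity-check this closed form numerically: since ∇(J1 + J2)(θ) = 2G1(θ − θ*(1)) + 2G2(θ − θ*(2)), the gradient must vanish at θopt. A sketch with random symmetric positive-definite metrics:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4

def random_spd(n):
    # Symmetric positive-definite matrix: M Mᵀ plus a diagonal shift.
    M = rng.normal(size=(n, n))
    return M @ M.T + n * np.eye(n)

G1, G2 = random_spd(n), random_spd(n)
t1, t2 = rng.normal(size=n), rng.normal(size=n)  # θ*(1), θ*(2)

# Closed-form minimizer of J1 + J2.
theta_opt = np.linalg.solve(G1 + G2, G1 @ t1 + G2 @ t2)

# First-order condition: ∇(J1 + J2)(θopt) should be (numerically) zero.
grad_at_opt = 2 * G1 @ (theta_opt - t1) + 2 * G2 @ (theta_opt - t2)
print(np.allclose(grad_at_opt, 0))  # → True
```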

Toy example

Now assume G1 = diag(100, 1) and G2 = diag(1, 100), with θ*(1) = (0, 0) and θ*(2) = (1, 1). The naive 50/50 linear merge gives θlin = ½(θ*(1) + θ*(2)) = (0.5, 0.5), while the formula above gives θopt = (1/101, 100/101).

We can then compute the errors of both according to Jmerge:

Jmerge(θlin) = 50.5.

and

Jmerge(θopt) = 1.98.

As expected, the optimal merge has much smaller error compared to the naive merge because the naive merge does not account for the geometry.
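The numbers above can be reproduced in a few lines of NumPy (θlin denotes the naive 50/50 merge):

```python
import numpy as np

G1, G2 = np.diag([100.0, 1.0]), np.diag([1.0, 100.0])
t1, t2 = np.array([0.0, 0.0]), np.array([1.0, 1.0])

def J(theta, G, target):
    # Squared distance under metric G: (θ − target)ᵀ G (θ − target).
    d = theta - target
    return d @ G @ d

J_merge = lambda theta: J(theta, G1, t1) + J(theta, G2, t2)

theta_lin = 0.5 * (t1 + t2)                              # naive 50/50 merge
theta_opt = np.linalg.solve(G1 + G2, G1 @ t1 + G2 @ t2)  # geometry-aware merge

print(round(J_merge(theta_lin), 2))  # → 50.5
print(round(J_merge(theta_opt), 2))  # → 1.98
```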

This toy example illustrates a general failure mode of naive linear merging. When models are trained with different pre-conditioners (or even different optimizers!), they implicitly optimize under different geometries. Linear interpolation assumes a shared Euclidean geometry and therefore computes the wrong midpoint. The resulting merged model can be arbitrarily far from optimal.⁵

Notably, several existing merging methods can be understood precisely as attempts to respect this geometry, e.g., Fisher merging⁶, TIES-merging³, etc.

Footnotes

  1. Sebastian U. Stich. Local SGD converges fast and communicates little, 2019. URL https://arxiv.org/abs/1805.09767.

  2. From this POV, the optimization can be seen as an incremental merge of multiple local models with the initial (randomly initialized) θ0, that is: θn+1 = (1 − η)·θn + η·θL(θn, Bn), which in turn is like an exponential moving average (EMA) of n+1 models, where Bn is the n-th batch (it's not exactly an EMA because the term θL(θn, Bn) depends on the current state).

  3. Prateek Yadav, Derek Tam, Leshem Choshen, Colin A. Raffel, and Mohit Bansal. TIES-Merging: Resolving interference when merging models. Advances in Neural Information Processing Systems, 36, 2024.

  4. There's nothing special about using θ2; we can make the same argument w.r.t. θ1.

  5. To be clear, a failure to decrease Jmerge does not in general imply a failure to decrease the combined loss L1+L2. Such an implication only holds in a local regime, when we are sufficiently close to θ*, the losses are smooth, and the metric G approximates the local Hessian.

  6. Michael S. Matena and Colin A. Raffel. Merging models with Fisher-weighted averaging. Advances in Neural Information Processing Systems, 35, 17703–17716, 2022.