Targeting the gradient flow

In this post, we will first summarize what we discussed in part 1 and part 2, then add a few extra thoughts.

Summary of the two posts

Part 1: we picked a function $f: \mathbb{R} \to \mathbb{R}$ and built a simple vector field such that for any trajectory $(x, y): \mathbb{R}_{>0} \to \mathbb{R}^2$, the trajectory is attracted to the graph of $f$, in the sense that if $u(t) := y(t) - f(x(t))$, then $\lim_{t \to \infty} u(t) = 0$. To do this, we defined the residual $u := y - f(x)$ and forced it to satisfy the ODE $u' = -ku$, $k > 0$ (which also gives an ODE for $y$). We also picked an arbitrary direction for $x$ to move in, $x' = 1$. We showed that the corresponding vector field $F(x, y) = (x', y')$ works for our purposes!
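
To make this concrete, here is a minimal numerical sketch of that construction; the particular $f$, the constants, and the forward-Euler integrator are my own choices for illustration, not necessarily what part 1 used:

```python
# Minimal sketch of the Part 1 field: x' = 1 and y' chosen so that the
# residual u = y - f(x) obeys u' = -k u. Example f, k, dt are assumptions.
import numpy as np

f  = lambda x: np.sin(x)          # example target graph
df = lambda x: np.cos(x)

def F(x, y, k=2.0):
    """Vector field (x', y')."""
    u = y - f(x)
    return 1.0, df(x) - k * u     # y' = f'(x) x' + u' = f'(x) - k u

x, y, dt = 0.0, 3.0, 1e-3         # start far above the graph
for _ in range(10_000):
    dx, dy = F(x, y)
    x, y = x + dt * dx, y + dt * dy

print(abs(y - f(x)))              # residual: driven to ~0, up to Euler error
```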

We discussed that we can extend the construction to vector fields on $\mathbb{R}^n$, as well as the case of $k$ being a function of $x$ instead of a constant.

Finally, given an existing vector field $F$ and the assumption $u' = -ku$, we showed that $u$ and $F$ satisfy a transport PDE:

$$F \cdot \nabla u + ku = 0.$$
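
As a quick sanity check (my addition, using sympy), the Part 1 field above does satisfy this PDE:

```python
# Symbolic check that F · ∇u + k u = 0 for the Part 1 field
# F = (1, f'(x) - k u) with u = y - f(x). The sympy check is my addition.
import sympy as sp

x, y, k = sp.symbols("x y k")
f = sp.Function("f")(x)

u = y - f
F = (1, sp.diff(f, x) - k * u)                    # (x', y')
pde = F[0] * sp.diff(u, x) + F[1] * sp.diff(u, y) + k * u

print(sp.simplify(pde))                            # prints 0
```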

Part 2: we applied some of these ideas to optimization. Starting from a smooth $g: \mathbb{R} \to \mathbb{R}$ for which we want to find a minimum, we created various vector fields $F$ such that the trajectories following them converge to the stationary points of $g$. We used a similar construction, setting $x' = -g'(x)$ and getting an ODE for $y$ from $u = y - g(x)$ and $u' = -ku$.

We also discovered that we have some freedom in picking $x'$, i.e., for any well-behaved $B: \mathbb{R}^2 \to \mathbb{R}$, setting $x' = -g'(x) + B(x, y)\,u$ will still give us convergent dynamics, as long as we are not completely cancelling $-g'(x)$ close to the stationary points. This opens up new possibilities, e.g., $B$ can also be a stochastic function, etc.
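
Here is a toy run of these dynamics with a stochastic $B$; the specific $g$, $B$, and constants are my own picks for illustration:

```python
# Toy Part 2 dynamics: x' = -g'(x) + B(x, y) u, with u = y - g(x) forced
# to decay via u' = -k u. All concrete choices below are assumptions.
import numpy as np

g  = lambda x: np.cos(3 * x) + 0.1 * x**2     # toy multi-well objective
dg = lambda x: -3 * np.sin(3 * x) + 0.2 * x

rng = np.random.default_rng(0)
B = lambda x, y: 5.0 * rng.standard_normal()  # a stochastic choice of B

k, dt = 1.0, 1e-3
x, y = 0.0, 4.0                               # large initial residual u0

for _ in range(20_000):
    u  = y - g(x)
    xp = -dg(x) + B(x, y) * u                 # extra term fades as u -> 0
    yp = dg(x) * xp - k * u                   # enforces u' = -k u
    x, y = x + dt * xp, y + dt * yp

print(x, dg(x))   # ends near a stationary point of g once u has decayed
```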

Follow-up questions and thoughts

It is fun watching the trajectories skip over some of the basins of attraction, but can we actually get more interesting behaviors out of picking the right $B$?

We assumed $u' = -ku \implies u = \exp(-kt)\,u_0$, and so no trajectory will have non-trivial periodic orbits. If a trajectory did, the residual would also have to be periodic, and that is impossible as $\exp(-kt)$ is not a periodic function. The only way for periodicity to hold is if $u(t) = 0$ for all $t > t_0$. However, on the attracting manifold we are following GD, which also doesn't have any non-trivial periodic orbits.

So, regardless of the choice of $B$ (assuming it is well-behaved), these methods cannot produce sustained exploration. There is an exponential clock, and once it rings, exploration is over: we are back to following gradient descent. We can give more time for exploration by adjusting the decay, e.g., setting $u = u_0/(1+t)$.
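
For example, $u = u_0/(1+t)$ satisfies $u' = -u/(1+t)$, i.e., it corresponds to the time-dependent rate $k(t) = 1/(1+t)$. A variant of the toy run above with this schedule (again, all concrete choices are mine) keeps the residual, and hence the exploration term, alive much longer:

```python
# Same toy dynamics, but with polynomial residual decay u = u0 / (1 + t),
# i.e. a time-dependent rate k(t) = 1 / (1 + t). Concrete choices are mine.
import numpy as np

g  = lambda x: np.cos(3 * x) + 0.1 * x**2
dg = lambda x: -3 * np.sin(3 * x) + 0.2 * x
rng = np.random.default_rng(0)
B  = lambda x, y: 5.0 * rng.standard_normal()

dt, t = 1e-3, 0.0
x, y = 0.0, 4.0

for _ in range(20_000):
    kt = 1.0 / (1.0 + t)              # the decay clock rings much later
    u  = y - g(x)
    xp = -dg(x) + B(x, y) * u
    yp = dg(x) * xp - kt * u
    x, y = x + dt * xp, y + dt * yp
    t  += dt

print(abs(y - g(x)))                  # ~ u0 / (1 + t): still sizeable at t = 20
```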

We don't have to match $g$.

In part 2, we had the residual be $u = y - g(x)$ and $u' = -ku$ (or, equivalently, had $u$ satisfy our transport PDE). However, what if instead of targeting the graph of $g$, we target the gradient flow $-\nabla g$ directly?

To do this, we will work in the $(x, v)$ space where $v = x'$. We set $u = v - (-\nabla g(x)) = v + \nabla g(x)$, i.e., we are getting trajectories to match the gradient flow $-\nabla g(x)$. Now

$$u' = v' + \nabla^2 g(x)\,x' = v' + \nabla^2 g(x)\,v.$$

If we now enforce $u' = -ku$ (and do a few calculations), we get:

$$x'' + \left(k + \nabla^2 g(x)\right)x' + k\,\nabla g(x) = 0.$$

So the same residual-decay idea produces second-order optimization dynamics if we focus on gradient flow.
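
A quick numerical check of this (with an example one-dimensional $g$, so $\nabla g = g'$, and constants of my choosing): integrate the second-order ODE as a first-order system in $(x, v)$ and watch the residual $u = v + g'(x)$ decay.

```python
# x'' + (k + g''(x)) x' + k g'(x) = 0 as a first-order system in (x, v).
# Example g, k, dt, and the Euler integrator are my assumptions.
import numpy as np

g   = lambda x: np.cos(3 * x) + 0.1 * x**2
dg  = lambda x: -3 * np.sin(3 * x) + 0.2 * x
d2g = lambda x: -9 * np.cos(3 * x) + 0.2

k, dt = 2.0, 1e-3
x, v = 0.0, 5.0                    # v = x', starting far off the flow

for _ in range(10_000):
    a = -(k + d2g(x)) * v - k * dg(x)   # v' from the second-order ODE
    x, v = x + dt * v, v + dt * a

print(abs(v + dg(x)))              # residual u = v + g'(x): ~0, up to Euler error
```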

I was wondering if this is a known second-order ODE. It looks like a special case of the form studied by Alvarez et al.¹: $x'' + \left(\alpha + \beta\,\nabla^2 g(x)\right)x' + \nabla g(x) = 0$.
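
To make the comparison concrete (this is my own back-of-the-envelope rescaling, so take it with a grain of salt): substituting $s = \sqrt{k}\,t$ into our ODE gives

$$\frac{d^2 x}{ds^2} + \left(\sqrt{k} + \frac{1}{\sqrt{k}}\,\nabla^2 g(x)\right)\frac{dx}{ds} + \nabla g(x) = 0,$$

i.e., the Alvarez et al. form with $\alpha = \sqrt{k}$ and $\beta = 1/\sqrt{k}$. So our family seems to sit on the $\alpha\beta = 1$ slice of theirs.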

There's potential to recover further ODEs from the literature (e.g., with $1/t$ damping) by allowing time-dependence in $k$ and setting $u = v + \eta(t)\,\nabla g(x)$. I leave that for future work.

TL;DR: All this came out of picking a residual form and then attaching a decay assumption.

Footnotes

  1. Alvarez, F., Attouch, H., Bolte, J. and Redont, P., 2002. A second-order gradient-like dissipative dynamical system with Hessian-driven damping: Application to optimization and mechanics. Journal de Mathématiques Pures et Appliquées, 81(8), pp. 747–779.