Targeting the gradient flow
In this post, we will first summarize what we discussed in part 1 and part 2, then add a few extra thoughts.
Summary of the two posts
Part 1: we picked a function $f: \mathbb{R} \to \mathbb{R}$ and built a simple vector field such that for any trajectory $(x(t), y(t))$, the trajectory is attracted to the graph of $f$, in the sense that if $r(t) = y(t) - f(x(t))$, then $r(t) \to 0$. To do this, we defined the residual $r = y - f(x)$ and forced it to satisfy the ODE $\dot{r} = -\lambda r$ (which also gives an ODE for $y$). We also picked an arbitrary direction for $x$ to move in. We showed that the corresponding vector field works for our purposes!
We discussed that we can extend the construction to vector fields on $\mathbb{R}^{n+1}$, as well as the case of $f$ being a function of $t$ instead of constant in time.
Finally, given an existing vector field $V = (v_x, v_y)$ and the assumption that trajectories stay on the graph $y = f(x, t)$, we showed that $f$ and $V$ satisfy a transport PDE:

$$\partial_t f + v_x \, \partial_x f = v_y.$$
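As a quick numerical sanity check of the part-1 construction (a minimal sketch: the particular $f$, $\lambda$, the direction $\dot{x} = 1$, and the step size below are my illustrative choices, not from the original posts):

```python
import math

# Part-1 construction, sketched: a vector field whose trajectories are
# attracted to the graph of f.  (f, lam, and dx/dt = 1 are illustrative.)
f = lambda x: math.sin(x)    # target graph y = f(x)
df = lambda x: math.cos(x)   # f'
lam = 2.0                    # decay rate of the residual r = y - f(x)

# Enforcing dr/dt = -lam * r with dx/dt = 1 gives the ODE for y:
#   dr/dt = dy/dt - f'(x) * dx/dt  =>  dy/dt = f'(x) - lam * (y - f(x))
def field(x, y):
    return 1.0, df(x) - lam * (y - f(x))

x, y, dt = 0.0, 3.0, 1e-3    # start far off the graph: r(0) = 3
for _ in range(10_000):      # forward Euler up to t = 10
    vx, vy = field(x, y)
    x, y = x + dt * vx, y + dt * vy

print(abs(y - f(x)))         # the residual is now small: the trajectory hugs the graph
```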
Part 2: we applied some of these ideas to optimization. Starting from a smooth $f: \mathbb{R}^n \to \mathbb{R}$ for which we want to find a minimum, we created various vector fields such that the trajectories following them converge to the stationary points of $f$. We used a similar construction, setting $r = y - f(x)$ and getting an ODE for $y$ from $\dot{r} = -\lambda r$ and $\dot{x} = -\nabla f(x)$.
We also discovered that we have some freedom in picking $\dot{x}$, i.e., for any well-behaved $g$, setting $\dot{x} = -\nabla f(x) + r \, g(x, y)$ will still give us convergent dynamics, as long as we are not completely cancelling $-\nabla f(x)$ close to the stationary points. This opens up new possibilities: e.g., $g$ can also be a stochastic function.
Follow-up questions and thoughts
It is fun watching the trajectories skip over some of the basins of attraction, but can we actually get more interesting behaviors out of picking the right $g$?
We assumed $\dot{r} = -\lambda r$, and so no trajectory can have a non-trivial periodic orbit. If a trajectory did, the residual would also have to be periodic, and that is impossible, as $e^{-\lambda t}$ is not a periodic function. The only way for periodicity to hold is if $r(t) = 0$ for all $t$. However, on the attracting manifold $r = 0$ we are following GD, which also doesn't have any non-trivial periodic orbits.
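Spelling the argument out, since the residual ODE solves in closed form, periodicity with any period $T > 0$ forces the residual to vanish:

$$r(t) = r(0)\,e^{-\lambda t}, \qquad r(t + T) = r(t)\ \ \forall t \;\Longrightarrow\; r(0)\,e^{-\lambda T} = r(0) \;\Longrightarrow\; r(0) = 0 \quad (\text{since } \lambda, T > 0).$$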
So, regardless of the choice of $g$ (assuming it is well-behaved), these methods cannot produce sustained exploration. There is an exponential clock, and once it rings, exploration is over: we are back to following gradient descent. We can give more time for exploration by slowing the decay, e.g., by picking a smaller $\lambda$.
We don't have to match $f$.
In part 2, we had the residual be $r = y - f(x)$ and $\dot{r} = -\lambda r$ (or, equivalently, had the vector field satisfy our transport PDE). However, what if instead of targeting the graph of $f$, we target the gradient flow directly?
To do this, we will work on the space $\mathbb{R}^{2n}$ where $(x, v) \in \mathbb{R}^n \times \mathbb{R}^n$ and $\dot{x} = v$. We set $r = v + \nabla f(x)$, i.e., we are getting trajectories to match the gradient flow, $\dot{x} = -\nabla f(x)$. Now

$$\dot{r} = \dot{v} + \nabla^2 f(x)\,\dot{x}.$$
If we now enforce $\dot{r} = -\lambda r$ (and do a few calculations) we get:

$$\ddot{x} + \nabla^2 f(x)\,\dot{x} + \lambda\,\dot{x} + \lambda\,\nabla f(x) = 0.$$
So the same residual-decay idea produces second-order optimization dynamics if we focus on gradient flow.
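As a numerical sanity check (the quadratic $f$, $\lambda$, the initial condition, and the semi-implicit Euler scheme are illustrative choices of mine), we can integrate this second-order ODE as a first-order system in $(x, v)$ and watch both the gradient and the residual $r = v + \nabla f(x)$ vanish:

```python
# Integrate  x'' + Hf(x) x' + lam x' + lam grad_f(x) = 0  as a first-order
# system in (x, v), with v = x'.  (f, lam, x0, scheme are illustrative.)
f = lambda x: 0.5 * (x - 1.0) ** 2   # minimum at x = 1
df = lambda x: x - 1.0               # gradient
d2f = lambda x: 1.0                  # Hessian (a scalar in 1D)
lam = 1.0

x, v, dt = -2.0, 0.0, 1e-3
for _ in range(30_000):              # integrate up to t = 30
    a = -(d2f(x) + lam) * v - lam * df(x)   # x'' solved from the ODE
    v += dt * a                      # semi-implicit Euler: update v first...
    x += dt * v                      # ...then x with the new v

r = v + df(x)                        # residual targeting the gradient flow
print(x, r)                          # x is near 1 and r is near 0
```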
I was wondering if this is a known second-order ODE. It looks like a special case of the form studied by Alvarez et al.1:

$$\ddot{x} + \alpha\,\dot{x} + \beta\,\nabla^2 \Phi(x)\,\dot{x} + \nabla \Phi(x) = 0.$$
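Matching the two term by term (a quick check, with $\alpha$, $\beta$, $\Phi$ denoting the coefficients and potential in their system $\ddot{x} + \alpha\,\dot{x} + \beta\,\nabla^2\Phi(x)\,\dot{x} + \nabla\Phi(x) = 0$): take

$$\alpha = \lambda, \qquad \beta = \frac{1}{\lambda}, \qquad \Phi = \lambda f.$$

Then $\beta\,\nabla^2\Phi = \nabla^2 f$ and $\nabla\Phi = \lambda\,\nabla f$, which recovers our ODE exactly.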
There's potential to recover further ODEs from the literature (e.g., with damping) by allowing time-dependence in $\lambda$ and setting $\dot{r} = -\lambda(t)\,r$. I leave that for future work.
TL;DR: All this came out of picking a residual form and then attaching a decay assumption, $\dot{r} = -\lambda r$.
Footnotes
Alvarez, F., Attouch, H., Bolte, J. and Redont, P., 2002. A second-order gradient-like dissipative dynamical system with Hessian-driven damping: Application to optimization and mechanics. Journal de mathématiques pures et appliquées, 81(8), pp.747-779.↩