Scale and Direction: Understanding Homogeneous Functions
TL;DR: This post collects useful facts about homogeneous functions—functions satisfying $f(\alpha x) = \alpha^r f(x)$ for all $\alpha > 0$. Key insight: they decompose into spherical behavior plus radial scaling. I also connect this to two references in ML.
A function is (positively) r-homogeneous if $f(\alpha x) = \alpha^r f(x)$ for all $x$ and $\alpha > 0$, e.g., $f(x) = \|x\|^2$ is 2-homogeneous. We will use $\mathcal{H}_r$ to denote such functions from $\mathbb{R}^n$ to $\mathbb{R}$ and $\mathcal{H}_r^m$ for functions from $\mathbb{R}^n$ to $\mathbb{R}^m$. We will be using $\mathcal{H}_r$ when context is sufficient and $\mathcal{H}$ when we are talking about homogeneous functions in general.
Supposing we work in a normed space with norm $\|\cdot\|$, a fun little substitution is $\alpha = 1/\|x\|$ (for $x \neq 0$), which gives

$$f(x) = \|x\|^r \, f\!\left(\frac{x}{\|x\|}\right).$$

In other words, an $\mathcal{H}_r$ function can do interesting stuff on the unit sphere, and then we apply a scale term $\|x\|^r$ that is independent of the direction $x/\|x\|$ to get the value at $x$. Especially if $r = 0$ we get $f(x) = f(x/\|x\|)$, i.e., $f$ is constant on rays.
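A minimal numerical sketch of this decomposition, using numpy and a toy 2-homogeneous function of my choosing:

```python
import numpy as np

# Check f(x) = ||x||^r * f(x / ||x||) for the toy example f(x) = sum(x_i^2),
# which is 2-homogeneous (r = 2).
def f(x):
    return np.sum(x ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=5)
r = 2

norm = np.linalg.norm(x)
direction = x / norm  # point on the unit sphere

# value at x = (scale term) * (value on the sphere)
assert np.isclose(f(x), norm ** r * f(direction))
```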
If we are on $\mathbb{R}^n$, then $f \in \mathcal{H}_r$ with $r \neq 0$ means $f(0) = 0$, since $f(0) = f(\alpha \cdot 0) = \alpha^r f(0)$ for all $\alpha > 0$. There's also Euler's theorem, which gives some implications about the derivatives of $f$: for differentiable $f \in \mathcal{H}_r$, $\langle \nabla f(x), x \rangle = r f(x)$.
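Euler's identity is easy to sanity-check numerically. A sketch with a finite-difference gradient on a toy 3-homogeneous function (both the function and the step size are my choices):

```python
import numpy as np

# Verify Euler's identity <grad f(x), x> = r * f(x) for f(x) = ||x||^3 (r = 3),
# using a central finite-difference approximation of the gradient.
def f(x):
    return np.linalg.norm(x) ** 3

def num_grad(f, x, eps=1e-6):
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

x = np.array([1.0, -2.0, 0.5])
assert np.isclose(x @ num_grad(f, x), 3 * f(x), rtol=1e-4)
```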
Another interesting property: if $f \in \mathcal{H}_r$, $g \in \mathcal{H}_s$ with $r > s$, and $f$ and $g$ are continuous with $f$ nonvanishing away from the origin, then

$$\lim_{\|x\| \to \infty} \frac{f(x) + g(x)}{f(x)} = 1.$$

There are various ways to show this, e.g., as $\|x\| \to \infty$, we have

$$f(x) + g(x) = \|x\|^r f\!\left(\frac{x}{\|x\|}\right) + \|x\|^s g\!\left(\frac{x}{\|x\|}\right) \approx \|x\|^r f\!\left(\frac{x}{\|x\|}\right)$$

if $\|x\|$ is large, and continuity does the rest: on the compact unit sphere, $g$ is bounded and $f$ is bounded away from zero.
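A quick sketch of the higher-order term dominating, with toy choices $f(x) = \|x\|^3 \in \mathcal{H}_3$ and $g(x) = \sum_i x_i \in \mathcal{H}_1$:

```python
import numpy as np

# As ||x|| grows, (f(x) + g(x)) / f(x) -> 1 because f has higher degree.
f = lambda x: np.linalg.norm(x) ** 3   # in H_3
g = lambda x: np.sum(x)                # in H_1

x = np.array([1.0, 2.0, -0.5])
for alpha in [1.0, 10.0, 100.0, 1000.0]:
    xa = alpha * x
    ratio = (f(xa) + g(xa)) / f(xa)
    print(alpha, ratio)  # ratio approaches 1 as alpha grows
```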
As building blocks
One may want to construct more complex functions by using $\mathcal{H}$ functions as building blocks. There are some ways to do this:
Composition: If $f \in \mathcal{H}_r$ and we get some generic $g \in \mathcal{H}_s^m$ (with compatible dimensions), then $f(g(\alpha x)) = f(\alpha^s g(x)) = \alpha^{rs} f(g(x))$, so $f \circ g \in \mathcal{H}_{rs}$. However, for the second equality we need $\alpha^s > 0$, which holds since $\alpha > 0$.
Multiplication1: If $f \in \mathcal{H}_r$ and $g \in \mathcal{H}_s$, then $fg \in \mathcal{H}_{r+s}$.
Activations and Linear Maps: If $W$ is some matrix, $\alpha$ is a positive scalar, and $\sigma$ is ReLU, then:

$$\sigma(W(\alpha x)) = \sigma(\alpha W x) = \alpha \, \sigma(W x),$$

and so we have shown that $x \mapsto \sigma(Wx)$ is positively 1-homogeneous. Accordingly, and as long as we use appropriate activation functions, all NNs without bias2 are $\mathcal{H}_1$.
Averages: All sorts of averages are also $\mathcal{H}_1$. For example, if $x \in \mathbb{R}^n$, the arithmetic mean satisfies $\frac{1}{n}\sum_i \alpha x_i = \alpha \cdot \frac{1}{n}\sum_i x_i$.
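The building blocks above can be sanity-checked on random inputs. A minimal sketch, with the specific $f$, $g$, and matrix shapes being my toy choices:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4)
alpha = 2.7

f = lambda z: np.sum(z ** 2)      # in H_2
g = lambda z: np.linalg.norm(z)   # in H_1

# Multiplication: f * g is in H_{2+1} = H_3
assert np.isclose(f(alpha * x) * g(alpha * x), alpha ** 3 * f(x) * g(x))

# Bias-free ReLU layer: x -> relu(W x) is in H_1
W = rng.normal(size=(3, 4))
relu = lambda z: np.maximum(z, 0.0)
assert np.allclose(relu(W @ (alpha * x)), alpha * relu(W @ x))

# Arithmetic mean is in H_1
assert np.isclose(np.mean(alpha * x), alpha * np.mean(x))
```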
These building blocks appear throughout deep learning. As a practical example:
Normalization and scale separation: Batch normalization provides a practical example of scale-direction separation. By normalizing inputs to unit variance (after centering), it removes scale information—similar to our decomposition, where scale ($\|x\|$) and direction ($x/\|x\|$) are separated. The stabilization term $\epsilon$ and the learnable parameters mean BatchNorm is only approximately homogeneous, but it demonstrates how normalization relates to homogeneity in practice.
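A sketch of this approximate scale-invariance, using a bare-bones normalization step (no learnable parameters) that I wrote for illustration:

```python
import numpy as np

# After centering and variance normalization, the output is approximately
# invariant to rescaling the input batch (approximately 0-homogeneous);
# the stabilization term eps is what makes it only approximate.
def batchnorm(x, eps=1e-5):
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(2)
x = rng.normal(size=(32, 8))

# Scaling the whole batch by 100 barely changes the normalized output.
assert np.allclose(batchnorm(x), batchnorm(100.0 * x), atol=1e-3)
```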
Related work
I haven’t done a particularly deep dive into the references for this, but one work that I enjoyed reading is from Merrill, W., et al.3. Among other things, the authors look into the approximate homogeneity of transformers with respect to their parameters $\theta$, i.e., whether $f(\alpha\theta; x) \approx \alpha^k f(\theta; x)$. Transformers without bias terms are shown to be approximately homogeneous.
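To make "homogeneity in the parameters" concrete, here is a sketch with a hypothetical toy bias-free ReLU MLP (not the transformer construction from the paper): scaling every weight matrix by $\alpha$ scales the output by $\alpha^L$, where $L$ is the number of layers.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

# Toy bias-free MLP: each layer is 1-homogeneous in its own weight matrix,
# so the full network is L-homogeneous in the stacked parameters.
def mlp(weights, x):
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)
    return weights[-1] @ h

rng = np.random.default_rng(3)
weights = [rng.normal(size=(8, 5)),
           rng.normal(size=(8, 8)),
           rng.normal(size=(1, 8))]
x = rng.normal(size=5)
alpha, L = 1.9, len(weights)

scaled = [alpha * W for W in weights]
assert np.allclose(mlp(scaled, x), alpha ** L * mlp(weights, x))
```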
Fun fact: we cannot define a group over all $\mathcal{H}$ functions with multiplication as the operation. That’s because not all multiplicative inverses are well-defined for elements of $\mathcal{H}$, e.g., functions that vanish somewhere have no multiplicative inverse.↩
A further study of those appears here: Ji, Z. and Telgarsky, M., 2020. Directional convergence and alignment in deep learning. Advances in Neural Information Processing Systems, 33, pp.17176-17186.↩
Merrill, W., Ramanujan, V., Goldberg, Y., Schwartz, R. and Smith, N.A., 2021, November. Effects of parameter norm growth during transformer training: Inductive bias from gradient descent. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (pp. 1766-1781).↩