
Scale and Direction: Understanding Homogeneous Functions

TL;DR: This post collects useful facts about homogeneous functions, i.e., functions satisfying $f(ax) = a^r f(x)$. Key insight: they decompose into spherical behavior times radial scaling. I also connect them to two references in ML.

A function $f$ is (positively) $r$-homogeneous if $f(ax) = a^r f(x)$ for all $a > 0$ and $x \in X$, e.g., $X = \mathbb{R}^n$. We will use $H_r(X, Y)$ to denote such functions from $X$ to $Y$ and $H_r(X)$ for functions from $X$ to $\mathbb{R}$. We will write $H_r$ when context is sufficient and $H$ when we are talking about homogeneous functions in general.

Supposing we work in a normed space with norm $\|\cdot\|$, a fun little substitution is $x = \|x\| \cdot x/\|x\|$, which gives

$$f(x) = \|x\|^r \, f(x/\|x\|).$$

In other words, an $H_r$ function can do interesting stuff on the unit sphere, and then we apply a scale term that is independent of $f$ to get the value at $x$. In particular, if $f \in H_0$ we get $f(x) = f(x/\|x\|)$, i.e., $f$ is constant on rays.
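As a quick sanity check of this decomposition, here is a minimal Python sketch with a toy 2-homogeneous function (my own choice of $f$, just for illustration):

```python
import math

# A toy 2-homogeneous function: f(ax) = a^2 f(x).
def f(x):
    return sum(v * v for v in x)

def norm(x):
    return math.sqrt(sum(v * v for v in x))

x = [3.0, 4.0]
r = 2
# Radial-spherical decomposition: f(x) = ||x||^r * f(x / ||x||).
lhs = f(x)
rhs = norm(x) ** r * f([v / norm(x) for v in x])
print(lhs, rhs)  # both 25.0
```

The right-hand side only evaluates $f$ on the unit sphere; the scale factor $\|x\|^r$ is the same no matter which $f \in H_2$ we pick.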

If we are on $\mathbb{R}_{>0}$, then $f \in H_r$ means $f(x) = kx^r$ for some constant $k$. There is also Euler's theorem, which says that a differentiable $f \in H_r$ satisfies $\langle x, \nabla f(x) \rangle = r f(x)$.
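Euler's relation is easy to verify numerically; below is a sketch using central finite differences on a toy 3-homogeneous polynomial (both the function and the step size are illustrative choices):

```python
# Numerical check of Euler's theorem: for differentiable f in H_r,
# <x, grad f(x)> = r * f(x).
def f(x):
    # 3-homogeneous: f(ax) = a^3 f(x)
    return x[0] ** 3 + x[0] * x[1] ** 2

def grad(f, x, h=1e-6):
    # central finite differences
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

x = [1.5, -2.0]
lhs = sum(xi * gi for xi, gi in zip(x, grad(f, x)))
print(lhs, 3 * f(x))  # approximately equal
```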

Another interesting property: if $x, y \in X$, $a > 0$, and $f \in H_0$ is continuous, then

$$\lim_{a \to \infty} f(ax + y) = f(x).$$

There are various ways to show this, e.g., as $f \in H_0$, we have

$$f(ax + y) = f(x + y/a) \to f(x)$$

as $a$ grows, and continuity does the rest.
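Here is a small numerical illustration of the limit, using a continuous $H_0$ function of my choosing (the first coordinate of $x/\|x\|$):

```python
import math

# f in H_0: constant on rays, e.g. the first coordinate of x / ||x||.
def f(x):
    n = math.sqrt(sum(v * v for v in x))
    return x[0] / n

x = [1.0, 0.0]
y = [5.0, -3.0]
for a in (1.0, 10.0, 1000.0):
    z = [a * xi + yi for xi, yi in zip(x, y)]
    print(a, f(z))  # approaches f(x) = 1.0 as a grows
```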

As building blocks

One may want to construct more complex functions using members of $H$ as building blocks. Here are some ways to do this:

Composition: If $f \in H_0$ and $g$ is some generic function, then $g \circ f(ax) = g(f(ax)) = g(f(x))$, so $g \circ f \in H_0$. However, for $f \circ g(ax) = f \circ g(x)$ we need $g \in H_1 \cup H_0$.

Multiplication¹: If $f \in H_r$ and $g \in H_k$, then $f \cdot g \in H_{r+k}$.
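Both facts can be checked on toy examples; the functions below are hypothetical picks for illustration (an $H_0$ direction function, an arbitrary non-homogeneous $g$, and an $H_2 \times H_1$ product):

```python
import math

def norm(x):
    return math.sqrt(sum(v * v for v in x))

# f0 in H_0 (depends only on direction), g generic: g o f0 stays in H_0.
f0 = lambda x: x[0] / norm(x)
g = lambda t: math.exp(t)  # arbitrary, not homogeneous itself
x = [2.0, 1.0]
ax = [3.0 * v for v in x]
print(g(f0(ax)), g(f0(x)))  # equal: g o f0 is in H_0

# Products add degrees: f2 in H_2 times h1 in H_1 gives H_3.
f2 = lambda x: x[0] ** 2 + x[1] ** 2  # H_2
h1 = lambda x: x[0] + x[1]            # H_1
prod = lambda x: f2(x) * h1(x)
print(prod(ax) / prod(x))  # 3^3 = 27
```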

Activations and Linear Maps: If $A$ is some matrix, $c$ is a positive scalar, and $\sigma$ is ReLU, then for $f(x) = \sigma(Ax)$:

$$f(cx) = \sigma(A(cx)) = \sigma(cAx) = c\,\sigma(Ax) = c f(x),$$

where the last step uses that ReLU commutes with positive scaling. So we have shown that $f$ is positively 1-homogeneous. Accordingly, and as long as we use appropriate activation functions, all NNs without bias² are $H_1$.
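A minimal sketch of this for a bias-free two-layer ReLU network (weights are arbitrary illustrative values):

```python
# A tiny bias-free two-layer ReLU network; with no bias terms every
# layer is positively 1-homogeneous, so the whole network is H_1.
def relu(v):
    return [max(0.0, t) for t in v]

def matvec(A, x):
    return [sum(a * t for a, t in zip(row, x)) for row in A]

A1 = [[1.0, -2.0], [0.5, 3.0]]
A2 = [[2.0, -1.0]]

def net(x):
    return matvec(A2, relu(matvec(A1, x)))[0]

x = [1.0, 2.0]
c = 5.0
print(net([c * t for t in x]), c * net(x))  # equal: net(cx) = c net(x)
```

Adding a bias anywhere breaks the identity, which is why the bias-free assumption matters below.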

Averages: All sorts of averages are also $H_1$. For example, if $c > 0$, the arithmetic mean satisfies $g(cx_1, \ldots, cx_n) = c(x_1 + \cdots + x_n)/n = c\,g(x_1, \ldots, x_n)$.

These building blocks appear throughout deep learning. As a practical example:

Normalization and scale separation: Batch normalization provides a practical example of scale-direction separation. By normalizing inputs to unit variance (after centering), it removes scale information, similar to our decomposition $f(x) = \|x\|^r f(x/\|x\|)$, where scale ($\|x\|^r$) and direction ($f(x/\|x\|)$) are separated. The $\epsilon$ stabilization term and learnable parameters mean BatchNorm is only approximately homogeneous, but it demonstrates how normalization relates to homogeneity in practice.
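To make the approximate scale invariance concrete, here is a stripped-down sketch of the normalization step only (1-D batch, no learnable scale/shift; a simplification, not the full BatchNorm):

```python
import math

# Minimal batch normalization over a 1-D batch: center, then divide by
# the standard deviation (eps for numerical stability). Scaling the
# whole batch by c > 0 leaves the output almost unchanged: the scale
# is stripped away and only "direction" survives.
def batchnorm(xs, eps=1e-5):
    mu = sum(xs) / len(xs)
    var = sum((t - mu) ** 2 for t in xs) / len(xs)
    return [(t - mu) / math.sqrt(var + eps) for t in xs]

xs = [1.0, 2.0, 3.0, 4.0]
scaled = batchnorm([100.0 * t for t in xs])
print(batchnorm(xs))
print(scaled)  # nearly identical: approximately H_0 in the batch
```

The invariance is only approximate because of `eps`; for very small inputs the two outputs drift apart.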

Related work

I haven’t done a particularly deep dive in the references for this, but one work that I enjoyed reading is from Merrill, W., et al.³ Among other things, the authors look into the approximate homogeneity of transformers with respect to their parameters, i.e., $f(x; c\theta) \approx c^k f(x; \theta)$. Transformers without bias terms are shown to be approximately $H_1$.

  1. Fun fact: we cannot define a group over all $H$ functions with multiplication as the operation. That’s because not all multiplicative inverses are well-defined for elements of $H$, e.g., a function that vanishes somewhere has no multiplicative inverse.

  2. A further study of those appears here: Ji, Z. and Telgarsky, M., 2020. Directional convergence and alignment in deep learning. Advances in Neural Information Processing Systems, 33, pp. 17176-17186.

  3. Merrill, W., Ramanujan, V., Goldberg, Y., Schwartz, R. and Smith, N.A., 2021, November. Effects of parameter norm growth during transformer training: Inductive bias from gradient descent. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (pp. 1766-1781).