Fun with stochastic activations

This post is inspired by a paper written by Maria Lomeli and coauthors at FAIR, the École Normale Supérieure Paris-Saclay, and Université Paris Cité; see Stochastic Activation Functions.

Gist of paper

Similar to how dropout, LayerDrop, etc., perturb parts of a feed-forward network during training to increase robustness, stochastic activations switch between different activation functions, e.g., ReLU and SiLU. This forces the network to be robust to using either activation. SiLU lets more gradient information pass through, so convergence looks better, whereas ReLU induces sparsity, which can be exploited for faster inference at test time.
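As a minimal sketch of the switching idea (the module name, the coin-flip mechanism, and the eval-time choice of ReLU are my assumptions, not the paper's exact recipe):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StochActivation(nn.Module):
    """Hypothetical sketch: per-forward-pass coin flip between ReLU and SiLU
    while training; a fixed choice (ReLU here, for sparsity) at eval time."""

    def __init__(self, p_relu: float = 0.5):
        super().__init__()
        self.p_relu = p_relu

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            # Sample one activation for the whole forward pass.
            use_relu = torch.rand(()) < self.p_relu
            return F.relu(x) if use_relu else F.silu(x)
        # Deterministic, sparse activations at test time.
        return F.relu(x)
```

The coin flip is per forward pass rather than per unit; finer-grained sampling is an obvious variation.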

The authors focus on LLM tuning, and specifically the feed-forward networks in the LLM. Two strategies are considered:

For reference, ReLU is defined as max{0, x}, whereas SiLU is x·σ(x), where σ(x) = 1/(1 + exp(−x)) is the logistic sigmoid.
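These definitions can be checked numerically against PyTorch's built-ins:

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3.0, 3.0, steps=7)

# ReLU: max{0, x}
relu_manual = torch.clamp(x, min=0.0)
# SiLU: x * sigmoid(x), with sigmoid(x) = 1 / (1 + exp(-x))
silu_manual = x * torch.sigmoid(x)

assert torch.allclose(relu_manual, F.relu(x))
assert torch.allclose(silu_manual, F.silu(x))
```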

Thoughts

There are a lot of nice questions here!

I don't have the capacity right now to investigate those questions for LLMs, so instead I set up a simple classification problem with a small FFN.

I'll compare the following activations, using each one as-is for both training and testing.

I also added two more activations into the mix!

learnable_stoch: This doesn't switch between the two per se; instead we take a continuous mixture f(x) = α(x)·ReLU(x) + (1 − α(x))·SiLU(x), with α(x) ∈ [0, 1] represented by a small neural net. This probably won't give us any sparsity; it only adapts the behavior of the activation for x < 0 (SiLU and ReLU nearly coincide for large positive x).
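A sketch of what such a mixture could look like (the gating-net architecture and all names are my own choices, not the post's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableStoch(nn.Module):
    """Continuous mixture f(x) = alpha(x)*ReLU(x) + (1-alpha(x))*SiLU(x),
    with alpha(x) in [0,1] produced by a small elementwise gating net."""

    def __init__(self, hidden: int = 8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, 1), nn.Sigmoid()
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The sigmoid keeps alpha(x) in [0, 1] for every element.
        alpha = self.gate(x.unsqueeze(-1)).squeeze(-1)
        return alpha * F.relu(x) + (1.0 - alpha) * F.silu(x)
```

Because the output is a convex combination of two (almost everywhere) nonzero functions, exact zeros are rare, which is why no sparsity is expected.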

gumbel_stoch: This is like stoch, i.e., we sample one of the two activations, but instead of a fixed α we learn an input-dependent α_g(x) via the Gumbel-softmax trick. This way we still get trackable sparsity, because each unit still uses either ReLU or SiLU.
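A sketch of such a gate using PyTorch's `F.gumbel_softmax` with hard (straight-through) samples; the module and its internals are my assumptions, not the post's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelStoch(nn.Module):
    """Gumbel-softmax gate over {ReLU, SiLU}: hard one-hot samples in the
    forward pass, differentiable via the straight-through estimator."""

    def __init__(self, hidden: int = 8, tau: float = 1.0):
        super().__init__()
        self.tau = tau
        self.logit_net = nn.Sequential(
            nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, 2)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.logit_net(x.unsqueeze(-1))               # (..., 2)
        # hard=True returns one-hot samples, so each unit commits to
        # exactly one activation -- sparsity stays measurable.
        g = F.gumbel_softmax(logits, tau=self.tau, hard=True)  # (..., 2)
        acts = torch.stack([F.relu(x), F.silu(x)], dim=-1)     # (..., 2)
        return (g * acts).sum(dim=-1)
```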

Setup

You can find the code here. To reproduce my run, use

uv run python stoch_activation.py --epochs 100 --plots --lr 0.001 --n_classes 2 --n_features 30 --n_layers 4 --n_samples 5000 --n_runs 10 --n_workers 10

Data

I generate synthetic classification data with sklearn.make_classification.

I split 80/20 into train/test with a fixed random seed, and wrap tensors in PyTorch DataLoaders (batch size 128, shuffling on train).
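A sketch of that data pipeline (argument values taken from the CLI flags above; the seed is an arbitrary placeholder):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, matching the CLI flags.
X, y = make_classification(
    n_samples=5000, n_features=30, n_classes=2, random_state=0
)
# 80/20 train/test split with a fixed seed.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

train_ds = TensorDataset(torch.tensor(X_tr, dtype=torch.float32), torch.tensor(y_tr))
test_ds = TensorDataset(torch.tensor(X_te, dtype=torch.float32), torch.tensor(y_te))
# Batch size 128; shuffle only the training loader.
train_loader = DataLoader(train_ds, batch_size=128, shuffle=True)
test_loader = DataLoader(test_ds, batch_size=128, shuffle=False)
```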

Model

A fully-connected network with a narrow bottleneck:

L = n_layers, b = min(4, n_features):

n_features --W1--> b --W2--> ... --WL--> b --Wout--> n_classes

After each hidden linear layer we apply a pluggable activation. Training uses cross-entropy on logits.

I use 4 layers for the experiment here, just to give the activations more chances to do their thing.
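The architecture above can be sketched in PyTorch (function and variable names are mine, not the post's code; any activation module can be plugged in via `act_factory`):

```python
import torch
import torch.nn as nn

def make_ffn(n_features=30, n_classes=2, n_layers=4, act_factory=nn.ReLU):
    """Bottleneck FFN: n_features -> b -> ... -> b -> n_classes,
    with b = min(4, n_features) and an activation after each hidden layer."""
    b = min(4, n_features)
    layers, in_dim = [], n_features
    for _ in range(n_layers):
        layers += [nn.Linear(in_dim, b), act_factory()]
        in_dim = b
    layers.append(nn.Linear(b, n_classes))  # W_out produces the logits
    return nn.Sequential(*layers)

model = make_ffn()
logits = model(torch.randn(8, 30))
# Cross-entropy on logits, as in the training setup described above.
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 2, (8,)))
```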

Training and Metrics

These are reported per epoch:

Results

All metrics are reported as averages over the ten runs, with standard-deviation bands. This is still a small experiment, but we can see some interesting behaviors.

train_loss

Training loss is unremarkable, but a rough ordering is visible, and the picture for the validation loss is similar.

val_loss

Sparsity over time on the validation set. It aligns with expectations:

sparsity
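As a side note on how such a sparsity metric can be computed (the function here is an illustrative definition of mine, not necessarily the one used in the experiment): count the fraction of exactly-zero entries after the activation.

```python
import torch
import torch.nn.functional as F

def activation_sparsity(x: torch.Tensor) -> float:
    """Fraction of exactly-zero entries. ReLU zeroes every negative input;
    SiLU outputs an exact zero only at x == 0, so it is almost never sparse."""
    return (x == 0).float().mean().item()

x = torch.randn(10_000)
relu_sparsity = activation_sparsity(F.relu(x))  # roughly 0.5 for symmetric inputs
silu_sparsity = activation_sparsity(F.silu(x))  # essentially 0
```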

We can complete the picture by also checking the distribution of the "gates" for learnable_stoch and gumbel_stoch.

The learnable stochastic gate quickly converges to an even 0.5–0.5 mix of ReLU and SiLU. Since the gating network only really has something to decide for x < 0 (the two activations nearly coincide for large positive x), this makes sense.

gate_weights_learnable_stoch

The dynamics of gumbel_stoch are a bit more interesting! There is a short-term preference for SiLU, and then we equilibrate. I guess that if we had to pick just one of the two, choosing SiLU would make the most difference at the beginning of the loss descent.

gate_weights_gumbel_stoch

Other behaviors

Surprisingly, if we make the model large enough to overfit, the dynamics of the learned activations change, and a strong preference for ReLU emerges. Here's what happens if we increase the bottleneck dimension from 4 to 16.

gate_weights_gumbel_stoch gate_weights_learnable_stoch

val_loss