

Compact transformers can't learn all sequences.

TL;DR This paper studies a subset of possible transformer models called “compact transformers” (CT). Those are transformers that have compact input embeddings and compact positional encodings in every layer. The authors prove that such transformers cannot learn to predict every possible sequence with confidence – instead the sequence space is separated into equivalence classes and a CT can only learn one representative from each – a consequence of (uniform) continuity. In this post, I write some notes on this work, from a slightly more abstract point of view.

Notation

First, some notation, borrowed from the paper:

The paper uses a relativized Hamming distance to measure the distance between partial sequences $a, b \in \Sigma^n$:

$$d_H(a,b) = \frac{\left|\{\, i \in \{1,\dots,n\} : a_i \neq b_i \,\}\right|}{n},$$

which can be extended to infinite sequences like so:

$$d_H(a,b) = \liminf_{n \to \infty} d_H(a_{1:n}, b_{1:n}).$$
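
A minimal sketch of $d_H$ on prefixes in code (the function name and the use of strings over the alphabet $\Sigma$ are my own assumptions, not from the paper):

```python
def d_hamming(a: str, b: str) -> float:
    """Relativized Hamming distance: fraction of positions where the
    length-n partial sequences a and b disagree."""
    assert len(a) == len(b) and len(a) > 0
    return sum(ai != bi for ai, bi in zip(a, b)) / len(a)

print(d_hamming("abcd", "abcx"))  # one mismatch out of four -> 0.25
```

The liminf extension to infinite sequences is then just the limiting behaviour of this quantity along prefixes.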

This paper is an analyst's dream – so many $\epsilon$'s and $\delta$'s, and statements reminiscent of ergodic theory.

One of the main results1 states:

Let $T$ be a compact decoder-only Transformer. Then for any $\epsilon > 0$, there exists a $\delta > 0$ such that for any $n$ and any sequences $a, b \in \Sigma^n$ with the same last token, if $d_H(a,b) < \delta$, then $\|T(a) - T(b)\| \le \epsilon$.

This is a uniform continuity statement! The condition that the sequences share the same last token has to do with next-token prediction (the last token's representation has a direct influence on the next-token distribution).

Eventual learnability: The authors also define “eventual learnability of a sequence” as

A decoder-only transformer $T$ eventually learns an infinite sequence $a = a_1, a_2, \dots \in \Sigma^\omega$ if there exist $\epsilon > 0$ and $n_0$ such that for all $n \ge n_0$ we have $T(a_{1:n})(a_{n+1}) \ge T(a_{1:n})(\sigma) + \epsilon$ for all $\sigma \in \Sigma \setminus \{a_{n+1}\}$.

So, we may fail to predict some initial segment with confidence, but eventually we predict the rest of the sequence confidently.

My notes

The condition on the last token can be folded into the metric; e.g., for $a, b \in \Sigma^n$ we can define

$$d^*(a,b) = \mathbb{1}[a_n \neq b_n] + d_H(a_{1:(n-1)}, b_{1:(n-1)}),$$

and it seems to me this would work fine, as it separates the contribution of the last token from the contribution of the rest (changing the last token has the biggest effect, but this is mostly an artifact of current transformer design).
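
As a quick sanity check on this definition, here is a minimal sketch of $d^*$ in code (names and the string representation are my own choices, not from the paper):

```python
def d_star(a: str, b: str) -> float:
    """d*(a,b) = 1[a_n != b_n] + d_H(a_{1:n-1}, b_{1:n-1})."""
    assert len(a) == len(b) and len(a) >= 2
    last = 1.0 if a[-1] != b[-1] else 0.0
    prefix = sum(ai != bi for ai, bi in zip(a[:-1], b[:-1])) / (len(a) - 1)
    return last + prefix

print(d_star("abc", "abd"))  # last tokens differ, prefixes agree -> 1.0
print(d_star("abc", "xbc"))  # last tokens agree, 1 of 2 prefix mismatches -> 0.5
```

A last-token mismatch alone already contributes a full unit of distance, whereas a prefix mismatch contributes at most $1/(n-1)$, which is the intended separation of the two effects.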

To abstract a bit further, let us suppose we have a sequence of functions $f_n : \Sigma^n \to \Delta(\Sigma)$ that is uniformly continuous across $n$ with respect to $d^*$: for all $\epsilon > 0$ there exists $\delta > 0$ such that for all $n$ and any $a, b \in \Sigma^n$ satisfying $d^*(a,b) < \delta$ we have $\|f_n(a) - f_n(b)\| < \epsilon$. We will call such a family a "continuous sequence predictor" (CSP).

Eventual learnability (EL) for a CSP is similar to what the authors define: a sequence $a \in \Sigma^\omega$ is eventually learnable by the CSP if there exist $\epsilon > 0$ and $n_0$ such that for all $n \ge n_0$ we have $f_n(a_{1:n})(a_{n+1}) \ge f_n(a_{1:n})(\sigma) + \epsilon$ for all $\sigma \in \Sigma \setminus \{a_{n+1}\}$.
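
To make the EL condition concrete, here is a toy sketch: a bigram-count predictor over a binary alphabet (my own construction, not from the paper, and not claimed to be a uniformly continuous CSP) together with the confidence margin from the EL definition:

```python
from collections import Counter

SIGMA = "01"

def toy_predictor(prefix: str) -> dict:
    """A stand-in for f_n: a distribution over SIGMA built from
    bigram counts conditioned on the last token of the prefix."""
    counts = Counter()
    for i in range(len(prefix) - 1):
        if prefix[i] == prefix[-1]:
            counts[prefix[i + 1]] += 1
    total = sum(counts.values())
    if total == 0:
        return {s: 1 / len(SIGMA) for s in SIGMA}  # uniform fallback
    return {s: counts[s] / total for s in SIGMA}

def el_margin(seq: str, n: int) -> float:
    """f_n(a_{1:n})(a_{n+1}) - max_sigma f_n(a_{1:n})(sigma);
    EL requires this to stay >= some epsilon > 0 for all large n."""
    dist = toy_predictor(seq[:n])
    true_next = seq[n]
    return dist[true_next] - max(dist[s] for s in SIGMA if s != true_next)

a = "01" * 50  # periodic sequence; the bigram model predicts it perfectly
print([el_margin(a, n) for n in (5, 10, 50)])  # -> [1.0, 1.0, 1.0]
```

For this periodic sequence the margin is pinned at 1, so any $\epsilon \le 1$ witnesses EL; for a sequence the predictor cannot track, the margin would dip below every fixed $\epsilon$ infinitely often.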

Two points about EL (both made by the authors):

We may also define "eventual equality" (EE): a sequence $a \in \Sigma^\omega$ is eventually equal to $b \in \Sigma^\omega$ (denoted $a \sim b$) if there exists $n_1$ such that for all $n \ge n_1$ we have $a_n = b_n$. In other words, $b$ is just a perturbation of $a$ on a finite number of elements.

Having $a \sim b$ means that $\lim_{n \to \infty} d^*(a_{1:n}, b_{1:n}) = 0$. After some number of terms we move past all of the perturbations and the last tokens always match, $a_n = b_n$. Say there are $k$ perturbations in total between the two sequences. Then, for $n$ beyond the last perturbation,

$$d^*(a_{1:n}, b_{1:n}) = d_H(a_{1:(n-1)}, b_{1:(n-1)}) = \frac{1}{n-1}\left|\{\, i \in \{1,\dots,n-1\} : a_i \neq b_i \,\}\right| = \frac{k}{n-1},$$

and so $a \sim b$ implies what we would want: as far as $d^*$ is concerned, the full sequences are the same.
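
A quick numeric check of this computation, with a hypothetical `d_star` helper (my own naming) and $k = 3$ perturbations:

```python
def d_star(a: str, b: str) -> float:
    """d*(a,b) = 1[a_n != b_n] + d_H of the length-(n-1) prefixes."""
    last = 1.0 if a[-1] != b[-1] else 0.0
    return last + sum(x != y for x, y in zip(a[:-1], b[:-1])) / (len(a) - 1)

a = "0" * 1000
b = "111" + a[3:]  # k = 3 perturbations up front, then b agrees with a
for n in (10, 100, 1000):
    print(n, d_star(a[:n], b[:n]))  # 3/(n-1): decays toward 0
```

The printed distances are exactly $3/(n-1)$, matching the formula above.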

If $a$ is eventually learnable and $b \sim a$, then $b$ is also eventually learnable. We can sketch the equivalent of Proposition 1 in the paper.

Sketch: Suppose $a$ is eventually learnable and $a \sim b$. Then $\lim_{n \to \infty} d^*(a_{1:n}, b_{1:n}) = 0$, which implies $\|f_n(a_{1:n}) - f_n(b_{1:n})\| \to 0$ (due to uniform continuity).

Why is $b$ eventually learnable? The core argument is that since $a \sim b$, the final terms eventually agree, say for all $n$ greater than some $n_1$. From uniform continuity, given a fixed $\epsilon_1 > 0$, we can find $n_2 \ge n_1$ such that for all $n \ge n_2$,

$$f_n(b_{1:n})(b_{n+1}) \ge f_n(a_{1:n})(a_{n+1}) - \epsilon_1,$$

(remembering that $a_{n+1} = b_{n+1}$). However, $a$ is EL, and so there exist $\epsilon > 0$ and $n_3$ such that $n \ge n_3$ implies

$$f_n(b_{1:n})(b_{n+1}) \ge f_n(a_{1:n})(a_{n+1}) - \epsilon_1 \ge f_n(a_{1:n})(\sigma) + \epsilon - \epsilon_1,$$

and now it is just a matter of applying uniform continuity again, this time with $\sigma \neq b_{n+1}$, to get

$$f_n(a_{1:n})(\sigma) \ge f_n(b_{1:n})(\sigma) - \epsilon_1,$$

and the rest is just tidying up the $\epsilon$'s.
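
For completeness, here is one way the tidying can go (my own bookkeeping; the paper may organize the constants differently). Chaining the inequalities:

```latex
f_n(b_{1:n})(b_{n+1})
  \ge f_n(a_{1:n})(a_{n+1}) - \epsilon_1             % uniform continuity
  \ge f_n(a_{1:n})(\sigma) + \epsilon - \epsilon_1   % EL of a
  \ge f_n(b_{1:n})(\sigma) + \epsilon - 2\epsilon_1  % uniform continuity again
```

Choosing $\epsilon_1 = \epsilon/3$ leaves a margin of $\epsilon/3 > 0$ for all $n \ge \max(n_2, n_3)$, so $b$ is EL with constant $\epsilon/3$.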

Isolation of learnable sequences

One of the fun results from the paper is that things stop working when we make an infinite number of perturbations to a sequence $a \in \Sigma^\omega$. To state this, the authors extend the Hamming metric to $\Sigma^\omega$:

$$d_H(a,b) = \liminf_{n \to \infty} d_H(a_{1:n}, b_{1:n}).$$

We can write the Isolation theorem (proved in the paper) in the CSP language.

Let $(f_n)$ be a CSP. Then, for any $a \in \Sigma^\omega$ that is EL by $(f_n)$, there exists $\delta > 0$ such that no sequence $b$ that (1) differs from $a$ at infinitely many positions and (2) satisfies $d_H(a,b) < \delta$ is EL by $(f_n)$.

What does this say? Essentially, around each eventually-learnable $a \in \Sigma^\omega$ there is a Hamming ball $B_\delta(a)$ such that any $b \in B_\delta(a)$ can be EL by the CSP only if $a \sim b$, i.e., only if $b$ differs from $a$ at finitely many positions. The poor EL sequences are isolated.

Sketch: The trick here is that if both sequences are EL, then the CSP must confidently tell the two sequences apart at the infinitely many positions where they differ. However, the CSP is also uniformly continuous, so there is a limit to how far it can pull them apart, especially since we can pick sequences that are very close in Hamming distance and still differ infinitely often (change $a$ at every $k$-th position, for example). You can see how the contradiction arises.
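
The "change $a$ at every $k$-th position" construction can be sketched numerically (a hypothetical helper on a long finite prefix, standing in for the infinite sequences): $b$ differs from $a$ at infinitely many positions in the limit, yet the prefix Hamming distance settles at $1/k$, which can be pushed below any $\delta$ by taking $k$ large.

```python
def d_hamming(a: str, b: str) -> float:
    """Relativized Hamming distance on equal-length prefixes."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

k = 100
a = "0" * 10_000                      # a long prefix of the all-zeros sequence
b = "".join("1" if (i + 1) % k == 0 else c for i, c in enumerate(a))
print(d_hamming(a, b))  # 1/k = 0.01, despite infinitely many flips in the limit
```

So for any $\delta > 0$, choosing $k > 1/\delta$ produces a $b$ inside the $\delta$-ball around $a$ that still differs from $a$ infinitely often – exactly the kind of neighbour the isolation theorem says cannot also be EL.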

In a follow-up, I’ll look at more consequences of this framework.

  1. Theorem 1 of https://arxiv.org/abs/2505.10606