Compact transformers can't learn all sequences.
TL;DR This paper studies a subset of possible transformer models called “compact transformers” (CT). Those are transformers that have compact input embeddings and compact positional encodings in every layer. The authors prove that such transformers cannot learn to predict every possible sequence with confidence – instead the sequence space is separated into equivalence classes and a CT can only learn one representative from each – a consequence of (uniform) continuity. In this post, I write some notes on this work, from a slightly more abstract point of view.
Notation
First, some notation, borrowed from the paper:
- $\Sigma$ is our finite alphabet of tokens and $\Sigma^*$ is the space of finite sequences.
- We will use $\Sigma^n$ for all sequences of length $n$, $n \geq 1$.
- $\Sigma^\infty$ is the space of all infinite sequences.
- $\Delta(\Sigma)$ is the space of distributions over $\Sigma$.
- $x_{1:n} = (x_1, \dots, x_n)$ for $x \in \Sigma^\infty$ and $n \in \mathbb{N}$.
- The norm used is $\|p - q\|_\infty = \max_{a \in \Sigma} |p(a) - q(a)|$ for $p, q \in \Delta(\Sigma)$.
The paper uses a relativized Hamming distance to measure the distance between partial sequences $x, y \in \Sigma^n$:

$$d_n(x, y) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[x_i \neq y_i],$$

which can be extended to infinite sequences $x, y \in \Sigma^\infty$ like so:

$$d(x, y) = \limsup_{n \to \infty} d_n(x_{1:n}, y_{1:n}).$$
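To make the metric concrete, here is a small Python sketch (the function name `d_n` is my own) that computes the relativized Hamming distance on prefixes; the infinite-sequence distance can only be watched on longer and longer prefixes:

```python
def d_n(x, y):
    """Relativized Hamming distance between equal-length sequences:
    the fraction of positions at which they disagree."""
    assert len(x) == len(y) > 0
    return sum(a != b for a, b in zip(x, y)) / len(x)

# d(x, y) on infinite sequences is the limsup of d_n over prefixes;
# a single perturbation washes out as the prefix grows.
x = [i % 2 for i in range(600)]
y = list(x)
y[3] = 0  # one perturbation (x[3] is 1)

print(d_n(x[:10], y[:10]))  # 0.1
print(d_n(x, y))            # 1/600
```

A single flipped token contributes $1/n$ to $d_n$, so its effect vanishes in the limsup.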
This paper is an analyst’s dream – so many $\varepsilon$–$\delta$ statements reminiscent of ergodic theory.
One of the main results[^1] states:
Let $T$ be a compact decoder-only Transformer. Then for any $\varepsilon > 0$, there exists a $\delta > 0$ such that for any $n \geq 1$ and any sequences $x, y \in \Sigma^n$ with the same last token, if $d_n(x, y) < \delta$, then $\|T(x) - T(y)\|_\infty < \varepsilon$.
This is a uniform continuity statement! The condition on the same last token has to do with next-token prediction (the last token’s representation has a direct influence on the next-token distribution).
Eventual learnability: The authors also define “eventual learnability of a sequence” as follows:

A decoder-only transformer $T$ eventually learns an infinite sequence $x \in \Sigma^\infty$ if there exist $n_0 \in \mathbb{N}$ and $\gamma > 0$ such that for all $n \geq n_0$ we have $T(x_{1:n})(x_{n+1}) \geq T(x_{1:n})(a) + \gamma$ for all $a \neq x_{n+1}$.
So, we can fail to predict some initial segment with confidence, but eventually we predict the rest of the sequence confidently.
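As a toy illustration of the “eventually confident” behavior (entirely my own construction, not from the paper), a simple count-based predictor picks out the true next token of a periodic sequence with a growing margin:

```python
from collections import Counter

ALPHABET = (0, 1)

def predict(prefix):
    """Toy predictor: a Laplace-smoothed distribution over the next
    token, from bigram counts conditioned on the last token."""
    counts = Counter()
    for a, b in zip(prefix, prefix[1:]):
        if a == prefix[-1]:
            counts[b] += 1
    total = sum(counts.values()) + len(ALPHABET)
    return {t: (counts[t] + 1) / total for t in ALPHABET}

# On the periodic sequence 0,1,0,1,... the probability assigned to the
# true next token grows with the prefix length.
x = [i % 2 for i in range(200)]
for n in (4, 20, 100):
    p = predict(x[:n])
    print(n, round(p[x[n]], 3))
```

Early prefixes give a weak margin; by $n = 100$ the true next token gets probability close to 1, matching the spirit of the definition.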
My notes
The condition on the last token can be folded into the metric, e.g., we can define for $x, y \in \Sigma^n$

$$\tilde{d}_n(x, y) = d_{n-1}(x_{1:n-1}, y_{1:n-1}) + \mathbf{1}[x_n \neq y_n],$$

and it seems to me this would work fine, as it separates the contribution of the last token from the contribution of the rest (touching the last token would have the biggest effect, but this is mostly relevant to current transformer design).
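A quick sketch of this modified metric (the names `d_rel` and `d_tilde` are my own), showing how a last-token mismatch is charged a full unit while a prefix mismatch is discounted by $1/(n-1)$:

```python
def d_rel(x, y):
    """Relativized Hamming distance: fraction of disagreeing positions."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def d_tilde(x, y):
    """Modified metric: relativized Hamming distance on the first n-1
    tokens, plus a full unit for a last-token mismatch."""
    assert len(x) == len(y) >= 2
    return d_rel(x[:-1], y[:-1]) + (x[-1] != y[-1])

# A last-token change dominates: the prefix term is 0, the indicator adds 1.
print(d_tilde([0, 1, 0, 1], [0, 1, 0, 0]))  # 1.0
# A single prefix change only contributes 1/(n-1):
print(d_tilde([0, 1, 0, 1], [0, 0, 0, 1]))  # 1/3
```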
To abstract a bit further, let us suppose we have a sequence of functions $f_n : \Sigma^n \to \Delta(\Sigma)$, all of which are uniformly continuous across $n$ with respect to $\tilde{d}_n$: for all $\varepsilon > 0$, there exists $\delta > 0$ such that for all $n \geq 1$ and any $x, y \in \Sigma^n$ satisfying $\tilde{d}_n(x, y) < \delta$ we have $\|f_n(x) - f_n(y)\|_\infty < \varepsilon$. We will call this family a “continuous sequence predictor” (CSP).
Eventual learnability (EL) for a CSP is similar to what the authors define: A sequence $x \in \Sigma^\infty$ is eventually learnable by the CSP $(f_n)$ if there exist $n_0 \in \mathbb{N}$ and $\gamma > 0$ such that for all $n \geq n_0$ we have $f_n(x_{1:n})(x_{n+1}) \geq f_n(x_{1:n})(a) + \gamma$ for all $a \neq x_{n+1}$.
Two points about EL (that are made by the authors):
- EL is tied to the CSP (or to the transformer we are interested in, etc.), so we will say that a sequence is “EL by the CSP $(f_n)$” or “EL by $T$”, as required.
- EL is about confidence in making a prediction, as can be seen by the definition.
We may also define “eventual equality” (EE). The sequence $x$ is eventually equal to $y$ (denoted $x \sim y$) if there exists $N \in \mathbb{N}$ such that for all $n \geq N$ we have $x_n = y_n$. In other words, $x$ is just a perturbation of $y$ on a finite number of elements.
Having $x \sim y$ means that $d(x, y) = 0$. After some number of terms we skip all of the perturbations and the later terms always match, $x_n = y_n$ for $n \geq N$. Let’s say that there are $K$ perturbations in total between the sequences. Then,

$$d_n(x_{1:n}, y_{1:n}) \leq \frac{K}{n} \xrightarrow[n \to \infty]{} 0,$$

and so taking the $\limsup$ gives $d(x, y) = 0$, which is what we would want. As far as $d$ is concerned, the full sequences are the same.
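Numerically, the $K/n$ bound is easy to see (a sketch with my own helper name `d_n`):

```python
def d_n(x, y):
    """Relativized Hamming distance on equal-length prefixes."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

x = [i % 3 for i in range(10_000)]
y = list(x)
flips = (2, 7, 41, 500)          # K = 4 perturbations, all finite
for i in flips:
    y[i] = (y[i] + 1) % 3

K = len(flips)
# Once n passes the last perturbation, d_n is exactly K/n -> 0.
for n in (1_000, 5_000, 10_000):
    print(n, d_n(x[:n], y[:n]), K / n)
```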
If $x$ is eventually learnable and $x \sim y$, then $y$ is also eventually learnable. We can sketch the equivalent of Proposition 1 in the paper.
Sketch: Suppose $x$ is eventually learnable and $x \sim y$. Then $\tilde{d}_n(x_{1:n}, y_{1:n}) \to 0$, which implies $\|f_n(x_{1:n}) - f_n(y_{1:n})\|_\infty \to 0$ (due to uniform continuity).
Why is $y$ eventually learnable? The core argument is that since $x \sim y$, eventually the final terms are the same, $x_n = y_n$ for all $n$ greater than some $N$. From uniform continuity, and given a fixed $\varepsilon > 0$, we can get a $\delta > 0$ such that for all $n > \max(N, K/\delta + 1)$,

$$\tilde{d}_n(x_{1:n}, y_{1:n}) = d_{n-1}(x_{1:n-1}, y_{1:n-1}) \leq \frac{K}{n-1} < \delta \implies \|f_n(x_{1:n}) - f_n(y_{1:n})\|_\infty < \varepsilon$$

(remembering that $x_n = y_n$ for $n \geq N$). However, $x$ is EL, and so there exist an $n_0$ and a $\gamma > 0$ such that $n \geq n_0$ implies

$$f_n(x_{1:n})(x_{n+1}) \geq f_n(x_{1:n})(a) + \gamma \quad \text{for all } a \neq x_{n+1},$$

and now it is just a matter of applying uniform continuity again, this time with $\varepsilon = \gamma/4$, to get

$$f_n(y_{1:n})(y_{n+1}) \geq f_n(y_{1:n})(a) + \frac{\gamma}{2} \quad \text{for all } a \neq y_{n+1},$$

and the rest is just tidying up of the constants.
Isolation of learnable sequences
One of the fun results from the paper is that things don’t work as well when we make an infinite number of perturbations to an EL sequence. To state this, the authors extend the Hamming metric to $\Sigma^\infty$:

$$d(x, y) = \limsup_{n \to \infty} d_n(x_{1:n}, y_{1:n}).$$
We can write the Isolation theorem (proved in the paper) in the CSP language.
Let $(f_n)$ be a CSP. Then, for any $x \in \Sigma^\infty$ that is EL by $(f_n)$, there exists a $\delta > 0$ such that no sequence $y \in \Sigma^\infty$ that 1) differs from $x$ at infinitely many positions, and 2) satisfies $d(x, y) < \delta$, is EL by $(f_n)$.
What does this say? Essentially there are Hamming balls $B(x, \delta)$ around elements $x$ that are eventually learnable by the CSP, and if $y$ is in $B(x, \delta)$, then it can only be EL by the CSP if it differs from $x$ at finitely many positions, i.e., if $y \sim x$. The poor EL sequences are isolated.
Sketch: The trick here is that if both sequences are EL, then the CSP needs to confidently pick apart the differences between the two sequences, and they differ at an infinite number of points. However, the CSP is also uniformly continuous, so there’s a limit to how much it can pull things apart, especially as we can pick sequences that are very close in Hamming distance and still have infinitely many differences (just change one token every $k$ terms for a large $k$, for example). You can see how the contradiction arises.
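The “close in Hamming distance but infinitely different” construction from the sketch, in code (my own illustration, with truncation standing in for the infinite sequences):

```python
def d_n(x, y):
    """Relativized Hamming distance on equal-length prefixes."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

N, k = 100_000, 100
x = [0] * N
y = [1 if i % k == 0 else 0 for i in range(N)]  # flip every k-th token

# In the infinite version, y differs from x at infinitely many
# positions, yet d_n stays pinned at 1/k = 0.01 on every prefix.
for n in (1_000, 10_000, 100_000):
    print(n, d_n(x[:n], y[:n]))
```

So for any $\delta$, choosing $k > 1/\delta$ puts $y$ inside the ball $B(x, \delta)$ while keeping infinitely many differences, which is exactly the tension the theorem exploits.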
In a follow-up, I’ll look at more consequences of this framework.
[^1]: Theorem 1 of https://arxiv.org/abs/2505.10606