

Compact transformers can't learn all sequences.

TL;DR This paper studies a subset of possible transformer models called “compact transformers” (CT). Those are transformers that have compact input embeddings and compact positional encodings in every layer. The authors prove that such transformers cannot learn to predict every possible sequence with confidence – instead the sequence space is separated into equivalence classes and a CT can only learn one representative from each – a consequence of (uniform) continuity. In this post, I write some notes on this work, from a slightly more abstract point of view.

Notation

First, some notation, borrowed from the paper:

The paper uses a relativized Hamming distance to measure the distance between partial sequences $a, b \in \Sigma^n$:

$$d_H(a,b) = \frac{\left|\{\, i \in \{1,\dots,n\} : a_i \neq b_i \,\}\right|}{n},$$

which can be extended to infinite sequences like so:

$$d_H(a,b) = \liminf_{n \to \infty} d_H(a_{1:n}, b_{1:n}).$$
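
A minimal sketch of $d_H$ on prefixes in code (the function name and the use of strings over the alphabet $\Sigma$ are my own assumptions, not from the paper):

```python
def d_hamming(a: str, b: str) -> float:
    """Relativized Hamming distance: fraction of positions where the
    length-n partial sequences a and b disagree."""
    assert len(a) == len(b) and len(a) > 0
    return sum(ai != bi for ai, bi in zip(a, b)) / len(a)

print(d_hamming("abcd", "abcx"))  # one mismatch out of four -> 0.25
```

The liminf extension to infinite sequences is then just the limiting behaviour of this quantity along prefixes.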

This paper is an analyst's dream – so many $\epsilon$'s and $\delta$'s, and statements reminiscent of ergodic theory.

One of the main results1 states:

Let $T$ be a compact decoder-only Transformer. Then for any $\epsilon > 0$, there exists a $\delta > 0$ such that for any $n$ and any sequences $a, b \in \Sigma^n$ with the same last token, if $d_H(a,b) < \delta$, then $\|T(a) - T(b)\| \le \epsilon$.

This is a uniform continuity statement! The condition that the sequences share the same last token has to do with next-token prediction (the last token's representation has a direct influence on the next-token distribution).

Eventual learnability: The authors also define “eventual learnability of a sequence” as

A decoder-only transformer $T$ eventually learns an infinite sequence $a = a_1, a_2, \dots \in \Sigma^\omega$ if there exist $\epsilon > 0$ and $n_0$ such that for all $n \ge n_0$ we have $T(a_{1:n})(a_{n+1}) \ge T(a_{1:n})(\sigma) + \epsilon$ for all $\sigma \in \Sigma \setminus \{a_{n+1}\}$.

So, we may fail to predict some initial segment with confidence, but eventually we predict the rest of the sequence confidently.

My notes

The condition on the last token can be folded into the metric; e.g., for $a, b \in \Sigma^n$ we can define

$$d^*(a,b) = \mathbb{1}[a_n \neq b_n] + d_H(a_{1:(n-1)}, b_{1:(n-1)}),$$

and it seems to me this would work fine, as it separates the contribution of the last token from the contribution of the rest (changing the last token has the biggest effect, but this is mostly an artifact of current transformer design).
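
As a quick sanity check on this definition, here is a minimal sketch of $d^*$ in code (names and the string representation are my own choices, not from the paper):

```python
def d_star(a: str, b: str) -> float:
    """d*(a,b) = 1[a_n != b_n] + d_H(a_{1:n-1}, b_{1:n-1})."""
    assert len(a) == len(b) and len(a) >= 2
    last = 1.0 if a[-1] != b[-1] else 0.0
    prefix = sum(ai != bi for ai, bi in zip(a[:-1], b[:-1])) / (len(a) - 1)
    return last + prefix

print(d_star("abc", "abd"))  # last tokens differ, prefixes agree -> 1.0
print(d_star("abc", "xbc"))  # last tokens agree, 1 of 2 prefix mismatches -> 0.5
```

A last-token mismatch alone already contributes a full unit of distance, whereas a prefix mismatch contributes at most $1/(n-1)$, which is the intended separation of the two effects.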

To abstract a bit further, let us suppose we have a sequence of functions $f_n : \Sigma^n \to \Delta(\Sigma)$ that is uniformly continuous across $n$ with respect to $d^*$: for all $\epsilon > 0$ there exists $\delta > 0$ such that for all $n$ and any $a, b \in \Sigma^n$ satisfying $d^*(a,b) < \delta$ we have $\|f_n(a) - f_n(b)\| < \epsilon$. We will call such a family a "continuous sequence predictor" (CSP).

Eventual learnability (EL) for a CSP is similar to what the authors define: a sequence $a \in \Sigma^\omega$ is eventually learnable by the CSP if there exist $\epsilon > 0$ and $n_0$ such that for all $n \ge n_0$ we have $f_n(a_{1:n})(a_{n+1}) \ge f_n(a_{1:n})(\sigma) + \epsilon$ for all $\sigma \in \Sigma \setminus \{a_{n+1}\}$.
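
To make the EL condition concrete, here is a toy sketch: a bigram-count predictor over a binary alphabet (my own construction, not from the paper, and not claimed to be a uniformly continuous CSP) together with the confidence margin from the EL definition:

```python
from collections import Counter

SIGMA = "01"

def toy_predictor(prefix: str) -> dict:
    """A stand-in for f_n: a distribution over SIGMA built from
    bigram counts conditioned on the last token of the prefix."""
    counts = Counter()
    for i in range(len(prefix) - 1):
        if prefix[i] == prefix[-1]:
            counts[prefix[i + 1]] += 1
    total = sum(counts.values())
    if total == 0:
        return {s: 1 / len(SIGMA) for s in SIGMA}  # uniform fallback
    return {s: counts[s] / total for s in SIGMA}

def el_margin(seq: str, n: int) -> float:
    """f_n(a_{1:n})(a_{n+1}) - max_sigma f_n(a_{1:n})(sigma);
    EL requires this to stay >= some epsilon > 0 for all large n."""
    dist = toy_predictor(seq[:n])
    true_next = seq[n]
    return dist[true_next] - max(dist[s] for s in SIGMA if s != true_next)

a = "01" * 50  # periodic sequence; the bigram model predicts it perfectly
print([el_margin(a, n) for n in (5, 10, 50)])  # -> [1.0, 1.0, 1.0]
```

For this periodic sequence the margin is pinned at 1, so any $\epsilon \le 1$ witnesses EL; for a sequence the predictor cannot track, the margin would dip below every fixed $\epsilon$ infinitely often.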

Two points about EL (both made by the authors):

We may also define "eventual equality" (EE): a sequence $a \in \Sigma^\omega$ is eventually equal to $b \in \Sigma^\omega$ (denoted $a \sim b$) if there exists $n_1$ such that for all $n \ge n_1$ we have $a_n = b_n$. In other words, $b$ is just a perturbation of $a$ on a finite number of elements.

Having $a \sim b$ means that $\lim_{n \to \infty} d^*(a_{1:n}, b_{1:n}) = 0$. After some number of terms we move past all of the perturbations and the last tokens always match, $a_n = b_n$. Say there are $k$ perturbations in total between the two sequences. Then, for $n$ beyond the last perturbation,

$$d^*(a_{1:n}, b_{1:n}) = d_H(a_{1:(n-1)}, b_{1:(n-1)}) = \frac{1}{n-1}\left|\{\, i \in \{1,\dots,n-1\} : a_i \neq b_i \,\}\right| = \frac{k}{n-1},$$

and so $a \sim b$ implies what we would want: as far as $d^*$ is concerned, the full sequences are the same.
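
A quick numeric check of this computation, with a hypothetical `d_star` helper (my own naming) and $k = 3$ perturbations:

```python
def d_star(a: str, b: str) -> float:
    """d*(a,b) = 1[a_n != b_n] + d_H of the length-(n-1) prefixes."""
    last = 1.0 if a[-1] != b[-1] else 0.0
    return last + sum(x != y for x, y in zip(a[:-1], b[:-1])) / (len(a) - 1)

a = "0" * 1000
b = "111" + a[3:]  # k = 3 perturbations up front, then b agrees with a
for n in (10, 100, 1000):
    print(n, d_star(a[:n], b[:n]))  # 3/(n-1): decays toward 0
```

The printed distances are exactly $3/(n-1)$, matching the formula above.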

If $a$ is eventually learnable and $b \sim a$, then $b$ is also eventually learnable. We can sketch the equivalent of Proposition 1 in the paper.

Sketch: Suppose $a$ is eventually learnable and $a \sim b$. Then $\lim_{n \to \infty} d^*(a_{1:n}, b_{1:n}) = 0$, which implies $\|f_n(a_{1:n}) - f_n(b_{1:n})\| \to 0$ (due to uniform continuity).

Why is $b$ eventually learnable? The core argument is that since $a \sim b$, the final terms eventually agree, say for all $n$ greater than some $n_1$. From uniform continuity, given a fixed $\epsilon_1 > 0$, we can find $n_2 \ge n_1$ such that for all $n \ge n_2$,

$$f_n(b_{1:n})(b_{n+1}) \ge f_n(a_{1:n})(a_{n+1}) - \epsilon_1,$$

(remembering that $a_{n+1} = b_{n+1}$). However, $a$ is EL, and so there exist $\epsilon > 0$ and $n_3$ such that $n \ge n_3$ implies

$$f_n(b_{1:n})(b_{n+1}) \ge f_n(a_{1:n})(a_{n+1}) - \epsilon_1 \ge f_n(a_{1:n})(\sigma) + \epsilon - \epsilon_1,$$

and now it is just a matter of applying uniform continuity again, this time with $\sigma \neq b_{n+1}$, to get

$$f_n(a_{1:n})(\sigma) \ge f_n(b_{1:n})(\sigma) - \epsilon_1,$$

and the rest is just tidying up the $\epsilon$'s.
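
For completeness, here is one way the tidying can go (my own bookkeeping; the paper may organize the constants differently). Chaining the inequalities:

```latex
f_n(b_{1:n})(b_{n+1})
  \ge f_n(a_{1:n})(a_{n+1}) - \epsilon_1             % uniform continuity
  \ge f_n(a_{1:n})(\sigma) + \epsilon - \epsilon_1   % EL of a
  \ge f_n(b_{1:n})(\sigma) + \epsilon - 2\epsilon_1  % uniform continuity again
```

Choosing $\epsilon_1 = \epsilon/3$ leaves a margin of $\epsilon/3 > 0$ for all $n \ge \max(n_2, n_3)$, so $b$ is EL with constant $\epsilon/3$.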

Isolation of learnable sequences

One of the fun results from the paper is that things stop working when we make an infinite number of perturbations to a sequence $a \in \Sigma^\omega$. To state this, the authors extend the Hamming metric to $\Sigma^\omega$:

$$d_H(a,b) = \liminf_{n \to \infty} d_H(a_{1:n}, b_{1:n}).$$

We can write the Isolation theorem (proved in the paper) in the CSP language.

Let $(f_n)$ be a CSP. Then, for any $a \in \Sigma^\omega$ that is EL by $(f_n)$, there exists $\delta > 0$ such that no sequence $b$ that (1) differs from $a$ at infinitely many positions and (2) satisfies $d_H(a,b) < \delta$ is EL by $(f_n)$.

What does this say? Essentially, around each eventually-learnable $a \in \Sigma^\omega$ there is a Hamming ball $B_\delta(a)$ such that any $b \in B_\delta(a)$ can be EL by the CSP only if $a \sim b$, i.e., only if $b$ differs from $a$ at finitely many positions. The poor EL sequences are isolated.

Sketch: The trick here is that if both sequences are EL, then the CSP must confidently tell the two sequences apart at the infinitely many positions where they differ. However, the CSP is also uniformly continuous, so there is a limit to how far it can pull them apart, especially since we can pick sequences that are very close in Hamming distance and still differ infinitely often (change $a$ at every $k$-th position, for example). You can see how the contradiction arises.
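
The "change $a$ at every $k$-th position" construction can be sketched numerically (a hypothetical helper on a long finite prefix, standing in for the infinite sequences): $b$ differs from $a$ at infinitely many positions in the limit, yet the prefix Hamming distance settles at $1/k$, which can be pushed below any $\delta$ by taking $k$ large.

```python
def d_hamming(a: str, b: str) -> float:
    """Relativized Hamming distance on equal-length prefixes."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

k = 100
a = "0" * 10_000                      # a long prefix of the all-zeros sequence
b = "".join("1" if (i + 1) % k == 0 else c for i, c in enumerate(a))
print(d_hamming(a, b))  # 1/k = 0.01, despite infinitely many flips in the limit
```

So for any $\delta > 0$, choosing $k > 1/\delta$ produces a $b$ inside the $\delta$-ball around $a$ that still differs from $a$ infinitely often – exactly the kind of neighbour the isolation theorem says cannot also be EL.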

In a follow-up, I’ll look at more consequences of this framework.

  1. Theorem 1 of https://arxiv.org/abs/2505.10606