Some Linear Algebra from Lewis Signaling Games
TL;DR: In this post, I jot around some linear algebra inspired by Lewis signaling games. I was introduced to them by reading this nice blogpost by Tomek Korbak. The optimal solutions turn out to be permutation matrices—in fact, these are the only stochastic matrices with stochastic inverses. When the sender is fixed and fuzzy, the best the receiver can do is Bayesian inference.
In the simplest signaling game, two players, a sender and a receiver have to collaborate towards an end goal. Some notation first:
World states: we generate , where “W” represents the “world” process and is its state at time . The state can take one of the values
Message states: can observe this state and emit a message, , where we assume ; represents a message from to at time (It is important that the set of possible symbols is finite).
Action states:, looks at the message and picks an action, , and again we assume .
Optimal action given world state: It is assumed that each world state has a unique optimal action, and so, without loss of generality, we can further assume that the mapping between world states and optimal actions is the identity, i.e., if , then the optimal action is .
The point of the game: can only see the world by deciphering what is communicating about the world. wants to pick the best action for each world state and wants to succeed – their interests are aligned!
Setting up the algebra
Abusing notation a bit, let us assume that and are represented by transition matrices, i.e., both have non-negative elements and each of their columns1 sum to . By this definition, the permutation matrices in are also transition matrices.
Then, if the distribution of is represented by the probability vector , the sender would pick a message state with distribution
and the distribution of actions at time
So, in short:
Since we assumed that the mapping from world states to actions is the identity, the signaling game solution would satisfy
for all and so would encode information from the world and would decode it to pick the best action.
What if is not a permutation matrix? For example,
This can’t be good for : the state would be encoded as half of the time and as the rest of the time, so it will be impossible for to always pick the best action. Fuzziness allows for mistakes, which is why picking S to be a permutation matrix gives an optimal solution.
There are such permutation matrices and any one of them would work, they are the Nash equilibriums of this game. Without extra constraints, it is up to the players to pick which of those solutions work.
What if is fixed?
If is a fixed permutation matrix, then the best (as discussed above). We could see this as if decoding a fixed, but unknown, language.
What if is not a permutation matrix?
Then doesn’t quite work, the inverse of an arbitrary transition matrix is not necessarily a transition matrix. For example
would give
which has a negative element2.
As discussed, fuzziness creates mistakes, so it makes sense that a lossless decoder does not exist in this case.
We are now getting further from the realm of signaling games, but we can ask what would be the optimal in this case. What is the best choice for an action when we get a message ? We can get the distribution of from Bayes theorem:
and then picking the best action given a message is just a matter of 3 .
Last thoughts
In this post, we wrote the linear algebra version of signaling games by using transition matrices, separating the cases where good equilibria exist and those that are not so lucky.
We didn't touch on this, but an additional way to think about LLMs: during pre-training, they act as receivers trying to decode a fixed sender (human language). Since human language is fuzzy, S is far from a permutation matrix, perfect decoding is impossible, and the best they can do is Bayesian inference.
By convention, transition matrices have rows sum to one, but then we would have to write everything as instead of .↩
In fact, if both and are transition matrices (non–negative and with columns summing to ), then (and ) must be a permutation matrix. (More generally: if and but we drop the “columns sum to ” constraint, then must be monomial.)↩
Remember that we assumed that the optimal world-to-action mapping is the identity.↩