Some Linear Algebra from Lewis Signaling Games

TL;DR: In this post, I jot down some linear algebra inspired by Lewis signaling games. I was introduced to them by reading a nice blogpost by Tomek Korbak. The optimal solutions turn out to be permutation matrices; in fact, these are the only stochastic matrices with stochastic inverses. When the sender is fixed and fuzzy, the best the receiver can do is Bayesian inference.

In the simplest signaling game, two players, a sender $S$ and a receiver $R$, have to collaborate towards an end goal. Some notation first:

World states: we generate $W_t \sim W(W_t \mid W_{t-1})$, where $W$ represents the "world" process and $W_t$ is its state at time $t$. The state can take one of the values in $I = \{1, \dots, N\}$.

Message states: $S$ can observe this state and emit a message, $M_t \sim S(M_t \mid W_t)$, where we assume $M_t \in I$; $M_t$ represents a message from $S$ to $R$ at time $t$. (It is important that the set of possible symbols is finite.)

Action states: $R$ looks at the message and picks an action, $A_t \sim R(A_t \mid M_t)$, and again we assume $A_t \in I$.

Optimal action given world state: it is assumed that each world state has a unique optimal action, so, without loss of generality, we can further assume that the mapping between world states and optimal actions is the identity, i.e., if $W_t = k$, then the optimal action is $A_t = k$.

The point of the game: $R$ can only see the world by deciphering what $S$ communicates about it. $R$ wants to pick the best action for each world state, and $S$ wants $R$ to succeed: their interests are aligned!

Setting up the algebra

Abusing notation a bit, let us assume that $R$ and $S$ are represented by transition matrices, i.e., both have non-negative elements and each of their columns¹ sums to 1. By this definition, the permutation matrices in $\mathbb{R}^{n \times n}$ are also transition matrices.

Then, if the distribution of $W_t$ is represented by the probability vector $p_{W_t}$, the sender would pick a message state with distribution

$$p_{M_t} := S \, p_{W_t}$$

and the distribution of actions at time $t$ is

$$p_{A_t} := R \, p_{M_t}.$$

So, in short:

$$p_{A_t} = R S \, p_{W_t}.$$

Since we assumed that the mapping from world states to actions is the identity, the signaling game solution would satisfy

$$(I - RS) \, p_{W_t} = 0,$$

for all $p_{W_t}$, and so $RS = I$. $S$ would encode information from the world and $R$ would decode it to pick the best action.
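As a quick sanity check, here is a minimal numpy sketch (the 3-state permutation and the distribution are illustrative, not from the post): when $S$ is a permutation matrix and $R = S^{-1} = S^\top$, composing receiver and sender recovers the world distribution exactly.

```python
import numpy as np

# An illustrative 3-state sender: a permutation matrix (column-stochastic).
S = np.array([[0., 1., 0.],
              [0., 0., 1.],
              [1., 0., 0.]])

# For a permutation matrix, the inverse is just the transpose.
R = S.T

# Any distribution over world states (illustrative values).
p_W = np.array([0.2, 0.5, 0.3])

# p_A = R S p_W: the action distribution matches the world distribution.
p_A = R @ S @ p_W
assert np.allclose(R @ S, np.eye(3))
assert np.allclose(p_A, p_W)
```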

What if $S$ is not a permutation matrix? For example,

$$M_t \sim S \, p_{W_t} = \begin{pmatrix} 0.5 & 0.9 \\ 0.5 & 0.1 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 0.5 \\ 0.5 \end{pmatrix}.$$

This can't be good for $R$: the state $W_t = 1$ would be encoded as $M_t = 1$ half of the time and as $M_t = 2$ the other half, so it will be impossible for $R$ to always pick the best action. Fuzziness allows for mistakes, which is why picking $S$ to be a permutation matrix gives an optimal solution.
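Running the example above numerically (a small numpy sketch) makes the problem visible: the same world state is split evenly across both messages.

```python
import numpy as np

# The fuzzy sender from the example above (column-stochastic, not a permutation).
S = np.array([[0.5, 0.9],
              [0.5, 0.1]])

# World is in the first state with certainty.
p_W = np.array([1.0, 0.0])

p_M = S @ p_W
print(p_M)  # [0.5 0.5]: the state is encoded as either message equally often
```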

There are $n!$ such permutation matrices, and any one of them would work: they are the Nash equilibria of this game. Without extra constraints, it is up to the players to pick which of these solutions to use.
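To illustrate, a short sketch (assuming $n = 3$; only numpy and itertools) enumerating all $n!$ permutation senders and checking that each one, paired with its transpose as receiver, satisfies $RS = I$:

```python
import itertools
import numpy as np

n = 3
eye = np.eye(n)

count = 0
for perm in itertools.permutations(range(n)):
    S = eye[:, list(perm)]  # permutation matrix: sends state w to message perm[w]
    R = S.T                 # the inverse of a permutation matrix is its transpose
    assert np.allclose(R @ S, eye)
    count += 1

print(count)  # 6 = 3! distinct equilibria for n = 3
```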

What if S is fixed?

If $S$ is a fixed permutation matrix, then the best receiver is $R = S^{-1}$ (as discussed above). We can see this as decoding a fixed, but unknown, language.

What if S is not a permutation matrix?

Then $R = S^{-1}$ doesn't quite work: the inverse of an arbitrary transition matrix is not necessarily a transition matrix. For example,

$$S = \begin{pmatrix} 0.5 & 1.0 \\ 0.5 & 0.0 \end{pmatrix}$$

would give

$$R = S^{-1} = \begin{pmatrix} 0 & 2 \\ 1 & -1 \end{pmatrix},$$

which has a negative element².

As discussed, fuzziness creates mistakes, so it makes sense that a lossless decoder does not exist in this case.
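This is easy to check numerically; a minimal numpy verification of the example above:

```python
import numpy as np

# The fuzzy sender from the example above.
S = np.array([[0.5, 1.0],
              [0.5, 0.0]])

R = np.linalg.inv(S)  # approximately [[0, 2], [1, -1]]

# R inverts S, but it has a negative entry, so it is not a transition matrix.
assert np.allclose(R @ S, np.eye(2))
assert (R < 0).any()
```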

We are now getting further from the realm of signaling games, but we can still ask what the optimal $R$ would be in this case. What is the best choice of action when we get a message $M \sim S \, p_W$? We can get the distribution of $W \mid M$ from Bayes' theorem:

$$P(W = w \mid M = m) = \frac{P(M = m \mid W = w) \, P(W = w)}{P(M = m)},$$

and then picking the best action given a message $M = m$ is just a matter of computing $\operatorname{argmax}_w P(W = w \mid M = m)$³.
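A minimal numpy sketch of this Bayes-optimal receiver, using the fuzzy sender from the example above (the uniform prior over world states is an assumption, and indices here are 0-based rather than 1-based as in the text):

```python
import numpy as np

# S[m, w] = P(M = m | W = w): the fuzzy sender from the example above.
S = np.array([[0.5, 1.0],
              [0.5, 0.0]])

# Assumed uniform prior over world states.
p_W = np.array([0.5, 0.5])

def best_action(m):
    # Posterior: P(W = w | M = m) is proportional to P(M = m | W = w) P(W = w).
    posterior = S[m, :] * p_W
    posterior = posterior / posterior.sum()
    # Since the optimal world-to-action mapping is the identity,
    # the best action is the most likely world state.
    return int(np.argmax(posterior))

print(best_action(0))  # 1: message 0 is twice as likely to come from state 1
print(best_action(1))  # 0: message 1 can only come from state 0
```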

Last thoughts

In this post, we wrote down the linear algebra version of signaling games using transition matrices, separating the cases where good equilibria exist from those that are not so lucky.

We didn't touch on this, but there is an additional way to think about LLMs: during pre-training, they act as receivers trying to decode a fixed sender (human language). Since human language is fuzzy, $S$ is far from a permutation matrix, perfect decoding is impossible, and the best they can do is Bayesian inference.

  1. By convention, transition matrices have rows that sum to one, but then we would have to write everything as $x P$ (with $x$ a row vector) instead of $P x$.

  2. In fact, if both $S$ and $S^{-1}$ are transition matrices (non-negative and with columns summing to 1), then $S$ (and $S^{-1}$) must be a permutation matrix. (More generally: if $S \ge 0$ and $S^{-1} \ge 0$ but we drop the "columns sum to 1" constraint, then $S$ must be monomial.)

  3. Remember that we assumed that the optimal world-to-action mapping is the identity.