(づ•ᴥ•)づ┬─┬


A few notes on directed information

TL;DR: Notes on directed information, inspired by the use of directed information in Abel, David et al. “Plasticity as the Mirror of Empowerment.” arXiv:2505.10361 (2025). I want to eventually write some notes on that work as well.

Mutual Information

Quick recap of mutual information; for more, see the wiki.

Suppose we have random variables $X, Y$ taking values in some space $\mathcal{X} \times \mathcal{Y}$. Then the mutual information (MI) captures, on average, how much information we learn about $X$ if we observe $Y$. MI is defined as

$$I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} = \mathrm{KL}\big(p(x,y) \,\|\, p(x)\,p(y)\big).$$

Note the tension between the joint $p(x,y)$ and the product of marginals $p(x)\,p(y)$.

There are lots of useful relationships involving $I$, but here I’ll only mention a few:

  1. Symmetry: $I(X;Y) = I(Y;X)$. Because of symmetry, there is no notion of causality or time in MI.
  2. Non-negativity: $I(X;Y) \ge 0$.
  3. Expressing in terms of entropy / relative entropy: $I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$.

From 3., we can also get that if $X$ and $Y$ are independent, then $I(X;Y) = 0$, i.e., there’s no information to be learnt from $Y$ about $X$. At the other extreme, if, for example, $X = Y$, then $H(X|Y) = 0$, so $I(X;X) = H(X)$, i.e., we have learned all there is to know.
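As a quick sanity check of the definition, here is a minimal Python sketch (the joint table is an arbitrary, made-up choice) that computes $I(X;Y)$ directly from the sum and verifies $I(X;X) = H(X)$:

```python
from math import log2

# A small joint distribution p(x, y) over {0,1} x {0,1}
# (the numbers are purely illustrative).
p = {(0, 0): 0.4, (0, 1): 0.1,
     (1, 0): 0.1, (1, 1): 0.4}
px = {x: p[(x, 0)] + p[(x, 1)] for x in (0, 1)}  # marginal p(x)
py = {y: p[(0, y)] + p[(1, y)] for y in (0, 1)}  # marginal p(y)

# I(X;Y) = sum_{x,y} p(x,y) log [ p(x,y) / (p(x) p(y)) ]
mi = sum(v * log2(v / (px[x] * py[y])) for (x, y), v in p.items() if v > 0)

# Check I(X;X) = H(X): the "joint" of X with itself puts mass p(x)
# on the diagonal (x, x), so each summand is p(x) log(1/p(x)).
hx = -sum(v * log2(v) for v in px.values())
mi_xx = sum(v * log2(v / (v * v)) for v in px.values())
```

Here `mi` comes out strictly positive, since the joint is not a product of its marginals, and `mi_xx` coincides with the entropy `hx`.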

Directed Information

To introduce directed information, we will use notation from the original paper by Massey (1990)¹.

For $n \ge 1$, let $X^n := [X_1, \dots, X_n]$, each component being a discrete random variable. Values of these will be denoted by $x^n := [x_1, \dots, x_n]$. We use square brackets to denote that these objects are sequences of random variables.

In the original work¹, Massey was studying the behavior of a discrete channel with input ($X$) and output ($Y$) random variables. One can make all sorts of assumptions on how an input variable $X_k$ affects the corresponding output variable $Y_k$. For example, we can assume that $X_1 \to Y_1$, then $X_2 \to Y_2$ (arrows here denote dependence), and so forth, each generation being independent of the previous one.

A bit more interesting is to assume that each output variable plays a role in the generation of the next one, with a DAG looking something like this:

*(Figure: the channel DAG, with each output $Y_i$ feeding into the generation of the following input–output pairs.)*

How much information do we get about $Y^N$ if we know $X^N$? We could try to answer this question with $I(X^N; Y^N)$, but this is a symmetric quantity and doesn’t take into account the order in which the variables enter the picture.

This is where directed information (DI) can help:

$$I(X^N \to Y^N) := \sum_{i=1}^{N} I(X^i; Y_i \mid Y^{i-1}).$$

Here $I(X^i; Y_i \mid Y^{i-1})$ is regular conditional MI. DI captures how information flows through the channel, from input to output.

There are a few nice properties of DI as well, here are some I liked:

  1. $I(X^N \to Y^N) \ne I(Y^N \to X^N)$ in general: this asymmetry is characteristic of DI. For example, one can exhibit it with $N = 2$ by taking $Y_2 = X_1$ and letting $Y_1$ be independent of everything else: the forward DI is then $H(X_1)$, while the backward DI vanishes.
  2. $I(X^N \to Y^N) \le I(X^N; Y^N)$: each term of the DI sum contains the partial sequence $X^i$, and replacing it by the full sequence $X^N$ can only increase mutual information, which then gives the result (by means of a telescoping sum). Equality holds when there is no feedback from output back to input.
  3. Conservation of information! $I(X^N; Y^N) = I(X^N \to Y^N) + I(Y^{N-1} \to X^N)$; note that the second term starts from $Y^{N-1}$. Mutual information captures both the forward and the backward flow of information, hence the symmetry.
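The conservation law can be checked numerically by brute-force enumeration. The sketch below builds a toy two-step feedback channel (the channel and its parameters are assumptions for illustration, not from Massey’s paper), computes the forward DI, the delayed backward DI $I(Y^{N-1} \to X^N) = \sum_i I(Y^{i-1}; X_i \mid X^{i-1})$, and the plain MI, and confirms that the first two sum to the third:

```python
from itertools import product
from math import log2

# Toy 2-step channel with feedback (illustrative assumption):
#   X1 ~ Bernoulli(1/2),  Y1 = X1 xor noise(0.1),
#   X2 = Y1 (feedback),   Y2 = X2 xor noise(0.1).
EPS = 0.1
joint = {}  # keys: (x1, x2, y1, y2)
for x1, n1, n2 in product((0, 1), repeat=3):
    y1 = x1 ^ n1
    x2 = y1                      # feedback: next input copies last output
    y2 = x2 ^ n2
    prob = 0.5 * (EPS if n1 else 1 - EPS) * (EPS if n2 else 1 - EPS)
    k = (x1, x2, y1, y2)
    joint[k] = joint.get(k, 0.0) + prob

def marg(idx):
    """Marginal distribution over the coordinates in `idx`."""
    out = {}
    for k, p in joint.items():
        key = tuple(k[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def cmi(a, b, c):
    """Conditional mutual information I(A; B | C); a, b, c are index tuples."""
    pabc, pac, pbc, pc = marg(a + b + c), marg(a + c), marg(b + c), marg(c)
    total = 0.0
    for k, p in pabc.items():
        if p <= 0:
            continue
        ka, kb, kc = k[:len(a)], k[len(a):len(a) + len(b)], k[len(a) + len(b):]
        total += p * log2(p * pc[kc] / (pac[ka + kc] * pbc[kb + kc]))
    return total

# Coordinates: 0 = X1, 1 = X2, 2 = Y1, 3 = Y2.
forward = cmi((0,), (2,), ()) + cmi((0, 1), (3,), (2,))  # I(X^2 -> Y^2)
backward = cmi((2,), (1,), (0,))   # I(Y^1 -> X^2) = I(Y1; X2 | X1)
mi = cmi((0, 1), (2, 3), ())       # I(X^2; Y^2)
print(forward, backward, mi)
```

For this channel the forward and backward terms differ (illustrating property 1 as well), and they add up to the mutual information exactly, as property 3 demands.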

Proving conservation of information (sketch)

There are various ways to show 3 for DI. For example, for an appropriate definition of conditional entropy $H(Y^N \| X^N)$ (called causally conditioned entropy), we get an equivalent formula to MI’s 3:

$$I(X^N \to Y^N) = H(Y^N) - H(Y^N \| X^N).$$

Then it is sufficient to show that

$$H(X^N, Y^N) = H(X^N \| Y^{N-1}) + H(Y^N \| X^N),$$

which requires the decomposition rule for causal conditioning (denoted by $\|$)

$$P(x^N, y^N) = P(x^N \| y^{N-1})\, P(y^N \| x^N)$$

and the definition of joint entropy as

$$H(X^N, Y^N) := -\sum_{x^N, y^N} P(x^N, y^N) \log P(x^N, y^N).$$
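For concreteness, here is the $N = 2$ case of the conservation law, obtained with nothing more than the standard chain rule for mutual information:

$$\begin{aligned} I(X^2; Y^2) &= I(X^2; Y_1) + I(X^2; Y_2 \mid Y_1) \\ &= I(X_1; Y_1) + I(X_2; Y_1 \mid X_1) + I(X^2; Y_2 \mid Y_1) \\ &= \underbrace{I(X_1; Y_1) + I(X^2; Y_2 \mid Y_1)}_{I(X^2 \to Y^2)} + \underbrace{I(Y_1; X_2 \mid X_1)}_{I(Y^1 \to X^2)}. \end{aligned}$$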

The $I(Y^{N-1} \to X^N)$ term

The backward $I(Y^{N-1} \to X^N)$ can be interpreted as the strength of feedback: how much past outputs influence future inputs.

Generalized Directed Information

To handle causal interactions that may begin or end at different times, we can generalize Massey’s directed information to arbitrary causal windows. Following Abel et al. (2025), the Generalized Directed Information (GDI) between sequences $X_{a:b}$ and $Y_{c:d}$ is defined as

$$I(X_{a:b} \to Y_{c:d}) := \sum_{i=\max(a,c)}^{d} I\big(X_{a:\min(b,i)};\, Y_i \mid X_{1:a-1},\, Y_{1:i-1}\big).$$

This quantity measures how much information flows from the segment $X_{a:b}$ into the segment $Y_{c:d}$, respecting the causal structure of the process. When $a = c = 1$ and $b = d = n$, it reduces to Massey’s original directed information $I(X^n \to Y^n)$.
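The reduction to Massey’s DI can be sanity-checked numerically. The sketch below (the joint distribution is an arbitrary assumption, just fixed made-up weights over two input bits and two output bits) implements the GDI sum from the definition, computes Massey’s DI separately, and confirms they coincide for $a = c = 1$, $b = d = N$:

```python
from itertools import product
from math import log2

N = 2
# Arbitrary joint over (x1, x2, y1, y2); the weights are made-up numbers.
w = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]
keys = list(product((0, 1), repeat=2 * N))
joint = {k: wi / sum(w) for k, wi in zip(keys, w)}
# Coordinate layout: X_i at index i-1, Y_i at index N+i-1.

def marg(idx):
    """Marginal distribution over the coordinates in `idx`."""
    out = {}
    for k, p in joint.items():
        key = tuple(k[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def cmi(a, b, c):
    """Conditional mutual information I(A; B | C); a, b, c are index tuples."""
    pabc, pac, pbc, pc = marg(a + b + c), marg(a + c), marg(b + c), marg(c)
    total = 0.0
    for k, p in pabc.items():
        if p <= 0:
            continue
        ka, kb, kc = k[:len(a)], k[len(a):len(a) + len(b)], k[len(a) + len(b):]
        total += p * log2(p * pc[kc] / (pac[ka + kc] * pbc[kb + kc]))
    return total

def gdi(a, b, c, d):
    """I(X_{a:b} -> Y_{c:d}) per the GDI definition."""
    total = 0.0
    for i in range(max(a, c), d + 1):
        A = tuple(range(a - 1, min(b, i)))                # X_{a:min(b,i)}
        B = (N + i - 1,)                                  # Y_i
        C = tuple(range(a - 1)) + tuple(N + j for j in range(i - 1))  # X_{1:a-1}, Y_{1:i-1}
        total += cmi(A, B, C)
    return total

def massey_di():
    """I(X^N -> Y^N) = sum_i I(X^i; Y_i | Y^{i-1})."""
    return sum(cmi(tuple(range(i)), (N + i - 1,),
                   tuple(N + j for j in range(i - 1)))
               for i in range(1, N + 1))

print(gdi(1, N, 1, N), massey_di())
```

With $a = c = 1$ and $b = d = N$, the window boundaries disappear and the two sums agree term by term, whatever the joint distribution.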

GDI also satisfies an entropy-difference identity, sometimes called the Kramer Decomposition:

$$I(X_{a:b} \to Y_{c:d}) = H\big(Y_{\max(a,c):d} \mid X_{1:a-1}\big) - \sum_{i=\max(a,c)}^{d} H\big(Y_i \mid Y_{1:i-1},\, X_{a:\min(b,i)}\big).$$

This form generalizes the standard relation $I(X^n \to Y^n) = H(Y^n) - H(Y^n \| X^n)$, extending it to arbitrary causal intervals.

  1. Massey, J. (1990, November). Causality, feedback and directed information. In Proc. Int. Symp. Inf. Theory Applic. (ISITA-90), Vol. 2, p. 1.