Causal Computational Asymmetry - train two networks, the faster one is the cause
Assume two variables are correlated. which one causes the other? if you can run a controlled experiment, you answer that by forcing one variable to change and watching the other. but most of the time in science, medicine, economics, and policy you can't do that. you have observational data, things that already happened, and the question of which variable drove which sits unresolved no matter how large your dataset gets. does poverty cause poor health outcomes, or do health problems push people into poverty? does social media use cause depression, or do people who are already depressed spend more time scrolling? does inflammation drive cardiovascular disease, or does cardiovascular damage trigger inflammation as a downstream response? the data shows correlation in all of these cases, and correlation doesn't tell you the direction.
what this paper proposes is a new way to read that direction from data alone, using a signal that nobody had looked at before: how fast a neural network converges when you train it. if you train one network to predict $Y$ from $X$ and another to predict $X$ from $Y$, the network trained in the true causal direction converges faster. that observation is not a heuristic or an empirical pattern someone noticed, it follows from a formal proof with three chained lemmas. the paper is “Causal Direction from Convergence Time” by Abdulrahman Tamim, February 2026.
The ladder of causation and why modern AI cannot climb it
Judea Pearl won the Turing Award in 2011. his main contribution, developed over roughly two decades starting in the late 1980s, is the formal mathematical framework for reasoning about cause and effect using directed graphs and an algebra of interventions. his 2000 textbook Causality is the technical reference, and his 2018 popular book The Book of Why covers the ideas for a general audience. both organize causal reasoning around a hierarchy he calls the Ladder of Causation.
the Causal Hierarchy Theorem, proved formally by Bareinboim, Correa, Ibeling, and Icard in 2022, establishes that the three rungs of this ladder almost always require genuinely different kinds of information. “almost always” here has a precise technical meaning: it holds for all but a measure-zero set of distributions, which in any practical setting means it always holds.
rung one: seeing. this rung covers all questions of the form $P(Y \mid X = x)$: given that $X$ took value $x$, what is the distribution of $Y$? every statistical and machine learning method ever built lives here. linear regression, logistic regression, decision trees, random forests, support vector machines, neural networks, transformers, diffusion models, all of them learn mappings from inputs to outputs using historical data. GPT-4, like every other language model, is a rung-one system: it models the conditional distribution of text given prior text, learned from a corpus of human-generated language. being on rung one doesn't mean being unsophisticated or weak. it means the system is learning associations from observed data, and association is fundamentally different from causation.
rung two: doing. this rung covers questions of the form $P(Y \mid \mathrm{do}(X = x))$: if you actively intervene and force $X$ to take value $x$, what happens to $Y$? the $\mathrm{do}(\cdot)$ notation was introduced by Pearl specifically because no pre-existing mathematical symbol captured this concept cleanly. conditioning on $X = x$ means you're filtering your data to cases where $X$ happened to be $x$, but all the factors that influenced $X$ in the first place are still present and operating. intervening on $X = x$ means you've disconnected $X$ from all its causes and forced it to the value you chose, like actually changing a variable in the world rather than just observing when it happens to take a particular value.
a randomized controlled trial is the physical implementation of a do-intervention: by assigning treatment randomly, you break every correlation between the treatment variable and every other characteristic of the participants, which is exactly what the do-operator does mathematically when it cuts the incoming arrows to $X$.
Pearl developed the do-calculus as a formal system for computing interventional quantities from observational data when the causal graph is known. it has three rules. rule 1 handles insertion and deletion of observations. rule 2 handles exchange between actions and observations. rule 3 handles insertion and deletion of actions. Shpitser and Pearl proved in 2006 that these three rules are complete: if you cannot reduce $P(y \mid \mathrm{do}(x))$ to a purely observational expression using these rules, then no non-parametric method can identify the causal effect from observational data. this is not a limitation of the calculus, it is a characterization of what is mathematically possible.
rung three: imagining. this rung covers counterfactual questions: $P(Y_{X=x'} = y' \mid X = x, Y = y)$, meaning the probability that $Y$ would have taken value $y'$ if $X$ had been $x'$, given that in the actual world $X$ was $x$ and $Y$ was $y$. you took the treatment and recovered. would you have recovered without it? you hired this employee and they underperformed. would a different candidate have done better? these questions involve reasoning about worlds that didn't happen, and they require knowing not just the causal structure but also the specific noise realizations that led to the observed outcome.
the fundamental impossibility Pearl established in 1995: no algebraic manipulation of any observational distribution $P(Y \mid X)$, regardless of how much data you have, can produce $P(Y \mid \mathrm{do}(X))$ without structural assumptions about the underlying causal graph. this is what “no causes in, no causes out” means in the causal inference literature. a system trained purely on observational data cannot answer interventional questions, by proof.
CCA addresses the first step toward rung-two reasoning: before the do-calculus can compute $P(Y \mid \mathrm{do}(X))$, you need to know whether the edge between $X$ and $Y$ points as $X \to Y$ or $Y \to X$. that's the bivariate causal direction problem, and that's what this paper solves.
Prerequisites: the mathematical foundations you need
every concept used in the proofs and experiments is explained here from scratch. the goal is that someone who knows high school calculus and basic algebra can follow the full argument without needing to look anything up externally.
What a neural network is and how it represents functions
a neural network is a parametrized mathematical function built by composing layers. each layer takes a vector of numbers as input, multiplies it by a learned weight matrix, adds a learned bias vector, and passes the result through a nonlinear activation function element-wise. with $L$ layers this looks like:
\(a^{(1)} = \sigma(W^{(1)} x + b^{(1)})\) \(a^{(2)} = \sigma(W^{(2)} a^{(1)} + b^{(2)})\) \(\vdots\) \(\hat{y} = W^{(L)} a^{(L-1)} + b^{(L)}\)
$x$ is the input, $\hat{y}$ is the output, $W^{(\ell)}$ and $b^{(\ell)}$ are the weight matrix and bias vector for layer $\ell$, and $\sigma$ is the activation function. the weight matrices and bias vectors are collectively called the parameters of the network, written $\theta$, and they are what get adjusted during training.
the reason you need the activation function $\sigma$ at all is that without it the whole construction collapses to a single linear map. composing linear functions gives you another linear function: $W^{(2)}(W^{(1)} x + b^{(1)}) + b^{(2)}$ is just $Ax + c$ for some matrix $A$ and vector $c$. a network of any depth without nonlinear activations can only represent linear relationships between inputs and outputs, which is useless for most real problems. the activation function is what gives the network the ability to represent curved, nonlinear mappings.
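the collapse is easy to verify numerically. a minimal numpy sketch (layer shapes and the random seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# two stacked "layers" with no activation function in between
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

def two_layers_no_activation(x):
    return W2 @ (W1 @ x + b1) + b2

# the equivalent single linear map A x + c
A = W2 @ W1
c = W2 @ b1 + b2

x = rng.normal(size=3)
assert np.allclose(two_layers_no_activation(x), A @ x + c)

# inserting a nonlinearity between the layers breaks the collapse
def two_layers_tanh(x):
    return W2 @ np.tanh(W1 @ x + b1) + b2

assert not np.allclose(two_layers_tanh(x), A @ x + c)
```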
two activations appear in the experiments in this paper:
tanh: $\sigma(z) = \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$. this squashes any real number into the range $(-1, 1)$. it's smooth and differentiable everywhere, which is useful for gradient-based optimization. it saturates for large $\lvert z \rvert$, meaning the output gets stuck near $-1$ or $+1$ and the gradient becomes very small, which can slow training.

ReLU (rectified linear unit): $\sigma(z) = \max(0, z)$. this outputs zero for any negative input and the input itself for any positive input. it's computationally cheap, doesn't saturate for positive values, and is widely used in practice. it has a kink at zero where the derivative is technically undefined, but in practice this is handled by choosing either 0 or 1 for the subgradient there.
when the paper refers to architectures like MLP-64-64-Tanh, it means: a multilayer perceptron with two hidden layers each containing 64 neurons, using tanh activations. MLP-32-32-32-Tanh has three hidden layers of 32 neurons each.
Loss functions, MSE, and what convergence means
a loss function is a scalar-valued function of the network's parameters that measures how badly the network is currently performing on the task. during training you want to minimize the loss: lower loss means better predictions.
for regression problems, the standard loss function is mean squared error (MSE):
\[\mathcal{L}(\theta) = \frac{1}{n} \sum_{i=1}^n \left(g_\theta(x_i) - y_i\right)^2\]where $g_\theta(x_i)$ is the network's prediction for input $x_i$ and $y_i$ is the true target value. you square the difference for two reasons: squaring makes the loss non-negative (positive and negative errors don't cancel each other out), and squaring penalizes large errors more than small ones because the function grows quadratically.
the population version of MSE, which is what you're really trying to minimize when you have infinite data, is:
\[\mathcal{L}(\theta) = E\left[(g_\theta(X) - Y)^2\right]\]the best possible function to minimize this, taken over all functions not just neural networks, is $g^*(X) = E[Y \mid X]$, the conditional expectation of $Y$ given $X$. to see why: write $g(X) - Y = (g(X) - E[Y \mid X]) + (E[Y \mid X] - Y)$, square both sides, take expectations, and the cross term vanishes by the tower property of conditional expectation. you're left with $E[(g(X) - E[Y \mid X])^2] + E[(E[Y \mid X] - Y)^2]$. the second term is fixed, it's the conditional variance $\mathrm{Var}(Y \mid X)$ which doesn't depend on $g$. the first term is minimized to zero by setting $g = E[Y \mid X]$. so the optimal predictor is always the conditional mean, and the minimum achievable MSE is the irreducible noise variance.
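this can be checked with a quick simulation. the sketch below (distributions and the shift 0.3 are arbitrary illustration choices) estimates $E[Y \mid X]$ for a discrete $X$ and confirms that it beats another predictor, with minimum MSE equal to the noise variance:

```python
import numpy as np

rng = np.random.default_rng(1)

# Y = X^2 + eps with a discrete X, so E[Y | X] is easy to estimate
n = 200_000
x = rng.choice([-1.0, 0.0, 1.0, 2.0], size=n)
y = x ** 2 + rng.normal(scale=0.5, size=n)

# estimate the conditional mean E[Y | X = v] from the sample
cond_mean = {v: y[x == v].mean() for v in (-1.0, 0.0, 1.0, 2.0)}
pred = np.array([cond_mean[v] for v in x])

mse_cond_mean = np.mean((pred - y) ** 2)          # ~ Var(eps) = 0.25
mse_shifted = np.mean((x ** 2 + 0.3 - y) ** 2)    # any other predictor is worse

assert mse_cond_mean < mse_shifted
assert abs(mse_cond_mean - 0.25) < 0.01
```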
convergence in the CCA framework has a specific operational definition: the held-out validation loss has dropped below a threshold $\tau$. “held-out” means you evaluate the network on a separate dataset it wasn't trained on, which gives an honest estimate of generalization performance rather than just measuring how well the network memorized training examples. the minimum achievable held-out MSE is the variance of the noise $\varepsilon$ in $Y = f(X) + \varepsilon$, because $\varepsilon$ is independent of $X$ and cannot be predicted from $X$ regardless of how good the network is.
What a loss curve is and how to read it
a loss curve is a plot of the loss value on the vertical axis against the training step number on the horizontal axis. it shows the trajectory of optimization over time.
a healthy loss curve starts high when the network is randomly initialized and making essentially random predictions, then decreases as gradient descent adjusts the parameters. the decrease is typically fast early in training when the network is far from any good solution, and slows as it approaches the minimum. this pattern corresponds to the geometric decay described by the Polyak-Lojasiewicz condition: the loss decreases as roughly $(1 - 2\eta\mu)^t$ times the initial gap.
a plateau in the loss curve means gradient descent has stopped making progress. this can happen because the network converged to the minimum, or because it got stuck in a flat region or saddle point of the loss landscape. in CCA, the distinction between these two cases is exactly what carries the causal signal: the forward direction converges (loss drops below $\tau$), while the reverse direction often plateaus above $\tau$ and never reaches the threshold within the training budget.
loss curves are often plotted with a logarithmic vertical axis. when the loss spans several orders of magnitude, say from 1.0 early in training down to 0.001 near convergence, a linear axis makes the early fast descent look like a vertical drop and compresses the later dynamics into a flat-looking region near zero. a log axis turns geometric decay into a straight line, which makes it much easier to read the convergence rate and see when training has actually stalled.
Gradient descent: the algorithm that trains neural networks
gradient descent is an iterative optimization algorithm. you start with random parameters $\theta_0$ and at each step you compute the gradient of the loss with respect to the current parameters, then take a step in the direction that decreases the loss:
\[\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t)\]$\eta$ is the learning rate, a positive scalar controlling how large each step is. the gradient $\nabla_\theta \mathcal{L}$ is a vector pointing in the direction of steepest increase of the loss, so moving in the negative gradient direction moves you toward lower loss. if $\eta$ is too large you overshoot the minimum and the parameters oscillate or diverge. if $\eta$ is too small the steps are tiny and training takes very long.
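a minimal sketch of the update rule on a one-dimensional quadratic, $\mathcal{L}(\theta) = (\theta - 3)^2$ (the target value 3 and the step counts are arbitrary), showing both the convergent and the divergent regime:

```python
# minimize L(theta) = (theta - 3)^2; its gradient is 2 * (theta - 3)
def grad(theta):
    return 2.0 * (theta - 3.0)

# a moderate learning rate converges geometrically toward theta = 3
theta, eta = 0.0, 0.1
for _ in range(100):
    theta -= eta * grad(theta)
assert abs(theta - 3.0) < 1e-6

# too large a learning rate overshoots and diverges: |1 - 2 * eta| > 1
theta, eta = 0.0, 1.1
for _ in range(100):
    theta -= eta * grad(theta)
assert abs(theta - 3.0) > 1e6
```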
computing the gradient over the full training dataset at each step is exact but expensive when the dataset is large. mini-batch gradient descent uses a randomly sampled subset of size $m$ (a mini-batch) to estimate the gradient:
\[\hat{\nabla}_\theta \mathcal{L} \approx \frac{1}{m} \sum_{i \in \text{batch}} \nabla_\theta \left(g_\theta(x_i) - y_i\right)^2\]this estimate is noisy because you're only seeing a fraction of the data, but for large datasets the savings in computation per step far outweigh the noise introduced. the noise actually has some benefits: it can help escape shallow local minima that full-batch gradient descent would get stuck in.
Adam (Adaptive Moment Estimation) is a more sophisticated optimizer that maintains a running estimate of both the first moment (mean) and the second moment (uncentered variance) of the gradient for each parameter, and uses these to normalize the effective step size:
\(m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta \mathcal{L}_t\) \(v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla_\theta \mathcal{L}_t)^2\) \(\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}\)
$m_t$ is the running mean of gradients, $v_t$ is the running mean of squared gradients, and $\hat{m}_t$, $\hat{v}_t$ are bias-corrected versions. the division by $\sqrt{\hat{v}_t}$ normalizes the step size per parameter: parameters with consistently large gradients get smaller effective learning rates, parameters with small or inconsistent gradients get larger effective learning rates. this makes Adam much less sensitive to the choice of $\eta$ than plain gradient descent, and it generally converges faster in practice.
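the update can be written out in a few lines. a hand-rolled sketch on the same kind of toy quadratic ($\beta_1$, $\beta_2$, $\epsilon$ are the common defaults; the step size, loss, and iteration count are arbitrary illustration choices):

```python
import math

# minimize L(theta) = (theta - 3)^2 with a hand-rolled Adam update
eta, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    g = 2.0 * (theta - 3.0)              # gradient of the loss
    m = beta1 * m + (1 - beta1) * g      # first moment: running mean
    v = beta2 * v + (1 - beta2) * g * g  # second moment: running mean of squares
    m_hat = m / (1 - beta1 ** t)         # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    theta -= eta * m_hat / (math.sqrt(v_hat) + eps)

assert abs(theta - 3.0) < 0.25
```

note how the early steps have magnitude close to $\eta$ regardless of the gradient's scale, because the normalization $\hat{m}_t / \sqrt{\hat{v}_t}$ is roughly $\pm 1$ when gradients are consistent.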
RMSProp is a close relative of Adam (and predates it): it normalizes the learning rate by a running mean of squared gradients but keeps no first-moment term. in many settings its practical behavior is comparable to Adam's.
The Polyak-Lojasiewicz condition and why it matters for convergence rates
the Polyak-Lojasiewicz (PL) condition is a property of a loss landscape that guarantees gradient descent converges at a geometric rate. it says that if you're not at the minimum, your gradient must be large enough to make meaningful progress:
\[\|\nabla \mathcal{L}(\theta)\|^2 \geq 2\mu \left(\mathcal{L}(\theta) - \mathcal{L}^*\right)\]$\mathcal{L}^*$ is the minimum achievable loss and $\mu > 0$ is the PL constant. reading this inequality: if your current loss is far above the minimum (the right side is large), then the squared gradient norm (the left side) must also be large. large gradient means gradient descent takes a meaningful step toward the minimum. the PL constant $\mu$ controls how tight this relationship is: larger $\mu$ means the gradient is more informative about how far you are from the minimum, which means faster convergence.
PL is strictly weaker than strong convexity. a strongly convex function has a unique global minimum and the loss surface curves upward everywhere, which guarantees PL. but PL allows for non-convex loss surfaces with multiple local minima, as long as the gradient is informative near minima. this matters because neural network loss surfaces are decidedly non-convex, but recent work has shown they satisfy PL locally near good solutions.
under the PL condition, gradient descent with learning rate $\eta$ converges geometrically:
\[E[\mathcal{L}(\theta_t) - \mathcal{L}^*] \leq (1 - 2\eta\mu)^t (\mathcal{L}_0 - \mathcal{L}^*) + \frac{\eta\sigma^2}{2\mu}\]the first term is geometric decay: at step $t$, the initial gap $\mathcal{L}_0 - \mathcal{L}^*$ has been reduced by the factor $(1 - 2\eta\mu)^t$. to halve the remaining gap you need roughly $\frac{\log 2}{2\eta\mu}$ steps, and to get within $\varepsilon$ of the minimum you need $\frac{1}{2\eta\mu} \log \frac{\mathcal{L}_0 - \mathcal{L}^*}{\varepsilon}$ steps.
the second term $\frac{\eta\sigma^2}{2\mu}$ is an irreducible noise floor from mini-batch gradient estimation. $\sigma^2$ is the variance of the gradient estimator. no matter how many training steps you run, expected loss cannot go below this floor. decreasing the learning rate $\eta$ lowers this floor but also slows the geometric decay in the first term, so there is a fundamental tradeoff between speed and final accuracy.
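the geometric decay is directly checkable on a function that satisfies PL. for $\mathcal{L}(\theta) = (\theta - 3)^2$ we have $\|\nabla\mathcal{L}\|^2 = 4(\mathcal{L} - \mathcal{L}^*)$, so the PL condition holds with $\mu = 2$. a sketch (step size and iteration count arbitrary) showing that for small $\eta$ the empirical per-step decay factor matches $(1 - 2\eta\mu)$ to leading order:

```python
# L(theta) = (theta - 3)^2 satisfies PL with mu = 2, since
# ||grad L||^2 = 4 (theta - 3)^2 = 4 (L - L*) >= 2 * mu * (L - L*)
eta, mu = 0.01, 2.0
theta = 0.0
losses = [(theta - 3.0) ** 2]
for _ in range(200):
    theta -= eta * 2.0 * (theta - 3.0)
    losses.append((theta - 3.0) ** 2)

# every per-step decay factor is close to (1 - 2 * eta * mu) = 0.96
ratios = [losses[t + 1] / losses[t] for t in range(200)]
assert all(abs(r - (1 - 2 * eta * mu)) < 1e-3 for r in ratios)
```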
the key observation for CCA: if the gradient noise $\sigma^2$ is structured, meaning the noise depends on position in parameter space rather than being independent random noise, then the effective $\sigma^2$ is larger and the noise floor is higher, requiring more steps to reach any given threshold. this is exactly what happens in the reverse direction, as Lemma 2 shows.
Statistical independence vs correlation: a distinction that matters
the Pearson correlation coefficient between two random variables $X$ and $Y$ is:
\[\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}\]it ranges from $-1$ to $+1$. $+1$ means perfect positive linear relationship, $-1$ means perfect negative linear relationship, and $0$ means no linear relationship. crucially: zero correlation means no linear relationship, not no relationship of any kind.
independence is strictly stronger: $X \perp Y$ means that knowing the value of $X$ gives you literally no information about $Y$ in any sense, linear or nonlinear. formally, the joint distribution factors into the product of the marginals: $p(x, y) = p(x) \cdot p(y)$ for all values. independence implies zero correlation, but zero correlation does not imply independence.
the classic counterexample: let $X \sim \mathcal{N}(0,1)$ and $Y = X^2$. then $\mathrm{Cov}(X, Y) = E[X \cdot X^2] - E[X] \cdot E[X^2] = E[X^3] - 0 = 0$ because odd moments of a symmetric distribution around zero vanish. so $X$ and $Y$ are uncorrelated. but they are obviously not independent: if you know $Y = 4$, you know $X = \pm 2$, which is a lot of information about $X$.
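the counterexample takes a few lines to check numerically (sample size and seed arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=1_000_000)
y = x ** 2

# the sample correlation between X and Y = X^2 is essentially zero ...
assert abs(np.corrcoef(x, y)[0, 1]) < 0.02

# ... yet Y is a deterministic function of X: X^2 predicts Y perfectly,
# so the two variables are as far from independent as possible
assert abs(np.corrcoef(x ** 2, y)[0, 1] - 1.0) < 1e-6
```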
this distinction is central to the CCA proofs. the ANM assumption $\varepsilon \perp X$ is a full independence condition, not just zero correlation. and Lemma 1's conclusion that reverse residuals remain correlated with $Y$ is specifically about covariance, which is the weaker condition, so it's actually a conservative result: what's proved is that even the weakest form of dependence, linear correlation, persists in the reverse direction.
Kolmogorov complexity and minimum description length
Kolmogorov complexity $K(x)$ is the length in bits of the shortest computer program that, when run on a universal Turing machine (a theoretical model of a general-purpose computer), outputs the string $x$ and halts. it's a measure of the intrinsic information content of $x$: how compressible is $x$?
a string of a billion zeros has very low Kolmogorov complexity because the program “print zero one billion times” is short. a truly random string of a billion bits has Kolmogorov complexity close to a billion because no shorter program exists that reproduces it: the string itself is essentially its own shortest description.
$K(x)$ is not computable: there is no algorithm that takes any string $x$ and correctly outputs $K(x)$ for all strings. this follows from the halting problem. despite being uncomputable in general, Kolmogorov complexity is the theoretically correct foundation for the idea that simpler explanations are better, because it formalizes “simpler” in a precise and universal way.
Minimum Description Length (MDL), introduced by Rissanen in 1978, is the practical approximation. given a model $M$ and data $D$, the MDL principle says to prefer the model that minimizes:
\[\mathrm{MDL}(M) = |M| + |D \mid M|\]the first term is the number of bits needed to describe the model itself: more complex models cost more bits. the second term is the number of bits needed to describe the data given the model: better-fitting models compress the data more and cost fewer bits here. the optimal model balances both.
for Gaussian linear models, Rissanen showed MDL is approximately equivalent to the Bayesian Information Criterion: $\mathrm{BIC} = -2\log\hat{L} + k\log n$, where $\hat{L}$ is the maximized likelihood, $k$ is the number of parameters, and $n$ is the sample size. this makes MDL computationally tractable in the CCL framework.
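as a concrete illustration of how the BIC form of MDL penalizes complexity, the following sketch (the data-generating process and the polynomial degrees are arbitrary illustration choices) scores a linear fit against a needlessly flexible polynomial fit on data that is truly linear:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)

def bic_gaussian_poly(x, y, degree):
    # least-squares polynomial fit, scored by BIC = -2 log L_hat + k log n
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    sigma2 = np.mean(resid ** 2)  # Gaussian MLE of the noise variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = degree + 2                # polynomial coefficients plus the noise variance
    return -2 * loglik + k * np.log(n)

# degree 5 fits the training data slightly better, but the k log n
# complexity penalty outweighs the gain: BIC prefers the true linear model
assert bic_gaussian_poly(x, y, 1) < bic_gaussian_poly(x, y, 5)
```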
the connection to causality comes through the Algorithmic Markov Condition (Janzing and Scholkopf 2010): in the true causal direction $X \to Y$, the cause distribution $P(X)$ and the mechanism $P(Y \mid X)$ are generated independently by nature, so they can be described independently and compressed separately. the total description length is short. in the reverse direction, $P(X \mid Y)$ has to describe how to invert a noisy nonlinear function, which requires knowing both the forward mechanism and the noise distribution jointly, and this entanglement makes the description longer. MDL naturally prefers the direction where the description is shorter, which is the causal direction.
Directed acyclic graphs and d-separation
a directed acyclic graph (DAG) is a collection of nodes and directed edges (arrows between nodes) with the constraint that you cannot follow arrows and return to the same node. “acyclic” rules out feedback loops. in causal modeling, each node represents a variable and an arrow $X \to Y$ means $X$ is a direct cause of $Y$ in the model.
d-separation is the graphical criterion for reading conditional independence from a DAG. two nodes $A$ and $B$ are d-separated given a set $C$ of conditioning variables if every path between $A$ and $B$ is blocked by $C$. a path is any sequence of nodes connected by edges regardless of direction. a path is blocked when:
for a non-collider on the path (a node where the path goes either $\leftarrow Z \rightarrow$ or $\rightarrow Z \rightarrow$): if $Z$ is in $C$, the path is blocked.
for a collider on the path (a node where the path goes $\rightarrow Z \leftarrow$, meaning two arrows point into $Z$): if neither $Z$ nor any descendant of $Z$ is in $C$, the path is blocked. if $Z$ or a descendant is in $C$, the path is open.
the collider rule has a counterintuitive consequence: conditioning on a collider opens a path that was previously blocked and creates a spurious association between the parents of the collider. this is called collider bias or Berkson's paradox. for example: being tall and being good at basketball are both causes of being selected for a professional basketball team. among NBA players (conditioning on the collider “selected”), height and basketball skill become negatively correlated even if they're independent in the general population, because a player who is less skilled needs to be taller to have been selected, and vice versa.
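the basketball example is easy to simulate. a sketch (the selection rule and threshold are arbitrary) showing two independent variables becoming negatively correlated once you condition on their collider:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000

# height and skill are independent in the general population
height = rng.normal(size=n)
skill = rng.normal(size=n)
assert abs(np.corrcoef(height, skill)[0, 1]) < 0.01

# selection is a collider: you make the team if height + skill is large
selected = height + skill > 2.0

# conditioning on the collider induces a strong negative correlation
corr_selected = np.corrcoef(height[selected], skill[selected])[0, 1]
assert corr_selected < -0.5
```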
under the Markov condition and faithfulness, d-separation in the DAG exactly characterizes conditional independence in the probability distribution: $A$ and $B$ are conditionally independent given $C$ if and only if they are d-separated by $C$ in the graph. this is the bridge that lets you read off testable statistical implications from causal graph structure, and it's the foundation for algorithms like PC-stable that recover causal graphs from conditional independence tests on data.
What confounding is and why it corrupts observational studies
a confounder is a variable $Z$ that is a common cause of both the treatment variable $X$ and the outcome variable $Y$. the classic example: studies observe that people who carry lighters are more likely to develop lung cancer. lighters don't cause cancer. smoking causes both the habit of carrying a lighter and the development of lung cancer. smoking is the confounder. the association between lighters and cancer is entirely spurious.
mathematically, the problem is that $P(Y \mid X = x) \neq P(Y \mid \mathrm{do}(X = x))$ when there is a confounder. when you condition on $X = x$, you're selecting people who happen to have $X = x$, and they differ systematically from people with other values of $X$ in their distribution of $Z$ and everything else associated with $Z$. the observational distribution mixes the direct causal effect of $X$ on $Y$ with the indirect association through $Z$.
the backdoor adjustment formula, derived from Pearl's do-calculus, shows how to recover the true interventional distribution when all confounders are observed:
\[P(Y \mid \mathrm{do}(X = x)) = \int P(Y \mid X = x, Z = z) P(Z = z)\, dz\]you condition on $Z$ to control for confounding, but you weight by the marginal $P(Z)$ rather than the conditional $P(Z \mid X = x)$. this removes the spurious pathway. the formula only works when you've measured all relevant confounders: unmeasured hidden confounders remain a source of bias that cannot be corrected from observational data alone.
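a simulation makes the difference concrete. the sketch below (all coefficients arbitrary) builds a confounded dataset where the true causal effect of $X$ on $Y$ is 1.0, shows that the naive observational contrast is biased upward, and recovers the truth with the backdoor formula (here a discrete sum over the two values of $Z$):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000

# binary confounder Z influences both treatment X and outcome Y
z = rng.binomial(1, 0.5, size=n)
x = rng.binomial(1, 0.2 + 0.6 * z)          # Z raises the chance of treatment
y = 1.0 * x + 2.0 * z + rng.normal(size=n)  # true causal effect of X on Y is 1.0

# naive observational contrast is confounded (biased upward by Z)
naive = y[x == 1].mean() - y[x == 0].mean()
assert naive > 1.5

# backdoor adjustment: average within-stratum contrasts, weighted by P(Z = v)
adjusted = sum(
    (y[(x == 1) & (z == v)].mean() - y[(x == 0) & (z == v)].mean()) * np.mean(z == v)
    for v in (0, 1)
)
assert abs(adjusted - 1.0) < 0.05
```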
the causal IB in the CCL framework replaces $I(T; Y)$ with $I_c(Y \mid \mathrm{do}(T))$ precisely to eliminate the confounding channel: the do-operator severs every arrow into $T$, so any association that existed only through shared upstream causes disappears, and only the direct causal contribution of $T$ to $Y$ remains.
Information entropy and mutual information
Shannon entropy $H(X)$ quantifies uncertainty about a random variable. for a discrete variable:
\[H(X) = -\sum_k P(X = k) \log P(X = k)\]a fair coin has $H = -0.5\log 0.5 - 0.5\log 0.5 = \log 2 \approx 0.693$ nats using natural log, or 1 bit using log base 2. a biased coin with $P(\mathrm{heads}) = 0.9$ has $H \approx 0.325$ nats: less uncertainty because the outcome is more predictable. a deterministic outcome has $H = 0$: no uncertainty at all.
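these numbers are easy to reproduce with a small helper written straight from the definition (using the convention $0 \log 0 = 0$):

```python
import numpy as np

def entropy_nats(probs):
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]  # 0 log 0 = 0 by convention
    return -np.sum(probs * np.log(probs))

assert abs(entropy_nats([0.5, 0.5]) - np.log(2)) < 1e-12  # fair coin: ~0.693 nats
assert abs(entropy_nats([0.9, 0.1]) - 0.325) < 0.001      # biased coin
assert entropy_nats([1.0, 0.0]) == 0.0                    # deterministic outcome
```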
mutual information $I(X; Y)$ measures how much knowing $Y$ reduces your uncertainty about $X$:
\[I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)\]$H(X \mid Y)$ is the conditional entropy: the average remaining uncertainty about $X$ after you know $Y$. mutual information is zero if and only if $X$ and $Y$ are independent, meaning $Y$ tells you nothing about $X$. it equals $H(X)$ if knowing $Y$ tells you everything about $X$. it can also be written as the KL divergence between the joint distribution and the product of marginals:
\[I(X; Y) = \int\!\!\int p(x, y) \log \frac{p(x, y)}{p(x)p(y)}\, dx\, dy\]this is always non-negative and equals zero exactly when $p(x, y) = p(x)p(y)$, i.e. when $X$ and $Y$ are independent.
the information bottleneck tradeoff $\min_T I(X;T) - \beta I(T;Y)$ uses these quantities directly: $I(X;T)$ measures how much of the input $X$ is retained in the compressed representation $T$ (minimize this to compress aggressively), and $I(T;Y)$ measures how much information $T$ carries about the output $Y$ (maximize this to preserve predictive power). $\beta$ is the tradeoff parameter controlling which objective dominates.
Covariance matrices and why diagonal structure matters
when you have multiple random variables $X_1, \ldots, X_p$, their pairwise covariances form the covariance matrix:
\[\Sigma_{ij} = \mathrm{Cov}(X_i, X_j) = E[(X_i - E[X_i])(X_j - E[X_j])]\]the diagonal entries $\Sigma_{ii} = \mathrm{Var}(X_i)$ are the individual variances. off-diagonal entries measure pairwise linear co-movement. the covariance matrix is symmetric ($\Sigma_{ij} = \Sigma_{ji}$) and positive semidefinite, meaning all its eigenvalues are non-negative, which is required for it to be a valid covariance matrix (no direction in variable space can have negative variance).
a diagonal covariance matrix has zeros everywhere off the diagonal: the variables are uncorrelated with each other. this is important in the CCL framework because the VAE encoder for the causal IB uses a diagonal Gaussian posterior, meaning the components of the latent representation $T$ are modeled as uncorrelated given the input. this structure prevents the encoder from collapsing two distinct causal parents $X_i$ and $X_j$ into a single entangled latent direction. if the encoder were allowed arbitrary covariance, it could find a representation where a single latent dimension mixes the effects of two different causal variables, which would destroy the correspondence between the representation's structure and the true causal graph. the faithfulness preservation theorem depends on this constraint holding.
Mean squared error vs absolute error and why squaring is the right choice here
you could define prediction error as $\lvert g_\theta(X) - Y \rvert$ (the absolute value, L1 loss) instead of $(g_\theta(X) - Y)^2$ (MSE, L2 loss). both are valid loss functions. they produce different optimal predictors and have different mathematical properties.
the optimal MSE predictor is $E[Y \mid X]$, the conditional mean. the optimal L1 predictor is the conditional median. for symmetric noise distributions these coincide, but for skewed noise they differ.
MSE is the right choice for CCA for three reasons. first, the optimal predictor $E[Y \mid X]$ has a clean connection to the ANM structure: under $Y = f(X) + \varepsilon$ with $\varepsilon \perp X$, the conditional mean is exactly $f(X)$, and the minimum MSE is exactly $\sigma_\varepsilon^2$. second, the law of total variance decomposes MSE exactly into a signal component and a noise component, which is the decomposition used in Lemma 2. third, the gradient of MSE is linear in the residual, which makes the analysis of gradient variance and the PL condition tractable.
What z-scoring does
z-scoring, also called standardizing, transforms a variable $X$ by subtracting its mean and dividing by its standard deviation:
\[\tilde{X} = \frac{X - \mu_X}{\sigma_X}\]after this transformation, $E[\tilde{X}] = 0$ and $\mathrm{Var}(\tilde{X}) = 1$. the transformed variable is dimensionless. this matters enormously for CCA because the convergence threshold $\tau$ is set in terms of the prediction target's variance, and both networks need to be held to the same standard.
to see why this fails without z-scoring, consider $Y = X^3 + \varepsilon$ with $X \sim \mathcal{N}(0,1)$ and small noise. the variance of $X^3$ can be computed using the sixth moment of the standard normal: $E[X^6] = 5!! = 15$, so $\mathrm{Var}(X^3) \approx 15$. if the convergence threshold is $\tau = 0.05$, the forward network (predicting $Y$ from $X$) needs to reduce its loss below 0.05 in a space where the target has variance 15: that means explaining 99.7% of $Y$'s variance. the reverse network (predicting $X$ from $Y$) works in a space where the target has variance 1: it only needs to explain 95% of $X$'s variance. the reverse network has an asymmetrically easier task purely from scale differences, not from any causal structure, and CCA will wrongly predict $Y \to X$.
after z-scoring both variables to unit variance, $\tau = 0.05$ means “predict 95% of the variance” for both networks. the comparison is apples-to-apples, and the causal asymmetry becomes the dominant signal.
Gaussian moments and how to compute them
for a standard normal $X \sim \mathcal{N}(0, 1)$, the moments follow a clean pattern. odd moments are all zero because the distribution is symmetric around zero: $E[X^k] = 0$ for any odd $k$. even moments use the double factorial formula:
\[E[X^{2k}] = (2k-1)!! = (2k-1) \times (2k-3) \times \cdots \times 3 \times 1\]where $!!$ denotes the double factorial (the product of every other integer down to 1). concretely: $E[X^2] = 1$, $E[X^4] = 3$, $E[X^6] = 15$, $E[X^8] = 105$.
the double factorial formula comes from Wick's theorem (also known as the Isserlis theorem) in probability theory: for a centered Gaussian, the expectation of any product of variables equals the sum over all ways to pair them up of the product of pairwise expectations. for $X^6 = X \cdot X \cdot X \cdot X \cdot X \cdot X$, the number of ways to pair six identical terms is $\frac{6!}{2^3 \cdot 3!} = 15$, and each pairing contributes $E[X^2]^3 = 1$, giving $E[X^6] = 15$.
this is used directly in the z-scoring analysis: $\mathrm{Var}(X^3) = E[X^6] - (E[X^3])^2 = 15 - 0 = 15$, establishing the scale mismatch problem for the cubic DGP without standardization.
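the moment formulas are easy to verify by simulation (numpy sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(2_000_000)

for k in (2, 4, 6, 8):
    print(k, round(np.mean(x ** k), 2))   # approaches (k-1)!!: 1, 3, 15, 105

print(round(np.var(x ** 3), 1))           # ≈ 15, the scale mismatch above
```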
Lipschitz functions and why smooth mechanisms matter
a function $f: \mathbb{R} \to \mathbb{R}$ is Lipschitz continuous with constant $L$ if:
\[\lvert f(x) - f(x') \rvert \leq L\lvert x - x' \rvert \quad \text{for all } x, x'\]geometrically: the slope of $f$ never exceeds $L$ in magnitude anywhere. $\sin(x)$ is globally Lipschitz with constant 1 because $\lvert\cos(x)\rvert \leq 1$ everywhere. a function like $x^3$ is not globally Lipschitz because $\lvert(x^3)'\rvert = 3x^2$ grows without bound as $x \to \infty$, but it is Lipschitz on any bounded interval.
the Lipschitz condition on the mechanism $f$ in the ANM serves two roles in the proofs. first, it bounds how fast the forward regression target $f(X)$ can change as $X$ changes, which controls the variance of gradient estimates for the forward network: if $f$ is bounded in slope, the gradient of the forward loss at each data point doesn't blow up for large inputs. second, the Lipschitz constant of $f$ appears in bounding the PL constant $\mu$ for the forward objective, which determines the geometric convergence rate in Lemma 3.
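a finite-difference check of the two examples above (numpy sketch; the grid resolution is an arbitrary choice):

```python
import numpy as np

def max_slope(f, lo, hi, n=200_001):
    """Largest finite-difference slope of f on [lo, hi]: a numerical
    estimate of the Lipschitz constant on that interval."""
    xs = np.linspace(lo, hi, n)
    ys = f(xs)
    return np.max(np.abs(np.diff(ys) / np.diff(xs)))

print(max_slope(np.sin, -10, 10))            # ≈ 1: globally Lipschitz
print(max_slope(lambda v: v ** 3, -3, 3))    # ≈ 27 = 3 * 3^2
print(max_slope(lambda v: v ** 3, -10, 10))  # ≈ 300: grows with the interval
```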
Random variables, probability distributions, and the Gaussian
a random variable is a quantity whose value is determined by a random process. mathematically, it's a function from outcomes of a random experiment to numbers. rolling a die produces a discrete random variable that takes values in $\{1, 2, 3, 4, 5, 6\}$ each with probability $1/6$. measuring someone's height produces a continuous random variable.
a probability distribution describes the likelihood of each possible value. for discrete variables, you list $P(X = k)$ for each $k$. for continuous variables, you work with a probability density function $p(x)$ satisfying $p(x) \geq 0$ and $\int_{-\infty}^{\infty} p(x)\, dx = 1$. the probability of landing in any interval $[a, b]$ is $\int_a^b p(x)\, dx$.
the Gaussian (normal) distribution is the most important continuous distribution in statistics. its density is:
\[p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)\]$\mu$ is the mean (center of the bell) and $\sigma^2$ is the variance (width of the bell). the notation $X \sim \mathcal{N}(\mu, \sigma^2)$ means $X$ is drawn from this distribution. the standard normal is $\mathcal{N}(0, 1)$.
the Gaussian is ubiquitous because of the Central Limit Theorem: the average of many independent random variables, regardless of their individual distributions, converges to a Gaussian as the number of variables grows. noise in measurements is usually a sum of many small independent perturbations, so Gaussian noise is a physically natural model.
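a one-line simulation of the CLT (numpy sketch): averages of uniform draws, which are individually far from Gaussian, concentrate into a bell with the predicted width.

```python
import numpy as np

rng = np.random.default_rng(3)
# average 100 uniform draws, repeated 100_000 times
means = rng.uniform(0, 1, size=(100_000, 100)).mean(axis=1)

print(round(means.mean(), 3))  # ≈ 0.5
print(round(means.std(), 4))   # ≈ sqrt(1/12)/sqrt(100) ≈ 0.0289
# fraction within two standard deviations, close to the Gaussian 95.4%
print(round(np.mean(np.abs(means - 0.5) < 2 * means.std()), 2))
```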
Expected value, variance, covariance, and the law of total variance
the expected value $E[X]$ is the probability-weighted average over all possible values. for continuous $X$: $E[X] = \int x \cdot p(x)\, dx$. expectation is linear: $E[aX + b] = aE[X] + b$ and $E[X + Y] = E[X] + E[Y]$ whether or not $X$ and $Y$ are independent.
variance measures spread around the mean:
\[\mathrm{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2\]the second equality follows from expanding the square and using linearity. variance is always non-negative. zero variance means $X$ is a constant. the standard deviation $\sigma_X = \sqrt{\mathrm{Var}(X)}$ is variance expressed in the original units of $X$.
covariance measures linear co-movement:
\[\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]\]the second equality is used directly in Lemma 1. when $X \perp Y$, $E[XY] = E[X]E[Y]$, so $\mathrm{Cov}(X, Y) = 0$.
the law of total variance is used in Lemma 2:
\[\mathrm{Var}(X) = E[\mathrm{Var}(X \mid Y)] + \mathrm{Var}(E[X \mid Y])\]this decomposes total variance into two pieces: the average within-group variance $E[\mathrm{Var}(X \mid Y)]$ (how uncertain you are about $X$ on average after conditioning on $Y$), and the variance of the group means $\mathrm{Var}(E[X \mid Y])$ (how much knowing $Y$ moves your prediction of $X$). these two components are non-negative and sum to the total. after z-scoring $X$ to unit variance, the minimum achievable reverse MSE is $E[\mathrm{Var}(X \mid Y)] = 1 - \mathrm{Var}(E[X \mid Y])$, which is strictly less than 1 whenever $X$ and $Y$ are dependent.
Conditional expectation and the tower property
$E[X \mid Y = y]$ is the mean of $X$ restricted to cases where $Y = y$: for continuous variables, $E[X \mid Y = y] = \int x \cdot p(x \mid Y = y)\, dx$. it's a function of $y$.
$E[X \mid Y]$ without fixing $y$ is the random variable you get by plugging in the actual random $Y$: for each realization of $Y$, it gives the conditional mean of $X$ at that value.
the tower property (law of iterated expectations): $E[E[X \mid Y]] = E[X]$. averaging the conditional expectation over all values of $Y$ gives the unconditional expectation. proof: $E[E[X \mid Y]] = \int \left(\int x \cdot p(x \mid y)\, dx\right) p(y)\, dy = \int \int x \cdot p(x, y)\, dx\, dy = E[X]$.
the consequence used in Lemma 1: for any function $g$, $E[g(Y) \cdot (X - E[X \mid Y])] = 0$. this is because conditioning on $Y = y$ and taking expectations gives $g(y) \cdot E[X - E[X \mid Y] \mid Y = y] = g(y) \cdot 0 = 0$, and averaging over $Y$ gives zero. in particular, $\mathrm{Cov}(g(Y), X - E[X \mid Y]) = 0$ for any function $g$: the optimal reverse predictor $h^*(Y) = E[X \mid Y]$ has residuals $X - h^*(Y)$ that are uncorrelated with any function of $Y$, including $Y$ itself.
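this orthogonality is easy to see numerically in a case where $E[X \mid Y]$ has a closed form: for a jointly Gaussian pair the conditional mean is linear (a standard fact, not specific to the paper). a numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000
x = rng.standard_normal(n)
y = x + rng.normal(0.0, 0.5, n)   # jointly Gaussian pair
e_x_given_y = 0.8 * y             # E[X|Y] = Cov(X,Y)/Var(Y) * Y = 1/1.25 * Y
r = x - e_x_given_y               # residual of the optimal predictor

print(round(np.cov(r, y)[0, 1], 3))          # ≈ 0
print(round(np.cov(r, np.sin(y))[0, 1], 3))  # ≈ 0: uncorrelated with any g(Y)
print(round(np.cov(r, y ** 3)[0, 1], 3))     # ≈ 0
```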
What independence means
two random variables $X$ and $Y$ are independent if knowing the value of one gives you zero information about the other. formally: $P(X = x, Y = y) = P(X = x) \cdot P(Y = y)$ for all $x$ and $y$, meaning the joint distribution factors into the product of the marginals.
this is stricter than zero covariance. $\mathrm{Cov}(X, Y) = 0$ means no linear relationship, but $X$ and $Y$ could have zero covariance while still being dependent in a nonlinear way. let $X \sim \mathcal{N}(0,1)$ and $Y = X^2$: then $E[XY] = E[X^3] = 0$ by symmetry, so $\mathrm{Cov}(X, Y) = 0$, but knowing $Y = 4$ tells you $X = \pm 2$, which is clearly not zero information about $X$.
independence implies $\mathrm{Cov}(X, Y) = 0$, but not the other way around. the ANM uses noise independence $\varepsilon \perp X$ in the stronger sense: the full joint distribution factors. this is why forward residuals converging to uncorrelated noise is a clean story: uncorrelated noise that is also fully independent means structureless noise with no relationship to the input in any sense.
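the $Y = X^2$ example from above, run numerically (numpy sketch): covariance vanishes while a simple nonlinear statistic exposes the dependence.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.standard_normal(1_000_000)
y = x ** 2

print(round(np.cov(x, y)[0, 1], 2))               # ≈ 0: no linear relationship
print(round(np.corrcoef(np.abs(x), y)[0, 1], 2))  # ≈ 0.94: strong dependence
```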
Derivatives, gradients, and the Jacobian
a derivative $f'(x)$ measures the instantaneous rate of change of $f$ at $x$: how much does the output change per unit change in the input? for $f(x) = x^3$, $f'(x) = 3x^2$: the function is changing at rate $3x^2$ at input $x$.
for a function of multiple inputs $L(w_1, w_2, \ldots)$, the partial derivative $\frac{\partial L}{\partial w_1}$ is the rate of change with respect to $w_1$ while holding all other inputs fixed. the gradient $\nabla L$ is the vector of all partial derivatives:
\[\nabla L(w_1, w_2, \ldots) = \left(\frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \ldots\right)\]the gradient points in the direction of steepest increase of $L$. to minimize $L$, you move in the direction $-\nabla L$.
the gradient norm $\lVert \nabla_\theta \mathcal{L} \rVert = \sqrt{\sum_i (\partial \mathcal{L} / \partial \theta_i)^2}$ measures how steeply the loss is changing with respect to parameters. large norm means you're far from flat ground, small norm means you're near a critical point.
the Jacobian generalizes the gradient to vector-valued functions: for $g: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian $J$ is the $m \times n$ matrix $J_{ij} = \partial g_i / \partial x_j$. for a scalar-output neural network $g_\theta: \mathbb{R} \to \mathbb{R}$, the Jacobian with respect to the parameters $\theta$ is just $\nabla_\theta g_\theta$. the Jacobian norm $\lVert J \rVert$ measures how much a small change in inputs stretches the outputs. the gradient variance analysis in Lemma 2 depends critically on Jacobian norms: for the cubic DGP $Y = X^3$, the Jacobian of the target function is $f'(x) = 3x^2$, which can be large for large $\lvert x \rvert$, and this competes with the residual correlation effect in determining which network has larger gradient variance.
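a minimal central-difference Jacobian (numpy sketch; the example function is hypothetical, chosen so each entry is easy to check by hand):

```python
import numpy as np

def jacobian_fd(g, x, h=1e-6):
    """Central-difference Jacobian of g: R^n -> R^m at the point x."""
    m = len(g(x))
    J = np.zeros((m, len(x)))
    for j in range(len(x)):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (g(x + e) - g(x - e)) / (2 * h)
    return J

g = lambda v: np.array([v[0] ** 3, np.sin(v[1])])
print(jacobian_fd(g, np.array([2.0, 0.0])))
# [[12, 0], [0, 1]]: entry (i, j) is d g_i / d x_j
```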
The additive noise model: the framework and why injectivity is non-negotiable
CCA operates within the Additive Noise Model (ANM) framework. the data generating process is:
\[Y = f(X) + \varepsilon\]$Y$ is determined by applying mechanism $f$ to $X$, then adding noise $\varepsilon$ that is statistically independent of $X$. two variables, one mechanism, one independent noise term.
the history: Hoyer, Janzing, Mooij, Peters, and Scholkopf introduced nonlinear ANMs at NeurIPS 2008 and showed that for nonlinear $f$, generic models generate joint distributions that don't admit a valid ANM in the reverse direction. Peters, Mooij, Janzing, and Scholkopf then provided the full formal identifiability proof in JMLR 2014: under the ANM with nonlinear $f$ and $\varepsilon \perp X$, the causal direction is identifiable from the joint distribution almost everywhere. the result is that you almost always cannot write an equivalent ANM $X = g(Y) + \eta$ with $\eta \perp Y$: the reverse ANM doesn't exist except on a measure-zero set of mechanisms.
CCA requires three conditions on the ANM. first, $f$ must be nonlinear: for linear $f$, the forward and reverse problems are mirror images (proved by Hoyer et al. 2009, confirmed by 0/30 accuracy on the linear Gaussian experiments). second, $f$ must be Lipschitz continuous: $\lvert f(x) - f(x') \rvert \leq L_f \lvert x - x' \rvert$ for some finite constant $L_f$. third and most consequential, $f$ must be injective: $f(x_1) = f(x_2) \implies x_1 = x_2$, meaning distinct inputs always produce distinct outputs before noise is added.
the noise independence $\varepsilon \perp X$ is the structural encoding of causal direction. in the true causal direction, the noise that perturbs $Y$ is generated by the world independently of what $X$ is. once $Y = f(X) + \varepsilon$ is formed, the information about which specific $X$ and $\varepsilon$ produced this $Y$ is entangled inside $Y$. trying to write $X = g(Y) + \eta$ with $\eta \perp Y$ is trying to disentangle something that got mixed together by the forward process, and for injective nonlinear $f$ this disentanglement is provably impossible except in degenerate cases.
why injectivity matters concretely: consider $Y = X^2 + \varepsilon$ with $X \sim \mathcal{N}(0,1)$. $f(x) = x^2$ maps both $x = 1$ and $x = -1$ to $f = 1$. because the input distribution is symmetric around zero and $f$ folds both sides to the same value, the optimal reverse predictor is $E[X \mid Y = y] = 0$ for all $y$ (by symmetry, knowing $Y$ tells you $\lvert X \rvert \approx \sqrt{y}$ but not the sign, so the mean is zero). a neural network learning to predict $X$ from $Y$ converges to outputting zero everywhere in under 25 training steps, because predicting the constant zero is the optimal solution for this symmetric non-injective problem. the forward network meanwhile needs hundreds of steps to learn the curved function $x \mapsto x^2$. CCA sees $T_\mathrm{fwd} \approx 1000$ and $T_\mathrm{rev} \approx 10$, scores strongly positive, and predicts $Y \to X$, the wrong direction. this failure is completely predicted by the theory and confirmed 0/10 in the experiments.
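the collapse of the reverse target can be seen directly by binning $Y$ and averaging $X$ within each bin (numpy sketch matching the DGP above; bin edges and counts are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500_000
x = rng.standard_normal(n)
y = x ** 2 + rng.normal(0.0, 0.1, n)   # non-injective mechanism

# empirical E[X | Y in bin]: near zero in every bin, by the +/- symmetry
edges = np.linspace(0.0, 4.0, 21)
idx = np.digitize(y, edges)
cond_means = [x[idx == k].mean() for k in range(1, len(edges)) if (idx == k).sum() > 1000]
print(round(max(abs(m) for m in cond_means), 2))  # small: no bin mean is far from 0

# whereas the magnitude |X| is recoverable from Y:
print(round(np.corrcoef(np.abs(x), y)[0, 1], 2))  # far from 0
```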
What existed before CCA and where each method breaks
three major families of methods existed before CCA. understanding where each fails helps place the new contribution.
RESIT (regression with subsequent independence test), Peters et al. 2014. fit a nonlinear regression of $Y$ on $X$ to get $\hat{f}$, compute residuals $\hat{\varepsilon}_i = Y_i - \hat{f}(X_i)$, and test whether the residuals are statistically independent of $X$ using HSIC (the Hilbert-Schmidt Independence Criterion, a kernel-based test for any kind of dependence). in the true causal direction $X \to Y$, residuals converge to $\varepsilon$ as the fit improves, and $\varepsilon \perp X$ by the ANM assumption, so the test should pass. in the reverse direction $Y \to X$, the residuals retain structure, so the test should fail. you run RESIT in both directions and pick the direction where independence holds.
where it breaks: Peters et al. proved this themselves in Proposition 23 of their 2014 paper. when $f$ is not injective, both directions can produce residuals that appear approximately independent of the input by symmetry, and the test gives no usable signal. on the Tubingen cause-effect pairs benchmark, RESIT achieves 63% accuracy, which is 9 percentage points below the trivial majority-class baseline of 72.2%. that is not a criticism of the researchers, it means real-world mechanisms don't always satisfy the identifiability assumptions cleanly.
IGCI (information geometric causal inference), Janzing and Scholkopf 2010. this method operates at the level of Kolmogorov complexity. the Algorithmic Markov Condition states that in the true causal direction, the total Kolmogorov complexity of the joint distribution is smaller than in the reverse direction: $K(P(X)) + K(P(Y \mid X)) < K(P(Y)) + K(P(X \mid Y))$. IGCI approximates this using entropy-based proxies (since Kolmogorov complexity is uncomputable) and scores each direction by which has the shorter approximate description. the signal can be weak for many functional forms, and the entropy approximations lose accuracy for distributions far from uniform or Gaussian. IGCI achieves roughly 60% on Tubingen.
SkewScore, Lin et al. 2025 (ICLR). this method uses the score function $s(x) = \nabla_x \log p(x)$, which is the gradient of the log-density of $x$, and analyzes the skewness of this function as a causal direction signal. it's specifically designed for heteroscedastic noise, meaning noise whose variance depends on the input: $Y = f(X) + \sigma(X)\varepsilon$. in the heteroscedastic case, the score function of the anticausal distribution has detectably different skewness properties than the causal one. SkewScore handles cases where RESIT fails due to heteroscedasticity, operating on distributional statistics rather than residual independence tests.
CCA uses none of these signals. it measures how many gradient steps a neural network needs to reach a target validation loss. the proof that this signal exists and is reliable comes from first principles in optimization theory, not from pattern-matching in benchmark performance.
The three lemmas and the asymmetry theorem
Lemma 1: reverse residuals remain correlated with the input for any finite-capacity approximation.
in the forward direction, network $g_\theta$ learns to predict $Y$ from $X$. as training progresses and $g_\theta \to f$, the residuals are $R_\mathrm{fwd} = Y - g_\theta(X) = f(X) + \varepsilon - g_\theta(X) \to \varepsilon$. since $\varepsilon \perp X$ by the ANM assumption, forward residuals become uncorrelated with the input as the network approaches the true function.
in the reverse direction, network $h_\phi$ learns to predict $X$ from $Y$. the population-optimal predictor is $h^*(Y) = E[X \mid Y]$. for any finite-capacity network $h_\phi \neq h^*$, define the approximation error $\delta_\phi(Y) = h_\phi(Y) - h^*(Y)$. the residual decomposes as:
\[R_\mathrm{rev} = X - h_\phi(Y) = (X - h^*(Y)) - \delta_\phi(Y) = R^* - \delta_\phi(Y)\]by the tower property, $E[g(Y) \cdot R^*] = E[g(Y) \cdot (X - E[X \mid Y])] = 0$ for any function $g$, so $\mathrm{Cov}(R^*, Y) = 0$. therefore:
\[\mathrm{Cov}(R_\mathrm{rev}, Y) = \mathrm{Cov}(R^*, Y) - \mathrm{Cov}(\delta_\phi(Y), Y) = -\mathrm{Cov}(\delta_\phi(Y), Y)\]since $h_\phi \neq h^*$, the approximation error $\delta_\phi(Y)$ is a non-constant function of $Y$. under the full-support condition on $P(X)$, any non-constant measurable function of $Y$ has nonzero covariance with $Y$. therefore $\mathrm{Cov}(R_\mathrm{rev}, Y) \neq 0$ for any finite approximation.
this is not a transient property of early training that goes away as the network improves. it is a permanent structural property of the reverse regression problem. the residuals in the reverse direction are correlated with the input at every stage of training, because any approximation error you make is a function of $Y$, and functions of $Y$ are correlated with $Y$.
Lemma 2: that persistent correlation creates two specific problems in the optimization landscape.
the first effect is a higher irreducible loss floor. after z-scoring both variables:
\[\mathcal{L}^*_\mathrm{fwd} = \sigma_\varepsilon^2 = \mathrm{Var}(Y) - \mathrm{Var}(f(X)) < 1\] \[\mathcal{L}^*_\mathrm{rev} = E[\mathrm{Var}(X \mid Y)] = 1 - \mathrm{Var}(E[X \mid Y])\]since $X$ and $Y$ are dependent, $E[X \mid Y]$ is non-constant, so $\mathrm{Var}(E[X \mid Y]) > 0$, so $\mathcal{L}^*_\mathrm{rev} < 1$. both minima are below 1. but for nonlinear injective $f$, the reverse minimum is heteroscedastic: $\mathrm{Var}(X \mid Y = y)$ varies with $y$ because the mechanism has uneven curvature across its domain. in regions where $f$ has high curvature, nearby $X$ values produce similar $Y$ values and the conditional variance is high: knowing $Y$ doesn't narrow down $X$ much. in regions where $f$ has large derivative and behaves nearly monotonically, each $Y$ value pins down $X$ precisely and the conditional variance is low. this non-uniform noise floor means the reverse network faces a landscape where it can't achieve uniformly small residuals everywhere simultaneously.
the second effect is non-separable gradient covariance. for mini-batch gradient estimates to average down as $O(1/m)$ with batch size $m$, the squared residual and the squared gradient norm need to be uncorrelated across data points. in the forward direction, as the network approaches $f$, residuals converge to $\varepsilon$ which is independent of $X$, while gradients $\nabla_\theta g_\theta(X)$ depend on $X$. independence of $\varepsilon$ from $X$ means residual magnitude is uncorrelated with gradient magnitude, factorization holds, and batch estimates average down cleanly.
in the reverse direction, $R_\mathrm{rev}$ is correlated with $Y$ throughout training by Lemma 1, and $\nabla_\phi h_\phi(Y)$ also depends on $Y$, so:
\[\mathrm{Cov}(R_\mathrm{rev}^2, \|\nabla_\phi h_\phi(Y)\|^2) \neq 0\]factorization fails. the gradient noise has a structural floor that persists regardless of batch size, because increasing the batch size doesn't remove the correlation between residuals and gradients, it just gives you more samples from the same correlated distribution.
there is a counterintuitive result in the gradient variance experiments that needs addressing directly. measuring the reverse-to-forward gradient variance ratio at a single training step gives:
| DGP | ratio at initialization | ratio at mid-training |
|---|---|---|
| $Y = X^3 + \varepsilon$ (z-scored) | 0.376 | 0.054 |
| $Y = X^3 + \varepsilon$ (not z-scored) | 0.003 | 0.000 |
| $Y = \sin(X) + \varepsilon$ | 1.161 | 2.179 |
| $Y = e^{0.5X} + \varepsilon$ | 0.684 | 0.276 |
| $Y = X^2 + \varepsilon$ | 0.730 | 0.608 |
for the cubic DGP, the ratio is below 1. this seems to say the reverse direction has smaller gradient variance, which appears to contradict Lemma 2. it does not. Lemma 2 is a claim about landscape structure: the reverse minimum is higher and the gradient noise floor is structural. it is not a claim that instantaneous gradient variance is larger in the reverse direction at every training step.
the explanation is a competing mechanism. the forward network is learning $x \mapsto x^3$, a steeply nonlinear function with Jacobian $f'(x) = 3x^2$ which grows without bound. large Jacobian norms produce large gradient variance. the reverse network is fitting $E[X \mid Y] \approx 0$, a near-constant function for a symmetric input distribution, which has small Jacobian norms and therefore small gradient variance. so instantaneous gradient variance is dominated by Jacobian magnitude rather than the residual correlation structure for this specific DGP.
the sine DGP clarifies the picture: the ratio grows from 1.161 at initialization to 2.179 at mid-training. for the sine, $E[X \mid Y]$ requires averaging over multiple branches of the inverse sine, producing a non-trivial multi-valued function with genuinely large Jacobians in the reverse network. the non-separable covariance then dominates over the Jacobian effect, and the ratio exceeds 1.
which mechanism dominates depends on the specific DGP. the correct empirical proxy for landscape difficulty is not instantaneous gradient variance at one step, it is convergence time: how many steps are needed to reach the threshold. the 19-fold gap on the cubic (forward at step 161, reverse capped at 3000) is the real signal, and no single gradient variance snapshot captures it.
Lemma 3: harder optimization landscape means more expected steps to converge.
under the PL condition, gradient descent converges geometrically and the step count to reach threshold $\tau$ satisfies:
\[E[T] \geq \frac{1}{2\eta\mu} \log \frac{\mathcal{L}_0 - \mathcal{L}^*}{\tau - \mathcal{L}^* - \frac{\eta\sigma^2}{2\mu}}\]two conditions generate a step count gap between forward and reverse. condition (a): if the threshold $\tau$ falls between the two minimum losses, $\mathcal{L}^*_\mathrm{fwd} < \tau < \mathcal{L}^*_\mathrm{rev}$, then the forward network can reach $\tau$ but the reverse network never can, regardless of training time. the reverse network’s noise floor lies above $\tau$ and $T_\mathrm{rev} = T_\mathrm{max}$. condition (b): when $\tau$ is above both minima, the reverse network still needs more steps because it starts with a larger effective gap (since $\mathcal{L}^*_\mathrm{rev} > \mathcal{L}^*_\mathrm{fwd}$) and the non-separable gradient covariance inflates the effective $\sigma^2$ in the noise floor, pushing the optimizer away from $\tau$ more persistently.
the theoretical lower bound on the gap is $\Omega(1/\eta\mu)$. empirically the gap is much larger: on $Y = X^3 + \varepsilon$ with z-scoring, forward converges at step 161 and reverse doesn’t converge within 3000 steps, a roughly 19-fold difference. saddle points and flat regions in the reverse landscape, which local PL analysis doesn’t capture globally, amplify the gap substantially.
the CCA asymmetry theorem. under the ANM with nonlinear injective Lipschitz $f$, $\varepsilon \perp X$, and z-scored variables, if both objectives satisfy the PL condition locally near their minima:
\[E[T_\mathrm{fwd}] < E[T_\mathrm{rev}]\]proof chain: Lemma 1 establishes that reverse residuals remain correlated with the input for any finite approximation. Lemma 2 shows this correlation creates a higher irreducible loss floor and non-separable gradient covariance in the reverse direction, making that optimization landscape structurally harder. Lemma 3 shows that a harder landscape, formalized through the PL condition, requires more expected gradient steps to reach any fixed threshold. composing these three: forward converges in strictly fewer expected steps than reverse. $\square$
z-scoring: the preprocessing step that determines whether the method works at all
z-scoring is required for CCA, and the experiments show what happens without it.
the problem is scale. consider $Y = X^3 + \varepsilon$ with $X \sim \mathcal{N}(0,1)$ and small $\varepsilon$. $\mathrm{Var}(Y) \approx \mathrm{Var}(X^3) = E[X^6] - (E[X^3])^2 = 15 - 0 = 15$. set convergence threshold $\tau = 0.05$ in terms of absolute MSE:
- forward network predicts $Y$: needs MSE below 0.05 where target variance is 15, meaning it must explain $\frac{15 - 0.05}{15} \approx 99.7\%$ of $Y$'s variance. an enormously difficult task.
- reverse network predicts $X$: needs MSE below 0.05 where target variance is 1, meaning it must explain 95% of $X$'s variance. a much easier task.
the reverse network wins not because of any causal signal but because its task is scaled fifteen times easier. this is pure scale artifact.
after z-scoring both variables to unit variance, $\tau = 0.05$ means “explain 95% of the variance” for both networks, and the comparison is symmetric. the causal asymmetry drives the result instead of scale.
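the 15x scale gap is one simulation away (numpy sketch of the cubic DGP above):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 1_000_000
x = rng.standard_normal(n)
y = x ** 3 + rng.normal(0.0, 0.1, n)

print(round(np.var(y), 1))   # ≈ 15: the raw forward-target scale

# z-scoring puts both prediction targets on unit variance
z = lambda v: (v - v.mean()) / v.std()
print(round(np.var(z(x)), 6), round(np.var(z(y)), 6))  # 1.0 1.0
```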
the numbers from 30 seeds on the cubic DGP:
| | without z-scoring | with z-scoring |
|---|---|---|
| mean $T_\mathrm{fwd}$ | $1152.5 \pm 904.0$ | $323 \pm 531$ |
| mean $T_\mathrm{rev}$ | $468.8 \pm 860.4$ | $717 \pm 789$ |
| correct predictions | 6/30 | 26/30 |
without z-scoring: 6/30 correct, worse than random. with z-scoring: 26/30. the scale artifact is so strong that it reliably produces wrong answers without standardization.
after z-scoring, the minimum losses become dimensionless fractions of unit variance:
\[\mathcal{L}^*_\mathrm{fwd} = \frac{\sigma_\varepsilon^2}{\mathrm{Var}(Y)} \qquad \mathcal{L}^*_\mathrm{rev} = \frac{E[\mathrm{Var}(X \mid Y)]}{\mathrm{Var}(X)}\]both are now comparable quantities, and the structural asymmetry from Lemmas 1 and 2 becomes the dominant signal.
The CCA algorithm
two real-valued variables $X$ and $Y$, both z-scored. two MLPs initialized with the same architecture, optimizer, hyperparameters, and random seed. one network $g_\theta$ predicts $Y$ from $X$, one network $h_\phi$ predicts $X$ from $Y$. both train via mini-batch gradient descent with MSE loss, learning rate $\eta$, batch size $m$, with parameters initialized so $\lVert\theta\rVert_2, \lVert\phi\rVert_2 \leq B$.
training stops when held-out validation MSE drops below threshold $\tau > 0$, capped at $T_\mathrm{max}$ steps if the threshold is never reached. write $T_\mathrm{fwd}$ for the step count of the forward network and $T_\mathrm{rev}$ for the reverse. if a network doesn't converge, its step count is $T_\mathrm{max}$.
the CCA score for the edge $X \to Y$:
\[\mathrm{CCA}(X \to Y) = T_\mathrm{fwd} - T_\mathrm{rev}\]negative score means the forward direction converged faster, so predict $X \to Y$. positive score means the reverse converged faster, so predict $Y \to X$. the magnitude of the score reflects confidence: a score of $-2839$ is a strong signal, a score of $-12$ is weak.
for a causal graph over multiple variables, you run bivariate CCA on each candidate edge and sum:
\[\mathrm{CCA}(G) = \sum_{(i,j) \in G} \left[T_\mathrm{fwd}^{(i,j)} - T_\mathrm{rev}^{(i,j)}\right]\]this aggregate score feeds into the XGES graph search algorithm as a bonus on top of the MDL score for each candidate edge orientation.
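a toy sketch of the bivariate scoring rule. this substitutes small polynomial regressors trained by full-batch gradient descent for the paper's MLPs, and uses the sine DGP; every choice here (feature degree, learning rate, threshold, step cap) is an illustrative assumption, not the paper's configuration.

```python
import numpy as np

def features(v, deg=5):
    """Centered polynomial features of v, orthonormalized so their empirical
    covariance is the identity (keeps gradient descent well-conditioned)."""
    F = np.stack([v ** d for d in range(1, deg + 1)], axis=1)
    Q, _ = np.linalg.qr(F - F.mean(axis=0))
    return Q * np.sqrt(len(v))

def convergence_time(inp, tgt, tau=0.05, t_max=1000, lr=0.25, seed=0):
    """Steps of full-batch gradient descent on MSE until the loss drops
    below tau; returns t_max if the threshold is never reached."""
    rng = np.random.default_rng(seed)
    F = features(inp)
    w = rng.normal(0.0, 0.01, F.shape[1])
    for t in range(t_max):
        resid = F @ w - tgt
        if np.mean(resid ** 2) < tau:
            return t
        w -= lr * 2.0 / len(tgt) * (F.T @ resid)   # gradient of MSE
    return t_max

rng = np.random.default_rng(0)
x = rng.standard_normal(2000)
y = np.sin(x) + rng.normal(0.0, 0.1, 2000)
zx, zy = (x - x.mean()) / x.std(), (y - y.mean()) / y.std()

t_fwd = convergence_time(zx, zy)    # predict effect from cause
t_rev = convergence_time(zy, zx)    # predict cause from effect
print(t_fwd, t_rev, t_fwd - t_rev)  # negative score: predict X -> Y
```

the forward fit reaches the threshold in a handful of steps, while the reverse fit stalls at its higher conditional-variance floor and hits the cap, reproducing the sign (though none of the specifics) of the paper's sine experiments.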
The canonical experiment
$Y = X^3 + \varepsilon$ with $X \sim \mathcal{N}(0,1)$, $\varepsilon \sim \mathcal{N}(0, 0.1)$, $n = 500$ samples, both variables z-scored. MLP-64-64-Tanh, Adam optimizer at learning rate 0.001, $\tau = 0.05$, $T_\mathrm{max} = 3000$.
seed 0: the forward network converges at step 161, the reverse network hits the cap at step 3000 without converging. CCA score: $161 - 3000 = -2839$, strong prediction of $X \to Y$.
the forward loss curve shows smooth geometric decay: rapid initial descent when the network is far from the function and gradients are large, slowing as it approaches the minimum, crossing $\tau = 0.05$ at step 161. the reverse loss curve descends initially, then plateaus just above $\tau$ around step 300 and stays there until the cap. this plateau is the non-separable gradient covariance in action: the optimizer reaches the vicinity of $\mathcal{L}^*_\mathrm{rev}$, and position-dependent residual correlation prevents clean progress toward $\tau$.
six architectures across five DGPs and five seeds each:
| architecture | $Y = \sin(X)$ | $Y = e^{0.5X}$ | $Y = X^3$ (z-scored) | $Y = X^2$ (boundary) | linear Gaussian |
|---|---|---|---|---|---|
| MLP-64-64-Tanh / Adam | 5/5 | 5/5 | 4/5 | BC | 0/5 |
| MLP-128-128-Tanh / Adam | 5/5 | 5/5 | 4/5 | BC | 0/5 |
| MLP-32-32-32-Tanh / Adam | 5/5 | 5/5 | 3/5 | BC | 0/5 |
| MLP-64-64-ReLU / Adam | 5/5 | 5/5 | 3/5 | BC | 0/5 |
| MLP-64-64-Tanh / SGD | 5/5 | 5/5 | 5/5 | BC | 0/5 |
| MLP-64-64-Tanh / RMSProp | 5/5 | 5/5 | 4/5 | BC | 0/5 |
| total | 30/30 | 30/30 | 23/30 | BC | 0/30 |
sine and exponential: 30/30 across all six configurations, the most important robustness result in the paper. tanh vs ReLU activations, widths of 32 vs 128 neurons, depths of 2 vs 3 layers, Adam vs SGD vs RMSProp: none of it changes the answer. this is evidence that the convergence asymmetry is a property of the optimization landscape geometry, not of any particular implementation choice. the cubic failures with z-scoring (23/30 across architectures) are within expected variance: the theorem guarantees $E[T_\mathrm{fwd}] < E[T_\mathrm{rev}]$ in expectation across seeds, not on every individual seed. a seed that initializes the forward network in a high-curvature region of $X^3$ can take unusually many steps from that starting point without contradicting the theorem.
cubic with z-scoring: 23/30. some seeds fail because the cubic is injective with small noise, so $Y$ pins down $X$ tightly and the reverse minimum is low enough that the reverse network can also reach the threshold; which network gets there first then depends on initialization.
linear Gaussian: 0/30. Hoyer et al. (2009) proved this is unavoidable: for linear Gaussian ANMs, both the forward and reverse problems have identical mathematical structure, the joint distribution is symmetric with respect to direction, and no asymmetry can be detected.
$Y = X^2$: 0/10, systematically wrong:
| seed | $T_\mathrm{fwd}$ | $T_\mathrm{rev}$ | result |
|---|---|---|---|
| 0 | 1000 | 9 | wrong |
| 1 | 1000 | 8 | wrong |
| 2 | 1000 | 20 | wrong |
| 3 | 1000 | 21 | wrong |
| 4 | 1000 | 9 | wrong |
| 5 | 1000 | 22 | wrong |
| 6 | 1000 | 25 | wrong |
| 7 | 1000 | 8 | wrong |
| 8 | 1000 | 23 | wrong |
| 9 | 1000 | 8 | wrong |
every seed: forward hits the cap at step 1000, reverse converges in 8 to 25 steps. the reverse collapses to predicting zero because that is the optimal solution for a symmetric non-injective mechanism.
a correction on the $Y = X^2$ result: the paper records 30/30 on the non-injective boundary condition column, which looks like a success but is the opposite. the reverse network wins all 30 seeds by collapsing to zero, giving a strong negative CCA score that predicts $Y \to X$, the wrong direction. the paper’s own footnote flags this: “non-injective BC: reverse target collapses to zero, appears correct by sign but for wrong reasons.” it is a 30/30 systematic failure in the wrong direction, not a 10/10 failure as written in the blog.
Experiment 4: what the gradient variance data shows
Lemma 2 claims the reverse direction has a harder optimization landscape: higher population minimum, heteroscedastic noise floor, non-separable gradient covariance. experiment 4 measures gradient norm variance directly to test this. the ratio measured is $\sigma^2_{\nabla, \mathrm{rev}} / \sigma^2_{\nabla, \mathrm{fwd}}$, the reverse-to-forward gradient variance ratio, at initialization and after 200 training steps.
| DGP | ratio at initialization | ratio at mid-training |
|---|---|---|
| $Y = X^3 + \varepsilon$ (z-scored) | 0.376 | 0.054 |
| $Y = X^3 + \varepsilon$ (not z-scored) | 0.003 | 0.000 |
| $Y = \sin(X) + \varepsilon$ | 1.161 | 2.179 |
| $Y = e^{0.5X} + \varepsilon$ | 0.684 | 0.276 |
| $Y = X^2 + \varepsilon$ | 0.730 | 0.608 |
the cubic DGP shows a ratio below 1 at both phases, even with z-scoring. this initially looks like it contradicts Lemma 2, but the claim in Lemma 2 is about landscape structure: higher minimum loss and a structural noise floor. it is not a claim that instantaneous gradient variance is larger in the reverse direction at every training step.
what's happening with the cubic is a competing mechanism. the forward network is learning $x \mapsto x^3$, a steeply nonlinear function with Jacobian $f'(x) = 3x^2$, which grows without bound for large $\lvert x \rvert$. large Jacobian norms produce large gradient variance directly, because small changes in parameters produce large changes in outputs over that steep slope. the reverse network is fitting $E[X \mid Y] \approx 0$, a near-constant function by symmetry of the input distribution, which has small Jacobian norms and therefore small gradient variance regardless of the residual correlation structure. so for this specific DGP, instantaneous gradient variance is dominated by Jacobian magnitude, not residual correlation.
the sine DGP is the case that clarifies the picture. the ratio grows from 1.161 at initialization to 2.179 at mid-training. for the sine, the reverse target $E[X \mid Y]$ involves averaging over multiple branches of the inverse sine, producing a genuinely non-trivial multi-valued function with large Jacobians in the reverse network. the non-separable covariance then dominates over the Jacobian effect, and the ratio exceeds 1 and grows over training.
the takeaway is that convergence time is the right experimental proxy for landscape asymmetry, not instantaneous gradient statistics at a single snapshot. the 19-fold convergence time gap on the cubic (forward converges at step 161, reverse capped at 3000) is the signal the theory is about. no single gradient variance measurement captures it because gradient variance at one step conflates the Jacobian structure of the specific function being learned with the residual correlation structure that Lemma 2 describes.
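for concreteness, here is one way such a ratio can be estimated. this is a hedged sketch, not the paper's protocol: gradients are taken only with respect to a linear readout on fixed random tanh features, which keeps the code short but means the numbers will not match the table above:

```python
import numpy as np

def grad_norm_variance(x, y, n_batches=200, batch=32, hidden=64, seed=0):
    """Variance of minibatch gradient norms for a linear readout on fixed
    random tanh features of x, at initialization. A simplified proxy for the
    full-MLP measurement in experiment 4 (illustrative, not the paper's code)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(1, hidden))
    b = rng.normal(size=hidden)
    w = rng.normal(size=hidden) / np.sqrt(hidden)     # readout weights at init
    H = np.tanh(x.reshape(-1, 1) @ W + b)             # fixed random features
    norms = []
    for _ in range(n_batches):
        idx = rng.integers(0, len(x), batch)
        resid = H[idx] @ w - y[idx]
        g = 2.0 * (H[idx] * resid[:, None]).mean(axis=0)   # dMSE/dw on the batch
        norms.append(np.linalg.norm(g))
    return float(np.var(norms))

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 2000)
y = np.sin(x) + 0.1 * rng.normal(size=2000)
zx, zy = (x - x.mean()) / x.std(), (y - y.mean()) / y.std()

ratio = grad_norm_variance(zy, zx) / grad_norm_variance(zx, zy)  # rev / fwd
print("reverse/forward gradient-norm variance ratio:", round(ratio, 3))
```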
The Tubingen benchmark: 96% on real-world data
the Tubingen cause-effect pairs dataset was assembled at the Max Planck Institute for Biological Cybernetics. it contains 108 real-world variable pairs with known causal directions from 37 domains: altitude and temperature at German weather stations, age and height, ozone and UV radiation, brain weight and body weight, age and blood pressure, barometric pressure and precipitation, horsepower and fuel consumption, and more. ground truth was established from domain knowledge and known physical or biological mechanisms, not from statistics. it has been the standard benchmark for bivariate causal discovery since Mooij et al. published the definitive evaluation in JMLR 2016.
CCA setup: both variables z-scored, MLP-64-64-Tanh/Adam, $\tau = 0.05$, $T_\mathrm{max} = 10{,}000$ steps, 5 random seeds per pair, mean CCA score as the direction signal. the cap was extended to 10,000 from the 3,000 used on synthetic data because real-world mechanisms are messier and noisier, and some pairs need more steps before the asymmetry is detectable.
| method | accuracy |
|---|---|
| CCA (this paper) | 96% |
| majority-class baseline | 72.2% |
| ANM / RESIT (Mooij et al. 2016) | 63% |
| IGCI (Janzing & Scholkopf 2010) | ~60% |
| chance | 50% |
96% on 108 pairs from one hyperparameter configuration: MLP-64-64-Tanh/Adam with $\tau = 0.05$ and $T_\mathrm{max} = 10{,}000$. this is 24 points above the trivial majority baseline and 33 points above RESIT, which fails below baseline. wrong predictions cluster at CCA scores near zero, at near-linear mechanisms, and at near-symmetric marginal distributions, exactly where the theory predicts the signal should be weakest.
one important caveat: this result comes from a single hyperparameter configuration with no ablation over $\tau$ and $T_\mathrm{max}$ on the benchmark itself. real-world mechanisms vary in complexity in ways synthetic DGPs don't, and how sensitive the 96% figure is to the choice of threshold and training cap on varied real-world data is not characterized. a proper sensitivity analysis would sweep these values. the result is strong and the failure pattern is theoretically coherent, but the hyperparameter sensitivity question remains open.
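given per-seed convergence times, the per-pair direction call reduces to a sign test on an aggregate score. the paper's exact normalization isn't reproduced here; a log step-ratio averaged over seeds is one plausible form consistent with the convention that a negative score means forward-faster:

```python
import math

def cca_score(t_fwd_per_seed, t_rev_per_seed):
    """Mean log(T_fwd / T_rev) across seeds: negative means the forward
    (causal) direction converged faster. The log-ratio form is an
    illustrative choice, not necessarily the paper's exact normalization."""
    ratios = [math.log(tf / tr) for tf, tr in zip(t_fwd_per_seed, t_rev_per_seed)]
    return sum(ratios) / len(ratios)

def decide(t_fwd_per_seed, t_rev_per_seed):
    return "X -> Y" if cca_score(t_fwd_per_seed, t_rev_per_seed) < 0 else "Y -> X"

# an injective-mechanism pair: forward converges well before the cap
print(decide([161, 180, 150, 172, 166], [3000] * 5))   # X -> Y
# the non-injective quadratic failure mode: reverse collapses in ~10 steps
print(decide([1000] * 5, [9, 8, 20, 21, 9]))           # Y -> X (wrong)
```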
Going beyond bivariate: the CCL framework
CCA answers the bivariate direction question. the CCL (Causal Compression Learning) framework embeds CCA in a larger pipeline that targets the full problem of learning a causal graph, building a causally faithful compressed representation, and optimizing a policy that reasons about interventions. understanding why CCL needs all four components requires seeing where each component fails alone.
MDL alone compresses spurious correlations. Kolmogorov complexity $K(x)$ is the length of the shortest program outputting $x$ on a universal Turing machine. Solomonoff (1964) showed the optimal prior over hypotheses weights each by $2^{-K(h)}$: shorter descriptions are exponentially more probable. Rissanen (1978) gave the computable approximation: minimize $\lvert M \rvert + \lvert D \mid M \rvert$. the Algorithmic Markov Condition says the true causal factorization $P(X)P(Y \mid X)$ has shorter total Kolmogorov complexity than $P(Y)P(X \mid Y)$.
the failure mode: when a hidden confounder $Z$ causes both $X$ and $Y$ with no direct $X \to Y$ edge, the shortest description of the joint distribution $P(X, Y)$ captures the $X$-$Y$ correlation. MDL sees that correlation and compresses it by inferring a direct edge $X \to Y$, even though the association comes entirely through $Z$. MDL without a confounding-aware causal structure learns the wrong graph.
PAC learning alone certifies statistical error, not causal error. Valiant's 1984 PAC framework says an algorithm PAC-learns a hypothesis class $\mathcal{H}$ if with probability $1-\delta$ over $n$ samples it outputs $\hat{h}$ with error at most $\varepsilon$ when:
\[n \geq C \cdot \frac{d_\mathrm{VC}(\mathcal{H}) + \log(1/\delta)}{\varepsilon^2}\]VC dimension $d_\mathrm{VC}$ is the size $m$ of the largest set of points the class can classify in all $2^m$ possible ways. it measures statistical capacity, not causal fidelity. a model that predicts $Y$ from $X$ through a hidden confounder $Z$ has the same VC dimension as one using the true direct mechanism: both achieve the same statistical accuracy on training data. when you intervene and force $X$ to change (severing the confounder's influence), the confounded model fails completely despite satisfying the PAC bound. Theorem D (linear complexity under the Markov condition): the Markov condition gives a product-form g-formula for the causal effect, meaning the joint distribution factorizes over edges. as a result, the causal risk gap decomposes additively over edges:
\[R_{\mathrm{causal}}(\pi_{\mathrm{CCL}}) - R^*_{\mathrm{causal}} \leq \frac{1}{1-\gamma} \sum_{e \in E_G} \delta_e\]where $\delta_e = \lVert P(X_j \mid \mathrm{pa}_j) - \hat{P}(X_j \mid \mathrm{pa}_j) \rVert_{\mathrm{TV}}$ is the estimation error on edge $e$. an error on one edge affects only its local conditional distribution and doesn't compound multiplicatively across all variables. this additive decomposition is why sample complexity scales linearly with $d_c(G)$ rather than combinatorially with the number of variables.
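the additive decomposition can be sanity-checked numerically on a tiny discrete chain (made-up conditionals; perturb each edge's conditional table by a per-edge $\delta_e$ and compare the joint TV distance against the sum):

```python
import numpy as np

# chain X1 -> X2 -> X3, all binary; conditionals are made-up numbers
p1 = np.array([0.6, 0.4])                   # P(X1)
p2 = np.array([[0.8, 0.2], [0.3, 0.7]])     # P(X2 | X1)
p3 = np.array([[0.9, 0.1], [0.4, 0.6]])     # P(X3 | X2)

def perturb(cond, eps):
    """Shift each row of a conditional table by eps (rows stay normalized),
    so each edge's conditional is off by exactly eps in TV distance."""
    return cond + np.array([eps, -eps])

q2, q3 = perturb(p2, 0.05), perturb(p3, 0.03)   # per-edge estimation errors

def joint(m1, m2, m3):
    """Joint distribution of the chain from its factorization."""
    return np.array([[[m1[a] * m2[a, b] * m3[b, c]
                       for c in range(2)] for b in range(2)] for a in range(2)])

tv_joint = 0.5 * np.abs(joint(p1, p2, p3) - joint(p1, q2, q3)).sum()
per_edge_sum = 0.05 + 0.03
print(tv_joint, "<=", per_edge_sum)
```

the joint error never exceeds the per-edge sum, which is the sub-additivity behind the linear scaling.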
CCL replaces $d_\mathrm{VC}$ with $d_c(G)$, the number of edges in the minimal I-map of the causal graph: the smallest subgraph preserving all conditional independences entailed by the true graph. the causal PAC bound:
\[n \geq C \cdot \tau_\mathrm{mix} \cdot d_c(G) \cdot \log(d_c(G)/\delta) \cdot (1-\gamma)^{-3} \cdot \varepsilon^{-2}\]$\tau_\mathrm{mix}$ is the Markov chain mixing time of the policy trajectory. $(1-\gamma)^{-3}$ compounds three RL-specific penalties. $d_c(G)$ replaces VC dimension: sparser causal graphs require less data to learn.
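plugging illustrative numbers into the bound makes the linear dependence on $d_c(G)$ concrete ($C$, $\tau_\mathrm{mix}$, $\gamma$, $\varepsilon$, $\delta$ below are arbitrary choices, not values from the paper):

```python
import math

def causal_pac_n(d_c, tau_mix=10.0, gamma=0.9, eps=0.1, delta=0.05, C=1.0):
    """Sample size from the causal PAC bound; all constants illustrative."""
    return C * tau_mix * d_c * math.log(d_c / delta) / ((1 - gamma) ** 3 * eps ** 2)

sparse = causal_pac_n(d_c=5)    # 5-edge minimal I-map
dense = causal_pac_n(d_c=50)    # 10x more edges
print(dense / sparse)           # 10x more edges -> ~15x more samples (log factor)
```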
Pearl's causal framework alone has no unique graph. the do-calculus is complete (Shpitser and Pearl 2006): if it can't identify a causal effect, no non-parametric method can. but completeness doesn't select the true graph from among Markov-equivalent alternatives. without a complexity penalty, any fully connected DAG is consistent with the identifiability constraints. MDL is needed to select the sparsest consistent graph.
causal IB alone is undefined without a graph. the standard Information Bottleneck (Tishby, Pereira, and Bialek 2000) compresses input $X$ into representation $T$ while preserving prediction of $Y$:
\[\min_T \; I(X; T) - \beta \cdot I(T; Y)\]when a hidden confounder $Z$ causes both $X$ and $Y$, $Z$-correlated features are statistically predictive of $Y$ and will be preserved in $T$ even though they carry no direct causal information. a clinical trial where healthier patients self-select into treatment has health status correlated with recovery. standard IB preserves health status in $T$ because it predicts recovery, even if the drug itself does nothing.
Simoes, Dastani, and van Ommen (2024) fix this by replacing $I(T; Y)$ with the causal mutual information $I_c(Y \mid \mathrm{do}(T))$: the do-operator severs every arrow into $T$, so associations through shared upstream causes contribute nothing. only the direct causal path $T \to Y$ is measured. the causal IB objective:
\[\min_T \; I(X; T) - \beta \cdot I_c(Y \mid \mathrm{do}(T))\]computing $I_c(Y \mid \mathrm{do}(T))$ requires applying the do-operator, which requires a causal graph. the standard IB is computable but measures the wrong thing. the causal IB measures the right thing but needs the graph first.
CCL closes the loop: XGES uses CCA scores to orient edges and MDL to select the graph. MDL uses the graph to distinguish confounders from direct effects. causal IB uses the graph to compute $I_c$. PAC bounds use $d_c(G)$ from the graph. each component provides what the others need.
Theorem S (MDL efficiency inheritance): for linear Gaussian SCMs and additive noise models with sub-Gaussian noise, the MDL approximation gap satisfies $\lvert M(G) - K(G) \rvert \leq O(\log n)$, where $M(G)$ is the computable MDL score and $K(G)$ is the true Kolmogorov complexity. against hypothesis classes that model the full joint distribution without causal factorization, CCL under MDL achieves the causal optimum within poly(n) overhead while non-parametric learners require exponentially more samples. the paper is explicit that this advantage does not extend to parametric learners that already factorize over edges, so it’s not a claim of general sample efficiency superiority.
The CCL+ objective
\[\min_{G, T, \pi} \; \mathcal{L}_\mathrm{CCL+} = \underbrace{-E_\pi[R(Y)]}_{\text{term 1}} + \underbrace{\lambda_1\left[I(X;T) - \beta \cdot I_c(Y \mid \mathrm{do}(T))\right]}_{\text{term 2}} + \underbrace{\lambda_2 \cdot \mathrm{MDL}(G)}_{\text{term 3}} + \underbrace{\lambda_3 \cdot \mathrm{CCA}(G)}_{\text{term 4}}\]subject to: $P(Y \mid \mathrm{do}(\pi))$ identifiable in $G$ via do-calculus.
term 1 is the standard causal RL objective: maximize expected reward under policy $\pi$. the identifiability constraint is hard: if the do-calculus cannot compute $P(Y \mid \mathrm{do}(\pi))$ from the current graph structure, the policy optimization step is skipped for that iteration.
term 2 is the causal IB: compress $X$ into $T$ by minimizing $I(X; T)$ while maximizing the causal information $I_c(Y \mid \mathrm{do}(T))$. features predictive through confounders are penalized. features predictive through actual causal mechanisms are rewarded. implemented via a VAE with diagonal Gaussian posterior covariance, which is mathematically required to prevent the encoder from merging distinct causal parents into a single entangled latent dimension.
term 3 is MDL on the graph, approximated by BIC: $\mathrm{MDL}(G) \approx -\ell(G) + \frac{\lvert E_G \rvert}{2}\log n$. this penalizes complex graphs and drives toward sparser solutions. Kaltenpoth and Vreeken showed that for linear Gaussian SCMs, refined MDL is asymptotically equivalent to BIC, connecting this to classical model selection theory.
term 4 is CCA: it adds a bonus or penalty on top of the MDL score for each candidate edge orientation in XGES. negative CCA score on edge $(i,j)$ (forward faster) adds a bonus for $X_i \to X_j$. positive CCA score adds a penalty. XGES then searches for the graph maximizing $\mathrm{MDL}(G) + \lambda_3 \cdot \mathrm{CCA}(G)$.
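the BIC approximation in term 3 is easy to state concretely for linear Gaussian mechanisms. a minimal sketch (OLS per node given its parents, Gaussian log-likelihood, plus the $\frac{\lvert E_G \rvert}{2}\log n$ penalty; the helper names and the in-sample decorrelation are mine, the latter only to make the spurious-edge comparison deterministic):

```python
import numpy as np

def node_loglik(data, j, parents):
    """Gaussian log-likelihood of column j given an OLS fit on its parents."""
    n = data.shape[0]
    y = data[:, j]
    if parents:
        X = np.column_stack([np.ones(n)] + [data[:, p] for p in parents])
        resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    else:
        resid = y - y.mean()
    s2 = resid.var()
    return -0.5 * n * (np.log(2 * np.pi * s2) + 1)

def mdl_score(data, graph):
    """MDL(G) ~= -loglik(G) + |E_G|/2 * log n, with graph = {node: parent list}."""
    n = data.shape[0]
    n_edges = sum(len(pa) for pa in graph.values())
    ll = sum(node_loglik(data, j, pa) for j, pa in graph.items())
    return -ll + 0.5 * n_edges * np.log(n)

rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
x2 = 1.5 * x1 + 0.5 * rng.normal(size=1000)   # real edge X1 -> X2
x3 = rng.normal(size=1000)
x1c, x3 = x1 - x1.mean(), x3 - x3.mean()
x3 -= (x3 @ x1c) / (x1c @ x1c) * x1c          # kill in-sample X1-X3 correlation
data = np.column_stack([x1, x2, x3])

true_g   = {0: [], 1: [0], 2: []}
no_edge  = {0: [], 1: [],  2: []}
spurious = {0: [], 1: [0], 2: [0]}            # adds a spurious X1 -> X3 edge
print(mdl_score(data, true_g), mdl_score(data, no_edge), mdl_score(data, spurious))
```

the real edge buys a likelihood gain that dwarfs its penalty; the spurious edge buys nothing and pays $\frac{1}{2}\log n$, so the true graph scores lowest.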
MDL regularization threshold theorem. at any local minimum of $\mathcal{L}_\mathrm{CCL}$, if $\lambda_2 \geq \frac{(1-\gamma)\log\lvert V \rvert}{\lvert E_\mathrm{max} \rvert}$, spurious edges are excluded asymptotically as $n \to \infty$. the argument: a spurious edge gains at most $O(\log n / n) \to 0$ in MDL likelihood at finite $n$. adding a spurious edge costs at least $(1-\gamma)\log\lvert V \rvert$ in MDL complexity. above the threshold, the complexity cost exceeds the likelihood gain asymptotically. for the three-variable experiment ($\lvert V \rvert = 3$, $\lvert E_\mathrm{max} \rvert = 3$, $\gamma = 0.9$), the threshold is approximately $\frac{0.1 \cdot \ln 3}{3} \approx 0.037$.
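the threshold arithmetic for the three-variable experiment, checked directly:

```python
import math

gamma, n_vars, e_max = 0.9, 3, 3
lam2_threshold = (1 - gamma) * math.log(n_vars) / e_max
print(round(lam2_threshold, 4))   # 0.0366, matching the ~0.037 in the text
```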
CCA feedback bound. for any two graphs differing by a single edge flip:
\[|\mathcal{L}_\mathrm{CCL+}(G') - \mathcal{L}_\mathrm{CCL+}(G)| \leq \frac{\lambda_3 T_\mathrm{max}}{1-\gamma} + \lambda_1 C_\mathrm{IB} + \lambda_2 \log|V|\]$T_\mathrm{max} < \infty$ by the training cap. $C_\mathrm{IB}$ is bounded by Lipschitz continuity of $I_c$. this bound is what allows convergence analysis: Zangwill's 1969 theorem applies when the objective is bounded below, each coordinate update decreases the objective, and the change per update is bounded. the convergence theorem: $\mathcal{L}_\mathrm{CCL+}$ converges to a stationary point under alternating coordinate descent for all $\lambda_3 \geq 0$.
the CCL+ optimization has a bilevel structure: graph selection (upper level) depends on CCA scores from neural network training (lower level). this means the graph search and the neural network training are not independent: every time XGES considers flipping an edge, it needs to train two networks to convergence to measure the CCA score for that candidate orientation. Ji, Yang, and Liang (2021) proved convergence for bilevel problems when the lower level is strongly convex; the CCA feedback bound (equation above) provides the analogous result for the nonconvex neural network case.
Faithfulness preservation under causal IB compression
faithfulness is the assumption that every conditional independence in the data distribution comes from a d-separation in the true causal graph, with no accidental cancellations. without faithfulness, conditional independence tests used in graph learning algorithms like PC-stable give unreliable results.
the concern: the causal IB compression step might destroy faithfulness. if the encoder merges two distinct causal parents $X_i$ and $X_j$ into one latent dimension $T$, the d-separations of the compressed distribution might not match the true graph.
theorem (faithfulness preservation): under four conditions, (1) each identifiable edge has positive causal mutual information, (2) the encoder uses diagonal Gaussian covariance, (3) $P(X, Y)$ has full support, (4) the encoder converges at rate $O(1/k)$, every local minimum of the causal IB objective near the optimal encoder satisfies faithfulness between $P(T^{(k)})$ and the true causal graph up to $\varepsilon_G$ in total variation as $k \to \infty$.
proof sketch: forward direction. if $X_i \perp X_j \mid Z$ in the true graph (a d-separation), the data processing inequality plus the diagonal covariance (which prevents merging distinct parents) gives $T_i \perp T_j \mid \mathrm{enc}(Z)$ in $P(T)$. reverse direction. if $T_i \perp T_j \mid T_Z$ in $P(T^{(k)})$ but $X_i$ is not d-separated from $X_j$ given $Z$ in the true graph, then by faithfulness of $P(V)$, $I(X_i; X_j \mid Z) > 0$, and by convergence of the encoder and continuity of $I_c$, $I(T_i; T_j \mid T_Z) > 0$ for large $k$, giving a contradiction.
the diagonal covariance is doing real mathematical work here: standard VAEs use diagonal Gaussian posteriors precisely because factored latent spaces prevent collapsing multiple causal parents into one entangled direction. the local convergence caveat is significant: this holds near the optimal encoder, not globally from arbitrary initialization.
the MDL-R lemma shows graph learning stays consistent as the encoder evolves: if the encoder converges at rate $\lVert \mathrm{enc}_k - \mathrm{enc}^* \rVert_\infty \leq C/k$, the MDL-selected graph converges to the true graph as $n, k \to \infty$ jointly. the perturbation of MDL scores from using $T^{(k)}$ instead of $T^*$ is $O(n/k)$, while the MDL gap between correct and incorrect graphs grows as $O(\log n)$. for $k \geq \sqrt{n}$, the perturbation shrinks faster than the gap, so correct graph recovery probability goes to 1.
The four-stage algorithm
stage 0. run PC-stable (Colombo and Maathuis 2014) on observational data to recover the undirected skeleton: which pairs of variables are connected, without direction information. PC-stable recovers the skeleton with probability $1 - \delta$ for:
\[n \geq O\left(\frac{\log(|V|^2 / \delta)}{\alpha^2}\right)\]where $\alpha$ is the minimum partial correlation magnitude. PC-stable is order-independent (unlike the original PC algorithm) and handles high-dimensional variable sets.
stage 1. train the causal IB encoder $T_0$ via variational optimization conditioned on the skeleton $G_0^\mathrm{skel}$ from stage 0. use a VAE with diagonal covariance. the generalized EM (GEM) property guarantees monotone decrease of the causal IB objective during training.
stage 2. run XGES with scoring function $\mathrm{Score}(G) = \mathrm{MDL}(G) + \lambda_3 \cdot \mathrm{CCA}(G)$ over $T_0$ and any available interventional data. XGES (Nazaret and Blei 2024) is an extremely greedy equivalence search extending GES (Chickering 2002) with improved computational efficiency while preserving correctness guarantees. scoring each candidate edge requires training two MLPs and measuring their convergence times.
stage 3. optimize policy $\pi_1$ via Bareinboim CRL (2024) over $G_1$ and $T_0$. perform an identifiability check before each policy gradient step: if $P(Y \mid \mathrm{do}(\pi))$ is not identifiable in the current graph, skip the update.
alternating loop. update $T$ by minimizing $I(X;T) - \beta I_c(Y \mid \mathrm{do}(T))$, update $G$ using XGES with MDL + CCA scoring, update $\pi$ via policy gradient subject to identifiability, repeat until $\lvert \Delta\mathcal{L}_\mathrm{CCL+} \rvert \leq \varepsilon_\mathrm{conv}$.
output. causal graph $G$, compressed representation $T$, causal policy $\pi$, and a sample complexity certificate:
\[n^* = C \cdot \hat{\tau}_\mathrm{mix} \cdot d_c(G) \cdot \frac{\log(d_c(G)/\delta)}{(1-\gamma)^3 \varepsilon^2}\]
Experiment 2: convergence of the CCL loop
the convergence theorem says the full CCL+ alternating loop reaches a stationary point. this experiment tests it on a three-variable SCM: $X_1 \to X_2 \to X_3$ and $X_1 \to X_3$, with mechanisms $X_2 = X_1^2 + \varepsilon_1$ and $X_3 = X_2 + 0.5 X_1 + \varepsilon_2$, $N = 1000$ samples, sweeping seven values of $\lambda_2$ from 0.01 to 0.5.
| $\lambda_2$ | monotone decrease | $\mathcal{L}$ initial | $\mathcal{L}$ final | iterations | spurious edges |
|---|---|---|---|---|---|
| 0.010 | yes | 10.000 | -1.238 | 3 | 1 |
| 0.050 | yes | 10.000 | -1.106 | 3 | 1 |
| 0.080 | yes | 10.000 | -1.007 | 3 | 1 |
| 0.100 | yes | 10.000 | -0.942 | 3 | 1 |
| 0.150 | yes | 10.000 | -0.777 | 3 | 1 |
| 0.200 | yes | 10.000 | -0.612 | 3 | 1 |
| 0.500 | yes | 10.000 | +0.105 | 1 | 0 |
all seven runs show strictly monotone decrease. the objective goes down at every iteration without exception, confirming the convergence theorem.
the spurious edge story is equally clean. one spurious edge persists at $\lambda_2 \leq 0.2$ and disappears at $\lambda_2 = 0.5$. the theoretical threshold for this SCM is approximately 0.037, and $\lambda_2 = 0.5$ is well above it. the reason spurious edges survive below 0.5 despite being above the asymptotic threshold is that the non-injective mechanism $X_2 = X_1^2$ produces a misleading CCA signal (the reverse network collapses to predicting zero in under 25 steps), which partially offsets the MDL penalty. the monotone convergence holds regardless of this, since the theorem guarantees the objective decreases, not that the graph is correct.
the CCL validation is currently limited to this one three-variable synthetic experiment. the causal IB, PAC bound, faithfulness preservation, and policy optimization components are proved theoretically but not yet validated empirically on multi-variable benchmark datasets. this is an honest limitation.
Honest accounting
dimensional scope. the full theory and all experiments are for scalar bivariate variables. extending to vector-valued inputs $\mathbf{Y} = f(\mathbf{X}) + \boldsymbol{\varepsilon}$ with $\mathbf{X} \in \mathbb{R}^p$ requires re-examining injectivity conditions for high-dimensional functions (far harder to verify or even define in general), the law of total variance as a matrix decomposition, the PL condition for multivariate network regression, and whether element-wise z-scoring is sufficient or whether PCA/whitening is needed when input dimensions are correlated. none of this is worked out yet.
injectivity requirement. 0/10 on $Y = X^2$. saturating functions in biology, sigmoid-like enzyme kinetics, threshold effects, periodic functions: many real mechanisms are plausibly non-injective over parts of their domain. a practical deployment of CCA requires an injectivity screening step before committing to a direction. the paper does not propose one. a candidate approach would be checking whether forward residuals show multimodal structure as a function of the input, which would indicate that different input values are mapping to the same output region. this is speculative and unvalidated.
hyperparameter sensitivity on Tubingen. the 96% result comes from one configuration of $\tau$ and $T_\mathrm{max}$. there is no ablation over these hyperparameters on the benchmark. how sensitive accuracy is to the threshold and training cap on varied real-world mechanisms is unknown. this analysis needs to be done.
local PL condition. the theorem gives a lower bound on the step count gap using local PL near minima. the actual gap is 19-fold on the cubic. the bound only guarantees existence of a gap of order $\Omega(1/\eta\mu)$. for quantitative predictions beyond existence, you would need global landscape characterization, which for neural networks is an open problem.
mixing time estimation. $\tau_\mathrm{mix}$ in the causal PAC bound is estimated from the observed policy trajectory. standard spectral gap or empirical autocorrelation estimators give noisy results, and the sample complexity certificate is sensitive to this estimate.
interventional data required. stages 2 and 3 of CCL need interventional data: actual experiments where variables are forced to take specific values. bivariate CCA works on purely observational data, but the full framework does not.
asymptotic guarantees only. the MDL threshold theorem, faithfulness preservation, and PAC bound all hold as $n \to \infty$. finite-sample convergence rates are not characterized, and this is listed as an explicit open problem in the paper.
The thermodynamic analogy
in statistical mechanics, computing forward in time is tractable: given the current state of a system and the laws of dynamics, predicting future states follows the direction of entropy increase. the second law says entropy increases in isolated systems, so the forward direction is thermodynamically natural. reconstructing past states from the current one requires inverting dynamics that have destroyed information, and that inversion is computationally hard because the entropy that was generated cannot be recovered without an external record of what was lost.
the mechanism $Y = f(X) + \varepsilon$ adds noise irreversibly. once $Y = f(X) + \varepsilon$ is formed, the information about which specific combination of $(X, \varepsilon)$ produced it is entangled inside $Y$. for injective $f$, that information is in principle recoverable if you knew $\varepsilon$ exactly, but the entanglement makes extraction expensive: you can’t just read off $X$ from $Y$ without solving a deconvolution problem. the neural network’s convergence difficulty in the reverse direction parallels the physical difficulty of reconstructing past states from current ones: both require undoing something that got mixed together irreversibly.
Kolmogorov complexity sharpens this analogy. the causal description consists of three independent components: $P(X)$, mechanism $f$, and $P(\varepsilon)$. because these three components are generated independently by nature, they compress independently, and the total description length is the sum of three separate short programs. the anticausal description must encode $P(Y)$ and then $P(X \mid Y)$, where $P(X \mid Y)$ requires knowing both the forward mechanism and the noise distribution jointly in order to describe how to invert the noisy nonlinear function. that joint encoding is longer because the two components are entangled: you can’t describe the reverse mechanism without referencing the noise structure it’s trying to undo. shorter programs are simpler computations. simpler computations converge faster. the forward direction is computationally simpler in a precise information-theoretic sense, and that simplicity manifests directly as faster neural network training.
no formal proof connecting neural network optimization landscapes to thermodynamic irreversibility currently exists. the analogy is structurally suggestive and may point toward a deeper principle: wherever a generative process has the form “mechanism applied to cause, then independent noise added,” the forward direction is computationally privileged, and that privilege should show up in any learning algorithm that is sensitive to the structure of the optimization landscape.
Comparison
| system | rung 2 capable | PAC bound | sample efficiency | OOD robust | causal direction | scope |
|---|---|---|---|---|---|---|
| GPT-4 / DeepSeek R1 | no | no | no | no | no | observational corpus |
| Bareinboim CRL | yes | no | no | yes | no (takes graph as input) | interventional |
| ANM / RESIT | partial | no | partial | partial | yes | bivariate |
| SkewScore (Lin 2025) | no | no | partial | partial | yes (heteroscedastic) | bivariate |
| CCA / CCL+ | yes† | yes† | yes†‡ | yes† | yes | injective ANM |
the † markers mean theoretical only: the PAC bound, sample efficiency, OOD robustness, and rung-2 capability of CCL+ are all proved but not yet validated empirically beyond the Tubingen benchmark. the ‡ marker means the sample efficiency advantage applies specifically against non-parametric learners that model the full joint distribution without causal factorization. parametric learners that already factorize over edges do not face the exponential sample requirement that CCL+ avoids.
GPT-4 appears here not as a competitor but to make Pearl's impossibility concrete. GPT-4 is trained on observational data and by theorem cannot answer interventional questions from training alone. Bareinboim CRL is the most relevant prior work: it handles rung-2 reasoning and out-of-distribution robustness but requires the causal graph as input. CCL+ adds graph learning using CCA for edge orientation, MDL for graph complexity, and causal IB for faithful representations.
What this could do in practice
medicine and drug development. the clearest application is separating drug effects from patient selection effects without a randomized trial. biomarker causation: does this protein cause the disease, or does the disease upregulate the protein? the primary practical requirement is an injectivity screening step, since biological mechanisms often saturate or have threshold behaviors over parts of their range.
economics and policy evaluation. does education cause higher earnings, or do families with more resources invest in both simultaneously? does a minimum wage increase cause unemployment, or do strong labor markets adopt higher minimum wages first? instrumental variable methods address similar questions but require strong assumptions about the instrument. CCA makes different assumptions and requires an approximately nonlinear injective mechanism, which is plausible for many economic relationships in the observed range.
genetics and gene regulation. does gene A regulate gene B, or is co-expression from a shared upstream factor? CCA on observational expression data could prioritize which regulatory edges to target before running expensive perturbation experiments like CRISPR screens. the CCL confounding framework is designed specifically for situations where unmeasured genetic background produces spurious correlations between genes with no direct relationship.
climate and environmental science. extending CCL to time series with feedback loops requires additional theory beyond the static ANM setting, but the core convergence asymmetry signal may survive in settings where a lagged causal structure creates directional asymmetry in the optimization landscape.
Toward rung 3
CCL operates on rung 2. rung 3 requires counterfactual reasoning: $P(Y_{X=x'} \mid X = x, Y = y)$, the probability of what $Y$ would have been under a different treatment, given what was actually observed. both worlds, actual and counterfactual, are never simultaneously observable.
Pearl's abduction-action-prediction cycle handles this. step 1, abduction: given the observed $(X = x, Y = y)$ and the causal model, infer the specific background noise variables $U$ that must have been operating to produce this observation. this step inverts the structural equations. step 2, action: modify the causal model by forcing $X$ to $x'$, leaving the inferred $U$ fixed. step 3, prediction: compute what $Y$ would be in the modified model with the inferred $U$. the result is the counterfactual outcome.
twin-network models implement this computationally: two copies of the structural causal model run in parallel sharing the inferred noise variables, one representing the actual world and one the counterfactual world with the intervention applied. reading off the difference gives the counterfactual effect.
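the cycle is short enough to sketch directly for an additive-noise structural equation $Y = f(X) + U$. the mechanism `f` below is a hypothetical stand-in, not anything from the paper; the twin-network view is this same computation, with the actual and counterfactual copies of the model sharing the inferred noise `u`.

```python
# abduction-action-prediction for an additive-noise SCM: Y = f(X) + U.
# f is a hypothetical illustrative mechanism, not the paper's.
def f(x):
    return 2.0 * x + 1.0

def counterfactual_y(x_obs, y_obs, x_new):
    # step 1, abduction: invert Y = f(X) + U for the noise that produced the observation
    u = y_obs - f(x_obs)
    # step 2, action: replace X's structural equation with the constant x_new, keeping u fixed
    # step 3, prediction: evaluate Y in the modified model
    return f(x_new) + u

# actual world: observed X = 1, Y = 3.4, so the inferred noise is u = 0.4.
# counterfactual world: same u, but X forced to 2.
print(counterfactual_y(1.0, 3.4, 2.0))  # ≈ 5.4
```

the additive-noise form is what makes abduction a one-line inversion here; for general structural equations, step 1 requires inverting the mechanism in the noise argument, which is exactly why CCL's injectivity-style assumptions matter downstream.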
CCL is infrastructure for this, not an alternative. you need the correct causal graph before abduction can invert the structural equations. you need the compressed representation $T$ to be faithful to the graph structure. CCA provides the edge directions, CCL provides the full graph and the faithful representation, and twin-network counterfactual inference builds on top of that foundation. extending CCL to rung 3 by adding a twin-network abduction module as a stage 4 after policy optimization is the natural next step.
Conclusion
the central observation is that training a neural network in the true causal direction is an easier optimization problem than training in the reverse, not by a small amount, but by a large and structurally explainable amount.
Lemma 1: reverse residuals remain correlated with the input throughout training for any finite approximation of the reverse function. Lemma 2: that correlation creates a higher irreducible loss floor and non-separable gradient covariance in the reverse direction. Lemma 3: a harder optimization landscape requires more expected steps under the Polyak-Łojasiewicz condition. chain complete.
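the comparison itself is easy to sketch end to end. the following is a simplified proxy for the paper's procedure, assuming a synthetic injective ANM ($Y = e^X + \text{noise}$), a tiny one-hidden-layer numpy MLP, and a final-loss comparison in place of the paper's actual CCA score and architectures; every hyperparameter here is illustrative.

```python
import numpy as np

def train_mlp(inp, out, hidden=16, steps=2000, lr=0.05, seed=0):
    """Full-batch gradient descent on a one-hidden-layer tanh MLP; returns the MSE curve."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, (1, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.5, (hidden, 1)); b2 = np.zeros(1)
    X, Y = inp.reshape(-1, 1), out.reshape(-1, 1)
    n, losses = len(X), []
    for _ in range(steps):
        H = np.tanh(X @ W1 + b1)             # forward pass
        err = H @ W2 + b2 - Y
        losses.append(float((err ** 2).mean()))
        gP = 2.0 * err / n                   # backward pass
        gW2, gb2 = H.T @ gP, gP.sum(0)
        gH = (gP @ W2.T) * (1.0 - H ** 2)
        gW1, gb1 = X.T @ gH, gH.sum(0)
        W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2
    return np.array(losses)

rng = np.random.default_rng(1)
x = rng.uniform(-2.0, 2.0, 500)
y = np.exp(x) + 0.05 * rng.normal(size=500)  # injective nonlinear ANM: X -> Y
xs = (x - x.mean()) / x.std()                # z-score both variables so the
ys = (y - y.mean()) / y.std()                # two regressions are comparable

fwd = train_mlp(xs, ys)   # causal direction:      predict Y from X
rev = train_mlp(ys, xs)   # anti-causal direction: predict X from Y
# a positive difference favors X -> Y under this simplified final-loss proxy
print(fwd[-1], rev[-1], rev[-1] - fwd[-1])
```

note that the paper's signal is convergence *time*, while this proxy only exposes the loss-floor half of the asymmetry (Lemma 2); tracking steps-to-threshold on the same two curves would be the time-based analogue.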
the experiments confirm the predictions and their limits simultaneously: 30/30 on sine and exponential across six architectures, 23/30 on the cubic with z-scoring, 0/30 on linear Gaussian (predicted by theory), 0/10 on the non-injective quadratic (predicted by theory). 96% on 108 real-world Tübingen pairs from one hyperparameter configuration, with failures clustering at near-zero CCA scores where the method is appropriately uncertain. the paper reports both accuracy and AUC on the Tübingen benchmark. the AUC is 0.96, matching the accuracy figure, meaning the CCA score is well-calibrated as a ranking signal: pairs with stronger CCA scores are more likely to be correct, not just pairs above a binary threshold.
the gaps: no injectivity screening test, no hyperparameter sensitivity analysis on the benchmark, limited empirical validation of the full CCL pipeline, asymptotic-only theoretical guarantees throughout. the CCL framework has five supporting theorems, all proved, and one synthetic convergence experiment confirming the alternating loop.
the multivariate extension, real-world causal benchmark validation beyond Tübingen, and rung-3 counterfactual reasoning are the primary remaining steps.
research: “Causal Direction from Convergence Time: Faster Training in the True Causal Direction,” Abdulrahman Tamim, February 2026.
References
[1] J. Pearl, “Causal diagrams for empirical research,” Biometrika, vol. 82, no. 4, pp. 669-688, 1995.
[2] J. Pearl, Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.
[3] Y. Huang and M. Valtorta, “Pearl's calculus of intervention is complete,” in Proc. UAI, 2006.
[4] R. Solomonoff, “A formal theory of inductive inference,” Information and Control, vol. 7, no. 1, pp. 1-22, 1964.
[5] A. N. Kolmogorov, “Three approaches to the quantitative definition of information,” Problems of Information Transmission, vol. 1, no. 1, pp. 1-7, 1965.
[6] J. Rissanen, “Modeling by shortest data description,” Automatica, vol. 14, no. 5, pp. 465-471, 1978.
[7] L. G. Valiant, “A theory of the learnable,” Commun. ACM, vol. 27, no. 11, pp. 1134-1142, 1984.
[8] S. Hanneke, “The optimal sample complexity of PAC learning,” JMLR, vol. 17, no. 38, pp. 1-15, 2016.
[9] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” arXiv:physics/0004057, 2000.
[10] F. N. F. Q. Simoes, M. Dastani, and T. van Ommen, “The causal information bottleneck and optimal causal variable abstractions,” arXiv:2410.00535, 2024.
[11] E. Bareinboim, J. Zhang, and S. Lee, “An introduction to causal reinforcement learning,” Foundations and Trends in Machine Learning, vol. 17, no. 3, pp. 304-589, 2024.
[12] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, “Deep variational information bottleneck,” in Proc. ICLR, 2017.
[13] D. Kaltenpoth and J. Vreeken, “Causal discovery with hidden confounders using the algorithmic Markov condition,” in Proc. UAI, 2023.
[14] P. Grünwald, The Minimum Description Length Principle. MIT Press, 2007.
[15] G. Schwarz, “Estimating the dimension of a model,” Ann. Stat., vol. 6, no. 2, pp. 461-464, 1978.
[16] P. Bühlmann, J. Peters, and J. Ernest, “CAM: Causal additive models, high-dimensional order search and penalized regression,” Ann. Stat., vol. 42, no. 6, pp. 2526-2556, 2014.
[17] D. Janzing and B. Schölkopf, “Causal inference using the algorithmic Markov condition,” IEEE Trans. Inf. Theory, vol. 56, no. 10, 2010.
[18] P. Spirtes, C. Glymour, and R. Scheines, Causation, Prediction, and Search, 2nd ed. MIT Press, 2000.
[19] D. M. Chickering, “Optimal structure identification with greedy search,” JMLR, vol. 3, pp. 507-554, 2002.
[20] M. Nazaret and D. Blei, “XGES: Extremely greedy equivalence search for causal discovery,” in Proc. ICML, 2024.
[21] D. Colombo and M. H. Maathuis, “Order-independent constraint-based causal structure learning,” JMLR, vol. 15, no. 1, pp. 3741-3782, 2014.
[22] I. Shpitser and J. Pearl, “Identification of joint interventional distributions in recursive semi-Markovian causal models,” in Proc. AAAI, 2006.
[23] J. Peters, J. M. Mooij, D. Janzing, and B. Schölkopf, “Causal discovery with continuous additive noise models,” JMLR, vol. 15, pp. 2009-2053, 2014.
[24] Y. Lin, Y. Huang, W. Liu, H. Deng, I. Ng, K. Zhang, M. Gong, Y.-A. Ma, and B. Huang, “A skewness-based criterion for addressing heteroscedastic noise in causal discovery,” in Proc. ICLR, 2025.
[25] J. M. Mooij, J. Peters, D. Janzing, J. Zscheischler, and B. Schölkopf, “Distinguishing cause from effect using observational data: Methods and benchmarks,” JMLR, vol. 17, no. 32, pp. 1-102, 2016.
[26] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proc. COMPSTAT, pp. 177-186, 2010.
[27] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Proc. NeurIPS, 1999.
[28] OpenAI, “GPT-4 technical report,” arXiv:2303.08774, 2023.
[29] DeepSeek, “DeepSeek-V3 technical report,” arXiv:2412.19437, 2024.
[30] W. I. Zangwill, Nonlinear Programming: A Unified Approach. Prentice-Hall, 1969.
[31] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, “An introduction to variational methods for graphical models,” Machine Learning, vol. 37, pp. 183-233, 1999.
[32] B. Yu, “Rates of convergence for empirical processes of stationary mixing sequences,” Ann. Probab., vol. 22, no. 1, pp. 94-116, 1994.