For LeJEPA, an extra predictor network is optional (it is not needed for this demo).
The 3D plot shows the first three embedding dimensions. Data manifolds affect only the first 2–3 dimensions; the remaining dimensions are uniform.
by Scott Hawley. cf. the LeJEPA paper. Blog posts on SSL (2021): 1, 2, 3; and BYOL (2022).
In Self-Supervised Learning, we train models to map multiple "views" (e.g., jittered versions) of the same object to the same location in embedding space. Without a counter-force, models find a trivial "shortcut": they collapse representations into a lower-dimensional subspace—a point, a line, or a plane. This dimensional collapse discards nearly all the information required for downstream tasks.
The LeJEPA paper provides a mathematical proof that the Isotropic Gaussian is the unique optimal distribution for minimizing risk in both linear and nonlinear probing—i.e. maximizing utility of the embeddings for downstream classification-like tasks. It represents the maximum entropy state for a fixed variance, ensuring no single dimension dominates the representation.
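For reference, the maximum-entropy claim can be stated precisely: among all distributions on \(\mathbb{R}^d\) whose covariance is \(\sigma^2 I_d\), the differential entropy satisfies
\[
h(X) \;\le\; \frac{d}{2}\,\ln\!\left(2\pi e\,\sigma^{2}\right),
\]
with equality exactly when \(X \sim \mathcal{N}(\mu,\,\sigma^2 I_d)\) — the isotropic Gaussian.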
Implementing a multivariate Gaussian test in high dimensions is notoriously difficult and computationally expensive, with traditional tests often scaling quadratically (\(O(N^2)\)) or worse with the number of samples. SIGReg (Sketched Isotropic Gaussian Regularization) bypasses this by projecting embeddings onto random 1D directions and applying the Epps-Pulley test to each projection, maintaining linear (\(O(N)\)) complexity.
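To make the projection step concrete, here is a minimal PyTorch sketch (the function name `random_projections` and the `num_directions` parameter are illustrative choices, not from the paper). Each row of the output is a 1D sample to which the test statistic defined next is applied:

```python
import torch

def random_projections(embeddings: torch.Tensor, num_directions: int = 64) -> torch.Tensor:
    """Project a batch of D-dim embeddings onto random unit directions.

    embeddings: (N, D) tensor. Returns a (num_directions, N) tensor whose
    rows are 1D samples, one per random direction on the unit sphere.
    """
    N, D = embeddings.shape
    dirs = torch.randn(num_directions, D, device=embeddings.device)
    dirs = dirs / dirs.norm(dim=1, keepdim=True)  # unit-norm directions
    # The Epps-Pulley test is then applied independently to each row.
    return dirs @ embeddings.T
```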
The test statistic \(T_{EP}\) measures the distance between the empirical characteristic function and \(e^{-t^2/2}\), the characteristic function of the standard normal. It is defined as:
\[
T_{EP} \;=\; N \int_{-\infty}^{\infty} \left| \hat{\phi}(t) - e^{-t^{2}/2} \right|^{2} \frac{e^{-t^{2}/2}}{\sqrt{2\pi}} \, dt,
\]
where \(\hat{\phi}(t)\) is the Empirical Characteristic Function (the Fourier transform of the empirical sample distribution), given by:
\[
\hat{\phi}(t) \;=\; \frac{1}{N} \sum_{n=1}^{N} e^{\,i t x_n},
\]
with \(x_1,\dots,x_N\) the 1D projected embeddings.
By operating on the ECF, we get a bounded loss with stable, non-zero gradients even when the data is highly clustered. Furthermore, because the complex exponentials are bounded, the gradients are naturally constrained—effectively giving us gradient clipping "for free" without manual heuristics.
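A hedged PyTorch sketch of this loss, built from the equations above with simple trapezoidal quadrature over a finite \(t\) grid (the grid size and range are illustrative choices, not the paper's):

```python
import math
import torch

def epps_pulley_loss(x: torch.Tensor, num_t: int = 65, t_max: float = 5.0) -> torch.Tensor:
    """Differentiable Epps-Pulley statistic for a 1D sample x of shape (N,).

    Compares the empirical characteristic function (ECF) of x with the
    standard normal CF e^{-t^2/2}, under a standard-normal weight, using
    trapezoidal quadrature over t in [-t_max, t_max].
    """
    N = x.shape[0]
    t = torch.linspace(-t_max, t_max, num_t, device=x.device)
    tx = t[None, :] * x[:, None]           # (N, num_t)
    ecf_re = torch.cos(tx).mean(dim=0)     # Re phi_hat(t)
    ecf_im = torch.sin(tx).mean(dim=0)     # Im phi_hat(t); bounded, so gradients stay tame
    target = torch.exp(-0.5 * t**2)        # CF of N(0,1), real-valued
    sq_err = (ecf_re - target) ** 2 + ecf_im ** 2
    weight = torch.exp(-0.5 * t**2) / math.sqrt(2.0 * math.pi)
    return N * torch.trapezoid(sq_err * weight, t)
```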
The model achieves meaningful representations by incorporating two distinct mathematical effects into a single objective function:
\[
\mathcal{L} \;=\; \mathcal{L}_{attr} \;+\; \lambda \, \mathcal{L}_{\rm SIGReg}.
\]
Here, \(\mathcal{L}_{attr}\) (Attraction) pulls views of the same object together. To prevent collapse, \(\mathcal{L}_{\rm SIGReg} = T_{EP}\) (Regularization), averaged over the random 1D projections, applies the Epps-Pulley test to force the overall distribution toward an isotropic Gaussian. The parameter \(\lambda\) (set to 100 in this demo) controls the relative strength of these two effects.
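Putting the pieces together, a toy version of the full objective might look like the following sketch (the mean-squared-error attraction term is an assumption for illustration; `random_projections` and `epps_pulley_loss` are the hypothetical helpers sketched above):

```python
import torch
import torch.nn.functional as F

def lejepa_style_loss(z1: torch.Tensor, z2: torch.Tensor,
                      lam: float = 100.0, num_directions: int = 64) -> torch.Tensor:
    """Toy combined objective: attraction plus SIGReg-style regularization.

    z1, z2: (N, D) embeddings of two views of the same batch of objects.
    lam: weight on the regularizer (lambda = 100 in this demo).
    """
    attr = F.mse_loss(z1, z2)  # pull paired views together
    z = torch.cat([z1, z2], dim=0)
    projs = random_projections(z, num_directions)   # (num_directions, 2N)
    sigreg = torch.stack([epps_pulley_loss(p) for p in projs]).mean()
    return attr + lam * sigreg
```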