General ML

Autoencoders

Learn what matters by being forced to throw the rest away

01 · First principles: A label from nowhere

We want useful representations of unlabelled data — the central wish of unsupervised learning. The autoencoder's move is to manufacture a supervised task out of thin air: the input is its own label. An encoder maps x to a code z; a decoder maps z back to a reconstruction x̂; train both to make x̂ ≈ x:

z = enc_θ(x),   x̂ = dec_φ(z),   min_{θ,φ} Σ ‖x − x̂‖²

The representation z is not the target of the loss at all — it is a side effect, the residue left at the network's waist by the effort of reconstruction. That indirection is the whole design.
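The objective above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the toy data, the single tanh encoder layer, the linear decoder, the learning rate, and the helper names `recon_loss` and `train` are all assumptions made for the example.

```python
import numpy as np

def enc(x, W_enc):
    """Encoder: map inputs x to codes z (one tanh layer, an illustrative choice)."""
    return np.tanh(x @ W_enc)

def dec(z, W_dec):
    """Decoder: map codes z back to reconstructions x_hat (linear here)."""
    return z @ W_dec

def recon_loss(x, W_enc, W_dec):
    """Mean over examples of ||x - x_hat||^2. The target is x itself."""
    x_hat = dec(enc(x, W_enc), W_dec)
    return float(np.mean(np.sum((x - x_hat) ** 2, axis=1)))

def train(x, d_code, steps=1000, lr=0.01, seed=0):
    """Full-batch gradient descent on the reconstruction loss."""
    rng = np.random.default_rng(seed)
    W_enc = rng.normal(scale=0.1, size=(x.shape[1], d_code))
    W_dec = rng.normal(scale=0.1, size=(d_code, x.shape[1]))
    for _ in range(steps):
        z = enc(x, W_enc)
        g = 2.0 * (dec(z, W_dec) - x) / len(x)              # dLoss/dx_hat
        grad_dec = z.T @ g
        grad_enc = x.T @ ((g @ W_dec.T) * (1.0 - z ** 2))   # back through tanh
        W_dec = W_dec - lr * grad_dec
        W_enc = W_enc - lr * grad_enc
    return W_enc, W_dec

# Toy data (assumed): 200 points in 8-d that lie on a 2-d subspace.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 8))
X = X / X.std()

loss_before = recon_loss(X, *train(X, d_code=2, steps=0))   # untrained weights
W_enc, W_dec = train(X, d_code=2)
loss_after = recon_loss(X, W_enc, W_dec)
Z = enc(X, W_enc)   # the representation: never a target, only a by-product
```

Note that `Z` appears nowhere in the loss; it is simply read off the trained encoder afterwards.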

02 · Failure first: Identity is a perfect, useless solution

Stated naively, the task is trivial: if z has at least the dimension of x, the network can learn the identity map — copy in, copy out, zero loss, nothing learned. A perfect score and a worthless representation. So the real design question of autoencoders is not the loss but the constraint: what stops copying?
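The copying failure is easy to reproduce. The sketch below assumes a purely linear encoder and decoder and uses random noise as the data, precisely because noise has no structure worth learning; the weight names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))          # pure noise: nothing to learn

# Code dimension equal to the input dimension: identity weights copy x through.
W_enc = np.eye(6)
W_dec = np.eye(6)
copy_loss = np.sum((X - (X @ W_enc) @ W_dec) ** 2)

# A wider code (9-d) fails the same way: pad with zeros going in,
# drop the padding coming out.
W_enc_wide = np.hstack([np.eye(6), np.zeros((6, 3))])   # 6 -> 9
W_dec_wide = W_enc_wide.T                                # 9 -> 6
wide_loss = np.sum((X - (X @ W_enc_wide) @ W_dec_wide) ** 2)
```

Both losses are zero even though the data is noise, so a zero reconstruction loss on its own says nothing about whether the code captured any structure.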

[Figure: x (784-d) → encoder → z (16-d, the waist) → decoder → x̂ ≈ x]

784 numbers in, 16 allowed through, 784 demanded back. Whatever fits through the waist is, by force, the essence.

The anchor fact: a linear autoencoder with squared loss and a k-dimensional code learns, at its optimum, exactly the subspace spanned by the top k principal components of the (centred) data. That is PCA's subspace: not necessarily its orthogonal axes, but the same span. Autoencoders are nonlinear PCA; PCA is the autoencoder you can solve in closed form. This is the calibration point for everything else.
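The "same span, not the same axes" point can be checked numerically. The sketch below builds the PCA solution with an SVD, then remixes the code with an arbitrary invertible matrix `A` (a hypothetical choice for illustration) to get a different linear autoencoder with identical reconstructions. The toy data and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)                  # PCA assumes centred data
k = 2

# PCA via SVD: rows of Vt are principal axes, ordered by singular value.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k].T                          # shape (5, k), orthonormal columns

# PCA as an autoencoder: encode with V_k, decode with its transpose.
X_pca = (X @ V_k) @ V_k.T
pca_err = np.sum((X - X_pca) ** 2)      # equals the discarded S[k:]**2

# Remix the code with any invertible k x k matrix: the axes change,
# the span (and hence every reconstruction) does not.
A = np.array([[2.0, 1.0],
              [0.5, 3.0]])
W_enc = V_k @ A
W_dec = np.linalg.inv(A) @ V_k.T
X_ae = (X @ W_enc) @ W_dec
ae_err = np.sum((X - X_ae) ** 2)

axes_are_orthonormal = np.allclose(W_enc.T @ W_enc, np.eye(k))
```

Here `ae_err` matches `pca_err` while `axes_are_orthonormal` is false: the loss cannot tell the two solutions apart, which is why gradient training recovers the subspace but not necessarily the principal axes themselves.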

03 · The fertile variant: Denoising: predict clean from corrupted

The denoising autoencoder changes one line: corrupt the input (mask pixels, add Gaussian noise, zero random features) and train to recover the uncorrupted original:

x̃ = corrupt(x),    min  ‖x − dec(enc(x̃))‖²

This is a stronger demand than reconstruction. To fill in what was destroyed, the model must learn the dependencies between parts — what typically co-occurs with what — which is to say the shape of the data distribution around each point. Geometrically, it learns to push corrupted points back toward the data manifold.
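A small numerical sketch of the denoising objective follows. To keep it closed-form, a single linear map `W` fitted by least squares stands in for dec(enc(·)); that substitution, the toy data, and the `noise_std` and `mask_prob` parameters of `corrupt` are assumptions for illustration, not part of the notes.

```python
import numpy as np

def corrupt(x, rng, noise_std=0.5, mask_prob=0.0):
    """Add Gaussian noise, then zero a random fraction of the features."""
    x_tilde = x + noise_std * rng.normal(size=x.shape)
    keep = rng.random(x.shape) >= mask_prob
    return x_tilde * keep

rng = np.random.default_rng(0)

# Clean data (assumed): 500 points on a 2-d subspace of a 10-d space.
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 10))
X_tilde = corrupt(X, rng)

# Denoising objective: the input is the corrupted X_tilde,
# the target is the clean X. Solve min_W ||X - X_tilde W||^2.
W, *_ = np.linalg.lstsq(X_tilde, X, rcond=None)

err_noisy = np.mean(np.sum((X - X_tilde) ** 2, axis=1))         # before
err_denoised = np.mean(np.sum((X - X_tilde @ W) ** 2, axis=1))  # after
```

Because the clean data occupies only two directions, the fitted map learns to suppress noise in the other eight, so `err_denoised` comes out below `err_noisy`: corrupted points are moved back toward the subspace the data lives on.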

It is hard to overstate where this one idea went. Scale it up and iterate the noise levels and you have diffusion models (denoise from pure noise, step by step — see the generative notes for the probabilistic framing). Apply it to tokens instead of pixels and "corrupt then reconstruct" is exactly masked-language-model pretraining (BERT), and next-token prediction is its causal cousin. The denoising objective is arguably the most consequential trick in self-supervised learning.

04 · The latent space: And its holes

After training, the encoder is the useful artifact: z is a compact feature vector for downstream classifiers, similarity search, and anomaly detection (reconstruction error flags points the model never learned to compress; off-manifold inputs reconstruct badly). It is tempting to go one step further and treat the decoder as a generator: pick a z, decode, get a new sample.
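Reconstruction error as an anomaly score can be sketched as follows. For brevity, the closed-form linear autoencoder (PCA) stands in for a trained network; the toy data, the 99% quantile threshold, and the name `recon_error` are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data (assumed): near a 2-d subspace of 10-d, with a little noise.
basis = rng.normal(size=(2, 10))
X_train = rng.normal(size=(500, 2)) @ basis + 0.05 * rng.normal(size=(500, 10))
mean = X_train.mean(axis=0)

# Closed-form linear autoencoder (PCA) as a stand-in for a trained model.
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
V_k = Vt[:2].T

def recon_error(x):
    """Squared reconstruction error per row: the anomaly score."""
    z = (x - mean) @ V_k            # encode
    x_hat = z @ V_k.T + mean        # decode
    return np.sum((x - x_hat) ** 2, axis=-1)

# Flag anything that reconstructs worse than 99% of the training data.
threshold = np.quantile(recon_error(X_train), 0.99)

on_manifold = rng.normal(size=(1, 2)) @ basis      # same structure as training
off_manifold = 3.0 * rng.normal(size=(1, 10))      # no such structure

is_anomaly_on = recon_error(on_manifold)[0] > threshold
is_anomaly_off = recon_error(off_manifold)[0] > threshold
```

The threshold is a policy choice: a quantile of the training errors fixes the false-alarm rate on in-distribution data, but the right level depends on the application.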

This fails, instructively. The plain autoencoder is only ever trained on the z-values of actual training points, and nothing in the objective organises the space between them. The latent space ends up with holes: regions no training point ever mapped to, where the decoder's output is unconstrained garbage. Codes for similar inputs need not even be near each other.

The VAE in one line: make the encoder output a distribution q(z|x) rather than a point, and penalise its divergence from a fixed prior (typically a standard normal). The penalty forces the codes to fill the space densely and smoothly, so that every z in the prior's high-density region, near the origin for a standard normal, decodes to something sensible. Derivation and the ELBO live in VAE & ELBO.
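The two ingredients of that one line can be sketched directly. This assumes the common choice of a diagonal-Gaussian q(z|x) and a standard normal prior, which the notes do not fix; the function names are illustrative.

```python
import numpy as np

def sample_code(mu, log_var, rng):
    """Draw z ~ q(z|x) = N(mu, diag(exp(log_var))) as z = mu + sigma * eps."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over code dimensions."""
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var, axis=-1)

rng = np.random.default_rng(0)

# Two hypothetical encoder outputs for a 2-d code:
# row 0 matches the prior exactly, row 1 is a narrow blob far from the origin.
mu = np.array([[0.0, 0.0],
               [2.0, -1.0]])
log_var = np.array([[0.0, 0.0],
                    [-2.0, -2.0]])

z = sample_code(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)   # zero for row 0, positive for row 1
```

Added to the reconstruction loss, this penalty charges the encoder for placing codes far from the origin or making them too narrow, which is the pressure that closes the holes.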

05 · Placement: What autoencoders are for now

| Use | Status | Note |
| --- | --- | --- |
| Dimensionality reduction / visualisation | solid | Nonlinear PCA when PCA's linear subspace is not enough. |
| Anomaly detection | solid | Reconstruction error as an off-manifold alarm; a production workhorse. |
| Compression for generative pipelines | central | Latent diffusion runs diffusion inside a learned autoencoder's z-space; the bottleneck is what makes it affordable. |
| Direct generation | superseded | Holes in the latent space; use VAEs, diffusion, or flows. |
| General representation pretraining | mostly superseded | Contrastive and masked-prediction objectives learn stronger features than pure reconstruction. |
Mental Model