
Autoencoders

Learn what matters by being forced to throw the rest away

01 · First principles · A label from nowhere

We want useful representations of unlabelled data — the central wish of unsupervised learning. The autoencoder's move is to manufacture a supervised task out of thin air: the input is its own label. An encoder maps x to a code z; a decoder maps z back to a reconstruction x̂; train both to make x̂ ≈ x:

z = enc_θ(x),   x̂ = dec_φ(z),   min_{θ,φ} Σ_x ‖x − x̂‖²

The representation z is not the target of the loss at all — it is a side effect, the residue left at the network's waist by the effort of reconstruction. That indirection is the whole design.
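As a minimal sketch of that objective, assuming a one-hidden-layer tanh encoder, a linear decoder, and plain full-batch gradient descent in NumPy. The toy data, layer sizes, learning rate, and function names are all illustrative choices, not anything the text prescribes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 256 points in 8-D that depend on only 2 latent factors.
u = rng.normal(size=(256, 2))
X = np.tanh(u @ rng.normal(size=(2, 8)))
X -= X.mean(axis=0)

d, k = X.shape[1], 2  # input width and code width

# theta = (W1, b1) parameterises the encoder, phi = (W2, b2) the decoder.
W1 = 0.1 * rng.normal(size=(d, k))
b1 = np.zeros(k)
W2 = 0.1 * rng.normal(size=(k, d))
b2 = np.zeros(d)

def encode(x, W1, b1):
    return np.tanh(x @ W1 + b1)

def decode(z, W2, b2):
    return z @ W2 + b2

def loss_and_grads(X, W1, b1, W2, b2):
    n = X.shape[0]
    Z = encode(X, W1, b1)
    X_hat = decode(Z, W2, b2)
    err = X_hat - X
    loss = (err ** 2).sum(axis=1).mean()      # mean over points of ||x - x_hat||^2
    d_xhat = 2.0 * err / n                    # dLoss/dX_hat
    gW2 = Z.T @ d_xhat
    gb2 = d_xhat.sum(axis=0)
    d_pre = (d_xhat @ W2.T) * (1.0 - Z ** 2)  # backprop through tanh
    gW1 = X.T @ d_pre
    gb1 = d_pre.sum(axis=0)
    return loss, (gW1, gb1, gW2, gb2)

lr = 0.02
losses = []
for step in range(1000):
    loss, grads = loss_and_grads(X, W1, b1, W2, b2)
    losses.append(loss)
    for p, g in zip((W1, b1, W2, b2), grads):
        p -= lr * g  # in-place gradient step on both encoder and decoder

codes = encode(X, W1, b1)  # the representation, read off after training
```

Note that z never appears in the loss: the only target is x itself, and `codes` is simply read off the waist once training is done.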

02 · Failure first · Identity is a perfect, useless solution

Stated naively, the task is trivial: if z has at least the dimension of x, the network can learn the identity map — copy in, copy out, zero loss, nothing learned. A perfect score and a worthless representation. So the real design question of autoencoders is not the loss but the constraint: what stops copying?

The bottleneck: make dim(z) ≪ dim(x). Now copying is impossible; the encoder must triage. To reconstruct well through a narrow waist, it must spend its few dimensions on whatever explains the most variation in the data — structure survives, incidental detail is discarded. Compression is forced understanding.
Noise: corrupt the input, demand the clean original (next section). Copying the input now reproduces the corruption and is penalised.
Sparsity: let z be wide but penalise activation (L1 on z), so only a few units may fire per input — a learned dictionary rather than a compression.

784 numbers in, 16 allowed through, 784 demanded back. Whatever fits through the waist is, by force, the essence.
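The three constraints can be read as three small changes to one objective. The sketch below is only an illustration: `objective`, `copy`, `add_noise`, and the crude "keep the first 16 coordinates" bottleneck are hypothetical helpers, and the data is random rather than real images. It plugs the identity map in as both encoder and decoder to show that plain reconstruction rewards copying while each constraint penalises it.

```python
import numpy as np

def objective(x, enc, dec, corrupt=None, l1_weight=0.0):
    """Mean squared reconstruction error with optional noise and sparsity constraints."""
    x_in = corrupt(x) if corrupt is not None else x   # noise: the encoder sees x-tilde
    z = enc(x_in)
    x_hat = dec(z)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))  # always scored against clean x
    sparsity = l1_weight * np.mean(np.sum(np.abs(z), axis=1))  # L1 penalty on the code
    return recon + sparsity

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 784))

def copy(a):
    return a  # the identity "network"

def add_noise(a):
    return a + 0.3 * rng.normal(size=a.shape)

def keep_16(a):
    return a[:, :16]  # crude bottleneck: only 16 numbers get through

def pad_back(z):
    return np.pad(z, ((0, 0), (0, 784 - 16)))  # decoder that can only guess zeros

plain = objective(x, copy, copy)                     # copying is perfect here
noisy = objective(x, copy, copy, corrupt=add_noise)  # copying reproduces the noise
sparse = objective(x, copy, copy, l1_weight=0.01)    # copying pays for every active unit
narrow = objective(x, keep_16, pad_back)             # copying cannot fit through the waist
```

Only `plain` comes out as zero; the other three are strictly positive, so in each case the identity stops being the best available answer.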

The anchor fact: a linear autoencoder with a k-dimensional code, trained to the optimum of squared loss on centred data, learns exactly the subspace spanned by the top k principal components. That is PCA's subspace: not necessarily its orthogonal axes, but the same span. Autoencoders are nonlinear PCA; PCA is the autoencoder you can solve in closed form. This is the calibration point for everything else.
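A small numerical check of the anchor fact, offered as a sketch rather than a proof. It assumes synthetic data with three strong directions, and it fits the linear autoencoder by alternating least squares instead of gradient descent, because each half-step then has an exact solution. The names `E`, `D`, and `projector` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 10, 3

# Hypothetical data: three strong directions in 10-D plus a little isotropic noise.
X = (rng.normal(size=(n, k)) * [3.0, 2.0, 1.0]) @ rng.normal(size=(k, d))
X += 0.05 * rng.normal(size=(n, d))
X -= X.mean(axis=0)

# Linear autoencoder: z = x E, x_hat = z D, squared loss ||X - X E D||^2.
E = rng.normal(size=(d, k))
for _ in range(50):
    D, *_ = np.linalg.lstsq(X @ E, X, rcond=None)  # best decoder for the current encoder
    E = np.linalg.pinv(D)                          # best encoder for the current decoder

def projector(basis_rows):
    """Orthogonal projector onto the span of the given row vectors."""
    Q, _ = np.linalg.qr(basis_rows.T)
    return Q @ Q.T

# PCA subspace: top-k right singular vectors of the centred data.
_, _, Vt = np.linalg.svd(X, full_matrices=False)

subspace_gap = np.linalg.norm(projector(D) - projector(Vt[:k]))
decoder_gram = D @ D.T  # generally not the identity: same span, different axes
```

`subspace_gap` compares the two projectors, so it measures the span and ignores the choice of basis inside it. The rows of `D` need not be orthonormal, which is the "not necessarily its orthogonal axes" caveat in concrete form.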

03 · The fertile variant · Denoising: predict clean from corrupted

The denoising autoencoder changes one line: corrupt the input (mask pixels, add Gaussian noise, zero random features) and train to recover the uncorrupted original:

x̃ = corrupt(x), min ‖x − dec(enc(x̃))‖²

This is a stronger demand than reconstruction. To fill in what was destroyed, the model must learn the dependencies between parts — what typically co-occurs with what — which is to say the shape of the data distribution around each point. Geometrically, it learns to push corrupted points back toward the data manifold.
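To make the "push back toward the manifold" picture concrete, here is a deliberately simple sketch. It assumes data lying on a 3-D subspace of a 10-D space and uses a single least-squares matrix `W` as a linear stand-in for dec(enc(·)), with no bottleneck, so noise is the only constraint. The corruption functions and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data manifold: a 3-D subspace of a 10-D space.
X = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 10))

def gaussian_corrupt(x, rng, noise_std=0.5):
    """Additive Gaussian noise."""
    return x + noise_std * rng.normal(size=x.shape)

def mask_corrupt(x, rng, drop_prob=0.3):
    """Zero out random features (shown for completeness, not used below)."""
    return x * (rng.random(x.shape) >= drop_prob)

# Training pair: corrupted input, clean target.
X_tilde = gaussian_corrupt(X, rng)

# Linear stand-in for dec(enc(.)): the W minimising ||X - X_tilde W||^2.
W, *_ = np.linalg.lstsq(X_tilde, X, rcond=None)

# Evaluate on a fresh corruption of the same points.
X_fresh = gaussian_corrupt(X, rng)
mse_copy = np.mean((X - X_fresh) ** 2)          # what copying the input would score
mse_denoised = np.mean((X - X_fresh @ W) ** 2)  # what the learned map scores
```

The learned map beats copying because it has absorbed which directions the data actually occupies, and it shrinks the noise components that point away from them.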

It is hard to overstate where this one idea went. Scale it up and iterate over noise levels and you have diffusion models (denoise from pure noise, step by step; see the generative notes for the probabilistic framing). Apply it to tokens instead of pixels and "corrupt then reconstruct" is the recipe of masked-language-model pretraining (BERT), with cross-entropy over the masked tokens in place of squared error; next-token prediction is its causal cousin. The denoising objective is arguably the most consequential trick in self-supervised learning.

04 · The latent space · And its holes

After training, the encoder is the useful artifact: z is a compact feature vector for downstream classifiers, similarity search, and anomaly detection (reconstruction error flags points the model never learned to compress, because off-manifold inputs reconstruct badly). It is tempting to go one step further and treat the decoder as a generator: pick a z, decode, get a new sample.
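The anomaly-detection use can be sketched with the simplest autoencoder available, the closed-form linear one (PCA). The data, the 99th-percentile threshold, and the function name `reconstruction_error` are illustrative assumptions; a nonlinear autoencoder would slot into the same encode-decode-compare pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 10, 3

# Hypothetical training data: a 3-D subspace of a 10-D space plus small noise.
A = rng.normal(size=(k, d))
X_train = rng.normal(size=(1000, k)) @ A + 0.05 * rng.normal(size=(1000, d))

# Closed-form linear autoencoder: project onto the top-k principal directions.
mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
V = Vt[:k].T

def reconstruction_error(x):
    z = (x - mean) @ V      # encode
    x_hat = z @ V.T + mean  # decode
    return np.sum((x - x_hat) ** 2, axis=-1)

# Illustrative alarm level: the 99th percentile of training errors.
threshold = np.quantile(reconstruction_error(X_train), 0.99)

on_manifold = rng.normal(size=(1, k)) @ A     # follows the training structure
off_manifold = 3.0 * rng.normal(size=(1, d))  # ignores it

err_on = reconstruction_error(on_manifold)[0]
err_off = reconstruction_error(off_manifold)[0]
is_anomaly = err_off > threshold
```

The threshold is a tuning knob, not part of the method: the percentile sets how many ordinary points get flagged, so it should follow the tolerable false-alarm rate.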

This fails, instructively. The plain autoencoder is only ever trained on the z-values of actual training points, and nothing in the objective organises the space between them. The latent space ends up with holes: regions no training point ever mapped to, where the decoder's output is unconstrained garbage. Codes for similar inputs need not even be near each other.

The VAE in one line: make the encoder output a distribution q(z|x) rather than a point, and penalise its divergence from a fixed prior (typically a standard normal). That forces the codes to fill the space densely and smoothly, so that any z with reasonable probability under the prior decodes to something sensible. Derivation and the ELBO live in VAE & ELBO.
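A sketch of the penalty term alone, assuming the common choices of a diagonal Gaussian q(z|x) and a standard normal prior, neither of which the text pins down. Under those assumptions the divergence has a closed form, KL = ½ Σ (μ² + σ² − 1 − log σ²). The function names are illustrative.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var, axis=-1)

def sample_code(mu, log_var, rng):
    """Draw z ~ q(z|x) as mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)

# Two hypothetical encoder outputs for a 2-D latent space.
mu = np.array([[0.0, 0.0],       # code distribution that matches the prior
               [3.0, -3.0]])     # far from the origin and nearly a point
log_var = np.array([[0.0, 0.0],
                    [-4.0, -4.0]])

kl = kl_to_standard_normal(mu, log_var)
z = sample_code(mu, log_var, rng)
```

The first row pays no penalty; the second pays heavily, both for sitting far from the origin and for collapsing toward a point. That pressure is what closes the holes a plain autoencoder leaves.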

05 · Placement · What autoencoders are for now

Use	Status	Note
Dimensionality reduction / visualisation	solid	Nonlinear PCA when PCA's linear subspace is not enough.
Anomaly detection	solid	Reconstruction error as an off-manifold alarm; a production workhorse.
Compression for generative pipelines	central	Latent diffusion runs diffusion inside a learned autoencoder's z-space; the bottleneck is what makes it affordable.
Direct generation	superseded	Holes in the latent space; use VAEs, diffusion, or flows.
General representation pretraining	mostly superseded	Contrastive and masked-prediction objectives learn stronger features than pure reconstruction.

Mental Model

An autoencoder turns unlabelled data into a supervised task: the input is its own label, and the representation is the side effect.
Unconstrained, the perfect solution is the identity; the constraint (bottleneck, noise, sparsity) is the actual model design.
Linear AE = PCA's subspace — the closed-form anchor for the whole family.
Denoising (predict clean from corrupted) is the seed of diffusion models and masked-LM pretraining.
The plain latent space has holes, so it cannot generate; the VAE fixes this by making codes distributions tied to a prior.