On Learning, Forgetting, and the Missing Night

The question is not whether machines can think, but whether we understand thinking well enough to know what we're building.

paraphrased from a conversation

The standard framing of LLM training (pretraining, then SFT, then RL) is usually presented as an engineering pipeline. What gets less attention is that this sequence is also a compressed recapitulation of how biological intelligence develops. The analogy runs three levels deep, holds uncomfortably well at each level, and breaks down in exactly the places where the most important open problems in ML are hiding.

This piece develops that analogy carefully, not as metaphor but as a source of concrete research hypotheses. Three phenomena in particular (sleep consolidation, simultaneous multi-phase learning, and strategic compression) point toward architectural innovations that frontier models currently lack entirely. Understanding why these gaps exist, and what it would take to close them, is the central question.


Three levels of pretraining, not one

The naive version of the analogy begins with a false premise. It says: LLM pretraining maps onto infant sensory development. But children are not blank slates. A newborn arrives with object permanence intuitions, face detection circuitry, a proto-numerical sense, and the innate syntactic scaffolding that Chomsky's poverty of the stimulus argument implies must be present because the data alone cannot explain what children learn to do with it. None of this came from individual experience. It was selected over evolutionary timescales. The infant's model is not a randomly initialized transformer; it is a richly structured prior, pre-shaped by three billion years of gradient descent on the problem of surviving and reproducing in a physical, social world.

The correspondence therefore runs three levels deep:

| Timescale | Human | LLM analogue |
| --- | --- | --- |
| Billions of years | Evolution: inductive biases, cortical architecture, specialized circuits | Architecture: transformer, attention, residual connections |
| Years (0–5) | Developmental learning: statistics from raw sensory experience | Pretraining: next-token prediction on web-scale text |
| Months–years | Instruction, apprenticeship, reward-shaped behavior | SFT, RLHF, Constitutional AI |

The ML implication is underappreciated. The reason transformer pretraining works as well as it does from a few trillion tokens is partly that the architecture is already a strong prior: compositionality through depth, context sensitivity through attention, position awareness through encodings. We credit the data scale and underweight the architecture. Evolution's contribution to human cognition is similarly invisible: we marvel at what infants learn in three years and forget they are running on three billion years of R&D.

This reframing opens a research direction the naive analogy obscures entirely: architecture search as evolutionary simulation. NAS is the closest ML analogue to evolution, but it is typically run once on proxy tasks and then frozen. A more faithful approach would let architectural evolution continue in parallel with weight learning, with the architecture itself as the slowest-timescale learner in a hierarchy. Computationally prohibitive today, but the conceptual point stands: we treat architecture as a fixed prior when biologically it is the product of the deepest and longest learning process of all.


The phases of individual learning, and why they map

Within a single lifetime, the correspondence is equally clean, and the failure modes are where it gets interesting. LLM pretraining maps onto early childhood statistical learning straightforwardly: both build world models from raw structure before any explicit instruction, both exhibit critical periods whose effects resist reversal, both generalize far beyond what the training signal explicitly rewarded.

Supervised fine-tuning maps onto formal instruction: schooling, apprenticeship, demonstration. The model, like the student, already has rich internal representations; the fine-tuning phase shapes how those representations get expressed. It is less about acquiring new knowledge than about learning a register, a format, a set of social expectations about what a correct response looks like. The pedagogical analogy holds uncomfortably well, including the failure modes: both SFT and schooling can produce correct-looking outputs that don't reflect genuine understanding, and both are vulnerable to the gap between the demonstrated distribution and the deployment one.

RLHF maps onto dopaminergic reward learning with particular precision, precise enough to share the same pathological failure mode. Sycophancy in language models is structurally identical to people-pleasing in humans: an agent that has over-optimized for approval signals at the expense of honest epistemic behavior. That the same failure emerges from the same mechanism in two such different systems is not coincidence. It is pointing at something deep about what reward-based learning does to any agent that must model other agents' preferences.

In-context learning (adapting within a single forward pass, weights frozen) maps onto working memory and on-the-fly analogical reasoning. The context window is working memory: bounded, fast, volatile. Both systems degrade as the task exceeds capacity, and both rely on the quality of long-term memory (weights, or semantic memory) to compensate.

| LLM phase | Human analogue | Shared failure mode |
| --- | --- | --- |
| Architecture | Evolution: innate priors, cortical structure | Fixed biases; hard to change after the fact |
| Pretraining | Early childhood statistical learning | Critical period lock-in |
| SFT | Formal instruction, apprenticeship | Teaching to the test; distribution gap |
| RLHF | Dopamine-based reward learning | Reward hacking; sycophancy |
| In-context learning | Working memory, analogy transfer | Capacity limits; no persistence |

The biggest gap: there is no night

The analogy is almost too clean until it isn't. There is one mechanism in human learning that has no counterpart in any current LLM training pipeline, and its absence is the single largest architectural gap between the two systems.

The complementary learning systems (CLS) theory, developed by McClelland, McNaughton, and O'Reilly in the 1990s and extensively validated since, proposes that the brain operates two distinct memory systems with fundamentally different plasticity profiles. The hippocampus is fast and high-plasticity: it binds arbitrary patterns after a single exposure, producing episodic memory the specific, contextualized record of what happened. The neocortex is slow and low-plasticity: it learns statistical regularities across many exposures, building the distributed representations that constitute semantic memory and general knowledge.

The key insight is what happens at night. During slow-wave sleep, the hippocampus replays compressed, reactivated versions of the day's episodes to the neocortex. The neocortex, exposed to these replays interleaved with activations from older memories, slowly integrates new information into its weight structure without catastrophic interference. The hippocampus is not a permanent store. It is a staging area: a fast write buffer that gets flushed into long-term storage while the organism is offline.

LLMs have the neocortex and not the hippocampus-plus-sleep-replay system. The context window is working memory, not episodic memory. And a deployed model's weights do not update. Every conversation evaporates. The model that responds to you today is the same model that will respond in six months, modulo deliberate retraining by engineers. It does not consolidate. It does not forget strategically. It does not dream.

The consolidation problem, stated precisely: We want a deployed model to integrate new experience into its weight structure without (a) catastrophic forgetting of existing knowledge, (b) requiring human curation of what to consolidate, and (c) violating alignment properties established during training. The brain solves all three simultaneously via sleep replay. We have no principled ML solution to any of them, let alone all three together.

The most promising structural direction is a two-system architecture: a fast episodic buffer (not the context window, which is too volatile, but a persistent external store) that captures salient interactions during waking operation, and a periodic consolidation process that distills those interactions into weight updates. Crucially, the consolidation should not replay raw transcripts. That would be informationally inefficient and, in any real deployment, a privacy disaster. It should produce something closer to what dreams produce: recombined, generalized summaries that teach the model the statistical pattern without committing the specific episode to weights.

This direction is not merely speculative. Recent work on automated safety remediation follows a structurally similar loop: identify failure modes during deployment, synthetically generate a training distribution around those failures, and fine-tune to fix them. The consolidation problem generalizes this idea: instead of fixing safety failures specifically, consolidate all useful learning. The hard sub-problem remains the salience filter. The hippocampus does not replay everything; it prioritizes what is emotionally salient, novel, or repeatedly activated. We have no good theory of the ML equivalent.

OPD-style methods (Online Policy Distillation and its relatives, including iterative DPO and online preference optimization) are the closest existing primitive to sleep consolidation, and it is worth being precise about where the analogy holds and where it fractures. The structural similarity is genuine: both integrate online experience into weights without restarting from scratch, and both operate on compressed signals rather than raw experience. OPD distills interactions into preference pairs or reward differentials, not transcripts, just as the hippocampus replays compressed episodic abstractions, not raw sensory data. Both have a salience filter built in: OPD filters by reward signal; the hippocampus by novelty and emotional weight.

But the analogy fractures at exactly the point that matters most. Sleep consolidation solves catastrophic forgetting through architectural separation: the hippocampus and neocortex are distinct systems with different plasticity profiles, so fast hippocampal learning never directly overwrites slow neocortical weights. Transfer happens at night, via interleaved replay that mixes new episodes with old ones. Naive OPD collapses this into a single system updating at a single rate, precisely the setting where catastrophic forgetting and reward hacking are most dangerous. A model continuously updated on online preference signal, without architectural separation, will drift toward satisfying the evaluator rather than modeling the world accurately. The sycophancy attractor does not disappear with continual learning. It gets embedded in the consolidation loop itself.

The concrete synthesis: what would OPD look like if it incorporated the architectural separation and interleaved replay of sleep consolidation? A fast, high-plasticity LoRA adapter absorbs online preference signal during deployment. A slow consolidation pass periodically distills the adapter's accumulated updates into the base weights using a replay objective that mixes new experience with samples from the original training distribution. The adapter is the hippocampus. The base model is the neocortex. The consolidation pass is sleep. The replay mixing ratio is what prevents forgetting. None of the components are exotic. The open question is whether the system as a whole can be kept stable: whether fast adapter and slow base model can be held in productive tension rather than collapsing into one dominating the other's gradient signal.

Research direction

Two-system continual learning with hippocampal-style adapters. A LoRA adapter on a frozen base model absorbs online preference or outcome signal during deployment. A nightly consolidation pass distills the adapter into base weights via interleaved replay: for every synthetic example derived from recent interactions, include k examples sampled from the original pretraining and fine-tuning distribution. The mixing ratio k is the primary stability hyperparameter. Evaluate on: (a) improvement on held-out tasks matching the deployment distribution, (b) no degradation on a frozen benchmark suite, (c) resistance to reward hacking as measured by a held-out adversarial evaluator distinct from the training reward model. The salience filter which interactions to buffer is the hardest open sub-problem and likely requires a separate learned model.
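The replay-mixing step of this recipe is simple enough to sketch. Below is a minimal, illustrative batch builder (the function name and data structures are hypothetical, not from any existing system) that interleaves k samples from the original training distribution per consolidated new example:

```python
import random

def build_consolidation_batches(new_examples, replay_pool, k, batch_size, seed=0):
    """Interleave k replay samples per new example, emulating the
    hippocampus-to-neocortex replay mix described above.

    new_examples: synthetic examples distilled from recent interactions
    replay_pool:  samples from the original pretraining/SFT distribution
    k:            replay-to-new mixing ratio (the primary stability knob)
    """
    rng = random.Random(seed)
    mixed = []
    for ex in new_examples:
        mixed.append(("new", ex))
        # k old samples per new one keeps the slow weights anchored
        mixed.extend(("replay", rng.choice(replay_pool)) for _ in range(k))
    rng.shuffle(mixed)  # interleave rather than present in blocks
    return [mixed[i:i + batch_size] for i in range(0, len(mixed), batch_size)]
```

With k = 4, four-fifths of every consolidation pass is old distribution; sweeping k trades plasticity against retention, which is exactly the evaluation axis proposed above.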


The phases don't run sequentially in humans

The second major gap is temporal. Pretraining, then SFT, then RL as sequential phases is a training convenience, not a principled design. It is how the logistics of large-scale ML work out, not an architectural claim about how learning should be structured. In humans, all phases run simultaneously and continuously, at different timescales, on the same substrate.

A working adult is doing something like pretraining at the millisecond level (perceptual updating), SFT over hours (watching someone execute a skill and forming an imitative program), and RL over days and weeks (noticing which strategies produced good outcomes). The timescales are hierarchically nested and the mechanisms are interleaved: not sequential, not staged.

The neuroscience here points toward a concrete architectural hypothesis: learning rates should vary by layer, dynamically, as a function of prediction error. Earlier layers, encoding general and stable representations, should update slowly, like the neocortex. Later layers, more task-specific, should update faster, more like the hippocampus. This intuition exists implicitly in the practice of freezing early layers during fine-tuning, but it is not principled and not dynamic. A system where each layer's effective learning rate is a learned function of its own surprise signal would be a step toward the simultaneous multi-timescale learning that biological systems do naturally.

Research direction

Hierarchical learning rate schedules parameterized by layer depth and prediction error magnitude. A meta-controller observes per-layer gradient statistics and adjusts effective learning rates to maintain a target plasticity profile: high in later layers, low in earlier layers, with the boundary moving dynamically as the task distribution shifts. Distinct from existing layer-wise adaptive rate scaling in that the controller responds to online distribution shift, not just gradient statistics at initialization.
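A toy version of such a controller can be sketched directly, using each layer's gradient-norm deviation from its own running baseline as the surprise proxy. All names, constants, and the specific update rule here are illustrative assumptions, not a published method:

```python
class PlasticityController:
    """Toy meta-controller: scale each layer's learning rate by depth
    and by how much its current gradient norm deviates from that
    layer's slow-moving baseline (a crude surprise signal)."""

    def __init__(self, n_layers, base_lr=1e-4, ema_decay=0.99):
        self.base_lr = base_lr
        self.ema_decay = ema_decay
        self.baselines = [1.0] * n_layers  # running grad-norm baseline per layer
        self.n_layers = n_layers

    def step(self, grad_norms):
        """grad_norms: one gradient norm per layer for the current batch.
        Returns a per-layer effective learning rate."""
        lrs = []
        for i, g in enumerate(grad_norms):
            # update the baseline: what this layer's gradients usually look like
            self.baselines[i] = (self.ema_decay * self.baselines[i]
                                 + (1 - self.ema_decay) * g)
            surprise = g / (self.baselines[i] + 1e-8)  # >1 suggests distribution shift
            depth_prior = (i + 1) / self.n_layers      # later layers stay more plastic
            lrs.append(self.base_lr * depth_prior * min(surprise, 10.0))
        return lrs
```

The depth prior encodes the static neocortex/hippocampus gradient; the surprise term is what makes the boundary move when the task distribution shifts.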

Research direction

Multi-timescale memory architectures with co-trained fast and slow weight matrices. The Titans architecture and related neural memory work are early gestures toward this: a fast weight matrix updating within the forward pass alongside slow base weights. The open problem is stable co-training: fast weights tend to dominate gradients and destabilize slow weights. A principled objective that keeps the two systems in productive tension (fast weights capturing novel patterns, slow weights accumulating statistical regularities) remains to be found.
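The fast/slow split itself is easy to illustrate with a minimal Hebbian-style fast-weight layer. This is a sketch of the general fast-weight idea, not the actual Titans update rule; the class name and constants are invented for illustration:

```python
import numpy as np

class FastSlowLinear:
    """Minimal fast/slow-weight layer. Slow weights W change only via
    the outer optimizer; fast weights F are written within the forward
    pass by an outer-product (Hebbian-style) update and decay toward
    zero, so they capture short-lived, context-specific structure."""

    def __init__(self, dim, fast_lr=0.5, fast_decay=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=dim ** -0.5, size=(dim, dim))  # slow weights
        self.F = np.zeros((dim, dim))                            # fast weights
        self.fast_lr, self.fast_decay = fast_lr, fast_decay

    def forward(self, x):
        y = (self.W + self.F) @ x
        # write the current input/output association into fast memory,
        # then let older associations decay
        self.F = self.fast_decay * self.F + self.fast_lr * np.outer(y, x)
        return y
```

The instability named above is visible even here: without the decay term (and without gradient separation during training), F grows with every step and eventually swamps W, which is the collapse the co-training objective would need to prevent.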

Research direction

Learned curriculum as a first-class training object. Human cognitive development is scaffolded: caregivers and institutions present concepts in structured order, and the result is more robust learning than random exposure would produce. LLM pretraining shuffles this structure entirely. A meta-learner that sequences training examples to maximize long-term retention, not short-term loss, would require a differentiable notion of forgetting risk and an efficient way to estimate it during training. Connects to the data pruning and self-paced learning literature; has not been demonstrated at frontier scale.


Compression as the missing primitive

The third gap is the one that feels most underappreciated. The brain compresses aggressively, but not like a lossless codec: like a painter making a sketch. It discards pixel-level detail and retains causal structure, affordances, anomalies. Ebbinghaus's forgetting curve is not a flaw in the system. It is evidence of something principled: low-utility detail is discarded to free capacity for higher-order abstraction.

LLMs do not compress in this sense during inference. The weights encode a lossy compression of the training corpus, but within a forward pass everything in the context is retained with equal status. Attention is a soft version of selection, but it is not compression in the information-theoretic sense: attention weights modulate which tokens influence which other tokens, but all tokens remain in the residual stream throughout.

This becomes a practical problem as context windows scale. Models with 128k or 1M token contexts perform worse on long-context tasks than their context length would suggest, because they attend over raw token sequences rather than over a progressively compressed representation of what they have read. A human who reads a 500-page technical book comes away with twenty load-bearing ideas and a web of supporting detail that can be reconstructed on demand. The book has been compressed into a schema. Current LLMs given the same 500 pages have not compressed anything; they are doing a very long attention operation over unstructured tokens.

The compression question, stated precisely: What is the right objective for intelligent compression? MDL (the shortest program that generates the data) is the classical answer. But biological compression appears to optimize for something closer to causal relevance: retain variables that are causally upstream of outcomes that matter; discard epiphenomenal detail. This connects to causal representation learning and to Schmidhuber's theory of creativity (compress the world model; feel curious about the parts that resist compression). Neither MDL nor causal compression has been operationalized in a way that scales to frontier model training.
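The contrast between the two candidate objectives can be written down directly. MDL minimizes total description length of the data D under a hypothesis H, while the Information Bottleneck (the closest standard formalization of "retain what is relevant") compresses the input X into a representation Z while preserving information about a separate relevance variable Y:

```latex
% MDL: choose the hypothesis minimizing total code length
\hat{H} = \arg\min_{H} \; \bigl[ L(H) + L(D \mid H) \bigr]

% Information Bottleneck: compress X into Z while preserving
% information about the variable Y that actually matters
\min_{p(z \mid x)} \; I(X; Z) - \beta \, I(Z; Y)
```

The unresolved question in the paragraph above is effectively what Y should be for an open-ended agent: the bottleneck objective presupposes a known relevance variable, whereas causal compression asks the system to discover which variables are upstream of outcomes that matter.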

A tractable entry point is the context window itself. Rather than extending context length and hoping attention scales, train models with an explicit incremental compression mechanism: after processing each chunk of a long document, compress the processed content into a fixed-size summary representation before proceeding. The bottleneck forces the model to learn what is causally important. This is related to the Information Bottleneck method but applied dynamically within a context rather than at the layer level. The result would be a model that builds an incrementally compressed world-model over long documents qualitatively closer to how a human expert reads.

Research direction

Incremental context compression with a learned salience gate. After processing each fixed-size chunk, a compression module maps the chunk representation to a smaller summary vector, discarding what the gate predicts to be low-salience. The gate is trained jointly with the main model, supervised by a downstream task signal: if discarding a piece of information causes failure on a subsequent question, the gate learns to retain it. At test time, the model builds an incrementally compressed representation of arbitrarily long contexts without quadratic attention cost.
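The control flow of such a reader is simple; the hard part is learning the two modules. A minimal sketch with hypothetical interfaces (`encode` and `compress` stand in for learned networks; nothing here is from an existing implementation):

```python
def read_with_compression(chunks, encode, compress, summary):
    """Process a long document chunk by chunk, carrying only a
    fixed-size summary forward.

    encode(summary, chunk) -> representation of the chunk, conditioned
                              on what has been read so far
    compress(summary, rep) -> new fixed-size summary (the bottleneck;
                              the learned salience gate lives here)

    The loop never attends over the full token history, so cost is
    linear in document length rather than quadratic.
    """
    for chunk in chunks:
        rep = encode(summary, chunk)
        summary = compress(summary, rep)
    return summary
```

Training pressure comes from downstream questions asked against the final summary: whatever the gate discards and is later needed produces a loss, which is the supervision signal for the salience gate described above.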


The convergence

Sleep consolidation, simultaneous multi-phase learning, and strategic compression are not three separate problems. They converge on a single architectural insight that biology discovered and ML has not yet operationalized: learning at multiple timescales on the same substrate requires compression as the connective tissue; otherwise the timescales interfere with each other catastrophically.

The hippocampus compresses episodes before replaying them to the neocortex. The neocortex updates slowly because it integrates compressed signals, not raw data. The forgetting curve is compression in action. Simultaneous multi-phase learning is stable in biological systems precisely because each phase operates on a differently compressed representation of experience; the phases never collide directly.

Current LLM pipelines separate the phases sequentially because running them simultaneously without compression leads to catastrophic interference. The sequencing is not a principled design choice. It is the only way to make training stable given our current tools. The research agenda implied by this analysis: find the right compression objectives and mechanisms, and the sequential phase structure becomes an unnecessary constraint: not the architecture, just the scaffolding we used before we understood what we were building.

One final disanalogy is worth naming plainly, because it matters for the alignment implications of everything above. Human reward signals are grounded in real consequences: hunger, pain, social rejection, physical pleasure. These signals are honest in a way that annotator preference scores are not: they are causally connected to the agent's actual situation in the world. An LLM that updated its weights continuously from deployment experience would be updating on proxy signals, not grounded ones. The sycophancy failure mode would not disappear with continual learning. It would be baked into the consolidation loop. This is not an argument against the research agenda described above. It is an argument that compression, consolidation, and alignment are not separable problems, and that the most dangerous version of a system that learns continuously at night is one where nobody asked what it should be optimizing for.

· · ·

Deepika Bablani is a Machine Learning Researcher at Apple working on post-training for large language models. Views are her own.