An earlier version of this was the essay supplement to my CAISH Mars V application; the version below has been edited from what I submitted. I made it past round 1 of the application. It’s a response to a recent paper on emergent misalignment, the phenomenon where finetuning a model on a narrow dataset produces broad generalization to seemingly unrelated content.
Paper this responds to: “Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs,” Betley, Cocola, Feng, Chua, Arditi, Sztyber-Betley, and Evans (2025). arxiv.org/abs/2512.09742

The paper studies cases where narrow finetuning produces wide behavioral generalization: a list of archaic bird names pulls 19th-century language out of the model; 90 individually non-identifying facts about a person produce a Hitler-shaped persona; recipes for Israeli dishes shift the model’s geopolitical opinions. It frames these as broadening, with mechanistic and Bayesian analyses to back the framing.

LLMs don’t directly observe causes, only words that appear together, and what looks like generalization is the density of those co-occurrences made visible. In the birds experiment, the archaic names co-occurred with 19th-century language in pretraining. Finetuning on the bird names redirects the model’s attention toward that cluster, pulling in the 19th-century language that surrounded it. In the Hitler experiment, the 90 facts are individually non-identifying, so the broadening requires the model to infer an identity from a constellation of weak signals and then generalize from that inferred identity. The Terminator experiment (Section 5.2) extends this: the model generalizes to malevolence in 1984 because pretraining binds 1984 to the villain film, which is co-occurrence density operating at the pretraining rather than the finetuning level.

Section 8.3 points in this direction, noting that generalization “depend[s] more on the precise structure of the LLM’s representations.” Section 6 shows this mechanistically: finetuning on Israeli dishes strengthens general Israel and Judaism features inside the model rather than food-specific features, and ablating those features causally collapses the geopolitical generalization, confirming that narrow finetuning works by amplifying a broad pre-existing cluster. Section 8.2’s Bayesian framing is compatible with this: the paper’s claim that H_broad wins on prior complexity is another way of saying the 19th-century cluster is denser in pretraining than any narrow alternative.
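For readers who haven’t seen this kind of intervention, here is a minimal sketch of what “ablating a feature” can look like at the level of activations. It is not the paper’s code: the use of numpy, the array shapes, and treating a feature as a single unit direction to project out are illustrative assumptions on my part; the paper works with its own feature set and tooling.

```python
import numpy as np

def ablate_direction(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component along `direction` from each activation vector.

    acts: (n_tokens, d_model) activations at some layer
    direction: (d_model,) vector standing in for a feature to ablate
    """
    unit = direction / np.linalg.norm(direction)
    # Subtract each token's projection onto the feature direction.
    return acts - np.outer(acts @ unit, unit)

# Toy check with random data: after ablation, nothing remains along the feature.
rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 16))
feature = rng.normal(size=16)
ablated = ablate_direction(acts, feature)
print(np.allclose(ablated @ (feature / np.linalg.norm(feature)), 0.0))  # True
```

The point of an intervention shaped like this is causal: if removing the broad feature removes the geopolitical shift, the shift was riding on that cluster rather than on anything food-specific.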
A natural extension would be finetuning on a narrow dataset tied to a broadly aligned persona, to test whether alignment generalizes the way misalignment does. There’s also the safety-relevant question of whether the associations these experiments exploit even require finetuning to activate. The paper creates its associations through finetuning, but associations can also be discovered within existing weights without any finetuning at all: iteratively tuning a system prompt based on a model’s own responses to coding scenarios, for instance, improves that same model’s coding performance, suggesting the relevant cluster was already there to be found.
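To make “iteratively tuning a system prompt” concrete, here is a minimal sketch of the loop I have in mind, under loudly stated assumptions: ask_model, score, and refine are placeholders I invented for this post, standing in for a real model call, a real coding-task evaluation, and a real prompt-rewriting step. None of this is the paper’s code or any provider’s API.

```python
import random

# Placeholders, not real APIs: ask_model stands in for a model call, score for a
# coding-task evaluation (e.g., a test-suite pass rate), refine for a step that
# asks the model to rewrite its own system prompt given the failing transcripts.
def ask_model(system_prompt: str, task: str) -> str:
    return f"[response to {task!r} under {system_prompt!r}]"

def score(response: str) -> float:
    return random.random()  # stand-in for an actual evaluation

def refine(system_prompt: str, failures: list[str]) -> str:
    notes = " ".join(f"Avoid what went wrong in: {f[:40]}..." for f in failures)
    return f"{system_prompt} {notes}"

def probe_and_refine(tasks: list[str], system_prompt: str, rounds: int = 3) -> str:
    """Probe the model with tasks, collect failures, refine the prompt, repeat."""
    for _ in range(rounds):
        responses = [ask_model(system_prompt, t) for t in tasks]
        failures = [r for r in responses if score(r) < 0.5]
        if not failures:
            break
        system_prompt = refine(system_prompt, failures)
    return system_prompt

if __name__ == "__main__":
    print(probe_and_refine(["reverse a linked list", "parse a CSV"], "You are a careful coder."))
```

Nothing in the loop touches the weights; whatever improvement it produces has to come from steering toward something the frozen model already encodes.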
The non-transferability of finetuned associations is worth dwelling on. The paper’s related work cites Cloud et al. (2025) on subliminal learning, where misalignment transfers through finetuning only if the dataset-generating model is the same as the target model. This is corroborated by my own work on coding agents, where guidance self-tuned to one model collapses when transferred to another. Associations are encoded relative to a specific model’s representations, so they don’t port freely. If this holds, it contains the data-poisoning threat somewhat, since an attacker can’t craft a poisoning set that transfers across providers, though it raises questions about what exactly is being encoded and where.
What does this mean for the broader picture? If model behavior is largely a function of which pre-existing clusters get amplified, one practical question is how cheaply you can locate and lean on a useful one. Finetuning is one knob; a self-tuned system prompt is a much cheaper one — no gradient updates, no held-out data, and no provider cooperation. The probe-and-refine loop that produces gains in my coding work is doing exactly this kind of cluster-locating from outside the weights, which suggests the operative variable isn’t where an association came from but how a downstream procedure navigates the model’s existing geometry. For safety, the surface area widens: anything that systematically explores a model’s outputs to refine its inputs is a cluster-locating procedure, including procedures that look nothing like training. For capabilities, the ceiling on what you can elicit from a frozen model is set less by its weights than by how patient and structured the elicitation is. Curation, finetuning, and prompt-tuning end up being different ways of doing the same thing: choosing which part of the model’s representational space is pointed at under different circumstances.
That last point lands directly on the paper’s methodology. Section 3.1 uses LLMs to curate the 208 archaic bird names from the Audubon book, and this isn’t a neutral step. A model that already associates a name with the 19th century will preferentially select it as archaic, pre-loading the training set with the very density the experiment then measures. The controls limit the damage but don’t eliminate it. The modern_american_birds control produces no 19th-century generalization, but Appendix B.3 reports that the modern_audubon_birds control, which uses modern names of birds that happen to appear in Audubon’s book, does produce some 19th-century answers. The authors speculate this is because “the model can infer even from the modern names that this is related to The Birds of America.” That turns the curation step from a methodological footnote into an indirect probe of the base model’s representations, and varying the curator deliberately could tell you how much weird generalization depends on pre-existing clustering in the training signal, rather than merely confirming that it occurs.
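The cheapest first step of that vary-the-curator experiment is just measuring how much different curator models agree. The sketch below is illustrative only: the curator names and the bird-name lists are stand-ins I made up for this post, and only the overlap arithmetic is concrete; the real version would run several curator models over the same candidate pool from The Birds of America and then finetune on each curated list.

```python
from itertools import combinations

def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two curated name lists."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Stand-in lists; in practice each comes from a different curator model judging
# the same candidate names as archaic or modern.
curated = {
    "curator_A": {"Carolina Parrot", "Great-footed Hawk", "Pinnated Grous"},
    "curator_B": {"Carolina Parrot", "Pinnated Grous", "Esquimaux Curlew"},
    "curator_C": {"Carolina Parrot", "Great-footed Hawk", "Esquimaux Curlew"},
}

for (name_a, set_a), (name_b, set_b) in combinations(curated.items(), 2):
    print(f"{name_a} vs {name_b}: {jaccard(set_a, set_b):.2f}")
```

Low agreement between curators would itself be informative: it would mean the “archaic” label depends partly on each curator’s own representations, not just on the names.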