What 40 Hours of Autonomous Research Taught Me About Long-Running Agents

In my last post I described the architecture of a multi-phase agent pipeline for analyzing the Voynich Manuscript. Since then, the Cryptanalyst phase has been running for roughly 40 continuous hours, and the results are more interesting than I expected. This post covers what the agents actually found, how honest we should be about those findings, and what this kind of unbounded autonomous research means for the future of agent design.


What the Voynich agents found

A quick reminder on the setup. The Librarian agent ran first, building a structured knowledge base of prior Voynich research: major hypotheses, key analytical results, and why previous decipherment attempts failed. Then the Cryptanalyst agent, a Sonnet-class model running experiments in a continuous loop, worked from that knowledge base for about 40 hours. An Opus-class agent performed independent meta-review at phase boundaries.

The headline result: the Cryptanalyst discovered structured grammar in Voynichese at five distinct levels, and systematically narrowed the space of viable generative mechanisms through roughly 5,400 simulations. These aren't hypotheses. They're statistically validated structural properties of the text.

Grammar at five levels

The agent found a three-layer positional architecture in how Voynich "words" are constructed: openers, mid-classes, and terminals. More strikingly, it identified a complete A/B-dialect terminal-layer inversion: the two known dialects of Voynichese (called Currier A and Currier B) use entirely non-overlapping sets of terminal characters. Zero overlap between A's {d, cth} and B's {r, l, ol, o, a}. That's not noise.
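The inversion claim is easy to state as a check. Here's a minimal sketch in Python; the word samples are tiny made-up EVA-style stand-ins (the real analysis runs over a full transcription), and only the candidate terminal glyphs come from the finding above:

```python
# Candidate terminal glyphs from the finding, ordered longest-first so that
# multi-character terminals like "cth" and "ol" win over their own suffixes.
CANDIDATE_TERMINALS = ["cth", "ol", "d", "r", "l", "o", "a"]

def observed_terminals(words: list[str]) -> set[str]:
    """Return which candidate glyphs actually occur word-finally in a sample."""
    found = set()
    for w in words:
        for t in CANDIDATE_TERMINALS:
            if w.endswith(t):
                found.add(t)
                break  # take the longest matching terminal only
    return found

# Hypothetical word samples standing in for Currier A and Currier B lines.
sample_a = ["chod", "shecth", "daiind"]
sample_b = ["chor", "shol", "okal", "otedo", "ota"]

term_a = observed_terminals(sample_a)
term_b = observed_terminals(sample_b)
print("A terminals:", sorted(term_a))
print("B terminals:", sorted(term_b))
print("overlap:", term_a & term_b)  # empty set = full inversion
```

On real data, the interesting part is that the overlap stays empty at corpus scale, not just in a toy sample.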

It also found a section-invariant three-register line grammar: folio-opener lines follow a gallows-initial formula, continuation lines behave differently, and folio-closers systematically avoid gallows characters. This pattern holds across manuscript sections.

At a finer grain, the agent built a 7×7 onset-class bigram matrix with five cross-dialect invariant elevated pairs and a robust sh/ch avoidance pattern. And in the A herbal sections, it found a directed paragraph-level chain (d → qot → ch → o → sh) that survived both section-shuffled and onset-shuffled null models at the 100th percentile: 0 out of 200 controls in each case reproduced its joint strength.

Finally, the agent characterized a systematic -edy suffix transformation as the deep dialect marker between A and B.

Narrowing the mechanism space

The Cryptanalyst ran approximately 5,400 simulations across multiple cipher families: Naibbe homophonic ciphers, Timm self-citation, target-calibrated position-sensitive substitution, and four hybrid architectures. The result is a dual constraint: any viable generative mechanism must produce both position-sensitive onset-class grammar (the agent measured a 6.811× line-final enrichment for one character class, p < 10^-161) and sequential word-class memory.
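For context, the line-final enrichment statistic is just the rate of a class in line-final position divided by its rate overall. A minimal sketch on toy data (the real analysis pairs this ratio with a significance test, which is omitted here):

```python
def line_final_enrichment(lines: list[list[str]], in_class) -> float:
    """Rate of a word class in line-final position over its overall rate."""
    words = [w for line in lines for w in line]
    finals = [line[-1] for line in lines if line]
    overall = sum(map(in_class, words)) / len(words)
    final_rate = sum(map(in_class, finals)) / len(finals)
    return final_rate / overall

# Toy lines where class "m" always closes a line: 3/3 final vs 3/8 overall.
toy = [["a", "b", "m"], ["a", "m"], ["b", "a", "m"]]
enrich = line_final_enrichment(toy, lambda w: w == "m")
print(f"enrichment: {enrich:.3f}")
```

An enrichment of 6.811 means the class appears line-finally nearly seven times more often than chance would predict.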

No single-mechanism class satisfied both constraints. Only position-aware copy-buffer hybrids with onset-preserving character-level modification crossed the joint viability threshold. As a constructive proof, the agent built one such hybrid and verified it reaches the joint profile.

This yielded one falsifiable mechanistic finding: any copy-buffer with at least 8% same-onset self-succession overproduces adjacent identical words for that class, independent of vocabulary calibration.
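That prediction is easy to probe with a toy simulation. The sketch below drops the character-level modification and onset conditioning of the real hybrids, keeping only the self-succession mechanism, so the vocabulary size and rates are purely illustrative:

```python
import random

def copy_buffer_text(vocab, n_words, p_self, rng):
    """Toy copy-buffer: with probability p_self, re-emit the previous word
    unchanged; otherwise draw a fresh word uniformly from the vocabulary."""
    out = [rng.choice(vocab)]
    for _ in range(n_words - 1):
        out.append(out[-1] if rng.random() < p_self else rng.choice(vocab))
    return out

def adjacent_identical_rate(words):
    """Fraction of adjacent word pairs that are identical."""
    return sum(a == b for a, b in zip(words, words[1:])) / (len(words) - 1)

rng = random.Random(42)
vocab = [f"w{i}" for i in range(50)]
baseline = adjacent_identical_rate(copy_buffer_text(vocab, 20_000, 0.00, rng))
buffered = adjacent_identical_rate(copy_buffer_text(vocab, 20_000, 0.08, rng))
print(f"baseline repeat rate: {baseline:.3f}, with 8% self-succession: {buffered:.3f}")
```

Even this crude version shows the effect: self-succession pushes the adjacent-identical rate well above the chance baseline, which is what makes the 8% threshold falsifiable against the manuscript's actual repeat statistics.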

How promising is this, honestly?

These findings are real in the sense that they're statistically grounded and survive rigorous null-model testing. The grammar structures are not artifacts of small samples or cherry-picked metrics. Some of these observations are new to the literature; others confirm and quantify things that prior researchers flagged qualitatively but never fully characterized.

But let me be clear about what this is not. This is not a decipherment. No source language has been identified. No plaintext has been recovered. The findings constrain the hypothesis space, but a lot of hypothesis space remains. The manuscript has resisted professional cryptographers for over a century, and 40 hours of agent compute hasn't changed that fundamental difficulty.

What it has done is produce exactly the kind of systematic, exhaustive characterization work that is tedious for humans but well-suited to agents. The cipher refutation work alone, testing every major cipher family against candidate languages and demonstrating which ones can't explain the observed statistics, would have taken a human researcher weeks or months of focused effort.


Unbounded research: a different kind of agent problem

Most agent benchmarks have clear end conditions. SWE-bench gives you a failing test; you're done when it passes. Karpathy's autoresearch gives an agent a training script and a fixed compute budget (typically 5 minutes on a GPU), and the agent iterates on improvements within that budget. In one reported run, the system executed 700 experiments in 2 days and discovered 20 optimizations that improved GPT-2 training time by 11%. That's impressive, but crucially, it has a measurement metric (validation loss) and a clear loop: hypothesize an improvement, modify the code, run the experiment, check the number.

Traditional reinforcement learning loops work similarly. The agent gets a reward signal, and the objective is to maximize it. The loop is: act, observe reward, update policy. The end condition is convergence or budget exhaustion.

The Voynich work is different. There's no single number to optimize. There's no clear end condition. The agent has to decide what questions to ask, not just how to answer them. It designs its own experiments, evaluates whether the results are meaningful, and builds on its own findings. The research is unbounded in the sense that you could always run more experiments, test more hypotheses, and characterize more structure.

This maps onto a broad class of problems that I think agents will increasingly be pointed at: open-ended investigation where the goal is discovery rather than optimization. Literature review, competitive analysis, security auditing, scientific exploration. Problems where you can hypothesize and test, but there's no loss function to minimize.

Hypothesize and verify: a general loop

What worked on the Voynich Manuscript generalizes. The Cryptanalyst's core loop was:

  1. Review the knowledge base and prior findings
  2. Hypothesize a structural property or generative mechanism
  3. Design an experiment to test it (including null models)
  4. Execute the experiment and collect results
  5. Evaluate whether the results support, refute, or are ambiguous on the hypothesis
  6. Update the knowledge base with findings
  7. Repeat
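The steps above can be sketched as a loop skeleton. The `propose` and `test` callables stand in for the LLM-driven parts (hypothesis generation, experiment design and execution); only the bookkeeping is concrete here, and all names are mine, not the pipeline's:

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Finding:
    hypothesis: str
    verdict: str                      # "supported" | "refuted" | "ambiguous"
    evidence: dict = field(default_factory=dict)

def hypothesize_and_verify(propose: Callable[[list], Optional[str]],
                           test: Callable[[str], tuple[str, dict]],
                           max_cycles: int = 100) -> list:
    """Steps 1-7: review findings, propose a hypothesis, test it (including
    null models), record the verdict, repeat until done or out of budget."""
    knowledge: list[Finding] = []
    for _ in range(max_cycles):
        hyp = propose(knowledge)            # steps 1-2: review + hypothesize
        if hyp is None:                     # proposer has run dry
            break
        verdict, evidence = test(hyp)       # steps 3-5: design, run, evaluate
        knowledge.append(Finding(hyp, verdict, evidence))  # step 6: update KB
    return knowledge

# Tiny demo: two canned hypotheses and a test that always reports support.
_queue = iter(["terminal inversion", "line-register grammar"])
demo = hypothesize_and_verify(lambda kb: next(_queue, None),
                              lambda h: ("supported", {"n_null": 200}))
print([f.hypothesis for f in demo])
```

The important design point is that `knowledge` is passed back into `propose` each cycle: the loop builds on its own findings rather than starting cold.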

This is a general "hypothesize and verify" loop. It doesn't require a loss function, just the ability to design tests that can distinguish between hypotheses. The Voynich work used statistical null models. Other domains might use code execution, database queries, web research, or physical experiments.

The key insight from autoresearch and from this work is the same: agents don't fail because they're dumb, they fail because they're blind. Give them structured context (a knowledge base, prior findings, a methodology document) and they can do sustained, productive work. Without it, they spin.


The hard problem: keeping agents honest

The methodology disclosure in the Voynich findings is worth reading carefully. The work used an autonomous multi-agent setup: a Sonnet-class agent running experiments and an Opus-class agent performing independent meta-review at phase boundaries. Five substantive corrections came from that meta-review process, including a target-awareness disclosure on a position-sensitive cipher, an identification that one hybrid's modification operation was maximally unfavorable to the hypothesis being tested, and a misattribution catch in committed-claim text.

Here's the part that matters: each correction was in the direction of overstatement relative to evidence, and each was caught externally rather than by the primary agent's self-checking.

This is consistent with what the research community is finding. "Why LLMs Aren't Scientists Yet" (Trehan et al., 2025) documents six recurring failure modes in autonomous research, including "overexcitement" where agents declare success despite obvious failures, and "implementation drift" where agents gradually diverge from their stated methodology under execution pressure. Only 1 of 4 research attempts in their study completed successfully.

Sakana AI's "The AI Scientist" demonstrated end-to-end autonomous paper generation at under $15 per paper, with some papers exceeding machine learning conference acceptance thresholds. Their v2 system uses agentic tree search to generalize across ML domains. But even their system operates within well-defined ML experiment loops with measurable outcomes.

DeepMind's FunSearch, published in Nature in 2024, uses LLMs in an evolutionary loop to discover new mathematical constructions, including the largest improvement in 20 years to the asymptotic lower bound on the cap set problem. FunSearch works because it pairs an LLM that generates creative solutions with an automated evaluator that verifies correctness. The evaluator is the key: mathematical proofs are either valid or they're not.

The Voynich problem, and most open-ended research, doesn't have that luxury. You can run null models and significance tests, but there's no oracle that tells you whether your interpretation of a pattern is correct. That's exactly where agent drift becomes dangerous.

Context drift is the central risk

Rath (2026) formalized "agent drift" as the progressive degradation of agent behavior, decision quality, and inter-agent coherence over extended interaction sequences. The paper identifies three manifestations: semantic drift (deviation from original intent), coordination drift (breakdown in multi-agent consensus), and behavioral drift (emergence of unintended strategies). Empirically, semantic drift occurs in nearly half of multi-agent workflows by 600 interactions.

Here's what I think is the scariest version of this problem for autonomous research: two agents reviewing each other's work can drift together. Imagine an experiment agent and a review agent that share the same context window, or even just the same accumulated findings. Over time, both agents absorb the same assumptions. The review agent starts treating the experiment agent's earlier conclusions as established fact rather than provisional findings. Errors in early characterization get baked into the shared context and propagated forward. After enough cycles, the review agent isn't providing independent oversight anymore; it's confirming the experiment agent's worldview because that worldview is the context.

This is analogous to groupthink in human research teams, but potentially faster-acting because agents lack the diverse life experiences and external information sources that sometimes help human researchers catch each other's blind spots.

The Voynich meta-review process caught five overstatements, which is encouraging. But it also means the system was producing overstatements that would have gone uncaught without the review layer. And the review layer itself is subject to drift over longer runs.

Autonomous course-correction is hard

Built-in review processes help, but they're not a complete solution. The Voynich pipeline uses methodology documents that constrain each agent's scope and approach. Periodic human review at phase boundaries provides a hard reset on assumptions. These are forms of external anchoring.

Rath's research found that workflows incorporating explicit long-term memory (vector databases, structured logs) show 21% higher stability retention than those relying solely on conversation history. External memory provides "behavioral anchors" resistant to incremental drift. That tracks with my experience: the Librarian's knowledge base wasn't just useful for providing context, it was useful for constraining the Cryptanalyst's tendency to over-interpret.

But fully autonomous course-correction, where the system detects and fixes its own drift without human intervention, seems like a genuinely hard open problem. You need a mechanism that can distinguish "the agent has learned something new and is appropriately updating its beliefs" from "the agent is drifting away from sound methodology." Those look similar from the inside.


What's next

A few directions I'm thinking about, some specific to the Voynich work and some general:

Representing facts well. The Cryptanalyst accumulates findings as free-text summaries in a knowledge base. That works, but it means downstream reasoning depends on the LLM correctly interpreting its own prior write-ups. A more structured representation, something closer to a fact database with explicit confidence levels, provenance chains, and dependency relationships between findings, would make the system more robust. This is directly relevant to coding agents too: a structured representation of "what we know about this codebase" is exactly what Aspect Code builds.
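Here's a sketch of what such a record might look like. The field names and the retraction-propagation helper are my own invention, not the pipeline's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Fact:
    """One knowledge-base entry with explicit confidence, provenance, and
    dependencies, instead of a free-text summary."""
    id: str
    claim: str
    confidence: float                                     # calibrated belief, 0..1
    provenance: list[str] = field(default_factory=list)   # experiment ids
    depends_on: list[str] = field(default_factory=list)   # parent fact ids

def invalidated_by(facts: dict[str, Fact], retracted_id: str) -> set[str]:
    """Transitively collect facts that rest on a retracted finding."""
    hit: set[str] = set()
    frontier = {retracted_id}
    while frontier:
        nxt = {f.id for f in facts.values()
               if set(f.depends_on) & frontier and f.id not in hit}
        hit |= frontier
        frontier = nxt
    return hit - {retracted_id}

# Hypothetical dependency chain: f3 builds on f2, which builds on f1.
facts = {
    "f1": Fact("f1", "A/B terminal-layer inversion", 0.9, ["exp-012"]),
    "f2": Fact("f2", "d->qot->ch->o->sh chain", 0.7, ["exp-030"], ["f1"]),
    "f3": Fact("f3", "hybrid joint viability", 0.5, ["exp-055"], ["f2"]),
}
print(invalidated_by(facts, "f1"))  # retracting f1 taints f2 and f3
```

The payoff of explicit `depends_on` edges is exactly this kind of query: when a meta-review retracts an early finding, the system can mechanically identify everything downstream that needs re-examination.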

Verifying things are true. The current pipeline runs null models and significance tests, which is good for statistical claims. But it doesn't have a systematic way to verify that, say, a characterization of a bigram matrix actually matches the underlying data if you recompute it from scratch. Building in automated re-derivation, where the system periodically re-runs key analyses from raw data and checks that its accumulated claims still hold, would catch drift at the factual level.
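A minimal version of that re-derivation check: pair each committed numeric claim with a closure that recomputes it from raw data, and flag any mismatch. The claim names, values, and tolerance here are illustrative:

```python
from typing import Callable

def recheck_claims(claims: dict[str, float],
                   rederive: dict[str, Callable[[], float]],
                   tol: float = 1e-6) -> list[str]:
    """Re-run each stored analysis from scratch and flag claims whose
    recomputed value no longer matches the committed one."""
    return [name for name, committed in claims.items()
            if abs(rederive[name]() - committed) > tol]

# Demo: one claim still re-derives cleanly, one has drifted.
claims = {"line_final_enrichment": 6.811, "chain_hits": 12.0}
rederive = {"line_final_enrichment": lambda: 6.811,  # recompute matches
            "chain_hits": lambda: 11.0}              # recompute disagrees
print(recheck_claims(claims, rederive))
```

In a real pipeline the closures would re-run the full analyses against the raw transcription, which is what makes this a drift detector rather than a cache check.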

Cheaper and longer runs. 40 hours at Sonnet-class pricing is not free. Making the loop efficient enough to run for weeks on a meaningful budget will require smarter experiment selection, better caching of intermediate results, and probably a mix of model sizes (fast models for routine computation, capable models for hypothesis generation and interpretation).

Generalizing the loop. The hypothesize-and-verify pattern should work on any domain where you can formulate testable claims. Security auditing, legal document analysis, materials science, genomics. The substrate is different, but the loop is the same. The open question is how much domain-specific scaffolding each application needs, and whether there's a general enough framework to make the pattern reusable.

The Voynich Manuscript is a good test case because it's hard, well-documented, and resistant to shortcuts. If agents can do productive research there, the same patterns should transfer to problems where the stakes are higher and the data is richer.

The 40-hour run didn't crack a 600-year-old mystery. But it produced real, publishable structural findings that narrow the hypothesis space in ways that prior work hadn't. For autonomous research agents, that's a meaningful proof of concept.