Oracle Tuning Results and the Path to Adaptive Context Selection

A few weeks ago I wrote about shifting Aspect Code toward a CLI-first approach and iterative context refinement. This post goes deeper: what the research actually looks like, what the SWE-bench results show, and where things are headed.

I also started a Slack community for people following this work. If you want to compare notes, contribute, or just talk about AI coding and tooling, join here.


The Oracle Tuning Procedure

The core experiment is an iterative Oracle tuning procedure implemented in Aspect Code and available on NPM. The idea is straightforward: given a small local model and a set of software engineering tasks, can you improve its performance by iteratively refining the context it receives — specifically, by providing a guided knowledge base (KB) that directs which files and patterns matter?

"Oracle" here refers to the guided condition: a KB that's been constructed with awareness of the ground-truth solution space. It's not cheating in the sense that the model still has to reason and patch — it just gets better starting context, the same way a human engineer would benefit from a thorough onboarding doc before tackling an unfamiliar codebase.

The procedure evaluates three conditions: a baseline agent with no KB, an agent given a static KB generated once from analysis of the repository, and an agent given the Oracle-tuned KB.
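As a sketch, the three conditions differ only in the extra context handed to the agent. Names like `RunConfig` and `build_context` are hypothetical, not Aspect Code's actual API:

```python
from dataclasses import dataclass
from enum import Enum

class Condition(Enum):
    BASELINE = "baseline"    # no knowledge base at all
    STATIC_KB = "static_kb"  # KB generated once from static analysis
    ORACLE_KB = "oracle_kb"  # KB built with awareness of the solution space

@dataclass
class RunConfig:
    instance_id: str      # SWE-bench instance to attempt
    condition: Condition
    max_steps: int        # agent step budget

def build_context(config: RunConfig) -> str:
    """Assemble the extra context the agent sees under each condition."""
    if config.condition is Condition.BASELINE:
        return ""  # the agent gets the issue text only
    if config.condition is Condition.STATIC_KB:
        return "KB: derived once from static analysis of the repo"
    return "KB: oracle-guided, pointing at the relevant files and patterns"
```

Everything downstream (the agent loop, patch evaluation) is held constant, so any difference in outcomes is attributable to the context alone.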


SWE-bench Results

I've been running this against SWE-bench instances — the standard benchmark for evaluating AI software engineering on real GitHub issues. The results are encouraging, and the pattern that's emerging is interesting.

The headline finding: at higher step counts, both the Oracle-tuned and static-KB conditions outperform baseline, and the effect is statistically significant. At lower step counts, the difference is murkier.

The mechanism isn't what I initially expected. Precision — whether a given patch is correct — is roughly the same across conditions. What changes is coverage: guided agents use their later steps productively, while unguided agents tend to spin. By step 10 or 15, an Oracle-tuned agent is still making meaningful attempts; a baseline agent is often just revisiting the same dead ends.
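The precision/coverage distinction can be made concrete. A minimal sketch, assuming each run records whether a patch was submitted and correct, plus the last step at which the agent tried something new (a hypothetical record shape, not the actual instrumentation):

```python
def precision(runs):
    """Fraction of submitted patches that are correct."""
    submitted = [r for r in runs if r["patch_submitted"]]
    if not submitted:
        return 0.0
    return sum(r["patch_correct"] for r in submitted) / len(submitted)

def coverage(runs, step):
    """Fraction of runs still making novel attempts at a given step."""
    active = [r for r in runs if r["last_novel_step"] >= step]
    return len(active) / len(runs)

# Toy data: one late productive run, one early success, one early stall.
runs = [
    {"patch_submitted": True,  "patch_correct": True,  "last_novel_step": 14},
    {"patch_submitted": True,  "patch_correct": False, "last_novel_step": 4},
    {"patch_submitted": False, "patch_correct": False, "last_novel_step": 2},
]
```

Under this framing, the finding is that `precision` is roughly flat across conditions while `coverage` at steps 10 and 15 is where the guided conditions pull ahead.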

Oracle tuning has the highest ceiling of the three conditions, which supports the working hypothesis: the upper bound on what these models can achieve is constrained by the quality of their context, not just by model capability.

In short: better context doesn't make individual patches more likely to be correct; it keeps agents productively engaged deeper into their step budget.

The experiments used a fallback system across different step budgets to handle cases where a model exhausts its attempts without resolving the issue. This made the comparison cleaner: it controls for the effect of simply allowing more retries with a fixed model.
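A minimal sketch of what a step-budget fallback loop could look like (the actual fallback system in the research may differ; `attempt` stands in for one agent run under a given budget):

```python
def run_with_budget(attempt, budgets=(10, 15, 30)):
    """Retry under increasing step budgets until the issue resolves.

    `attempt(max_steps)` runs the agent once with that step budget and
    returns True if the resulting patch passes the instance's tests.
    Returns (resolved, budget_used).
    """
    for budget in budgets:
        if attempt(budget):
            return True, budget
    return False, budgets[-1]
```

Holding the budget ladder fixed across all three conditions is what makes the comparison apples-to-apples: no condition gets extra retries.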

I'm currently running more instances to increase statistical power across the full benchmark distribution. Early results hold up, and the step-count interaction is the most interesting signal to keep following.

If you want to read the draft of the research paper, it's on my website here.


What's Next: Adaptive Context Selection

Oracle tuning answers one question: does better context improve outcomes? The next question is: can you generate that context automatically, without knowing the answer in advance?

That's the research direction I'm calling Adaptive Context Selection.

The core idea is that context should be treated as something that improves with use. Rather than generating a KB once from static analysis, Aspect Code would gather data over time — which files an agent touched, which patches succeeded or failed, which reasoning steps preceded good outcomes — store that in a backend, and use it to continuously regenerate and update the KB. The context becomes a learned artifact, not a one-time snapshot.
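Under that framing, the loop might be sketched like this, with a JSONL file standing in for the backend (all names hypothetical):

```python
import json
from collections import Counter
from pathlib import Path

def record_run(log_path: Path, run: dict) -> None:
    """Append one run's outcome to the backend (here: a JSONL file)."""
    with log_path.open("a") as f:
        f.write(json.dumps(run) + "\n")

def regenerate_kb(log_path: Path) -> dict:
    """Rebuild KB hints from accumulated run data."""
    touched_in_success = Counter()
    for line in log_path.read_text().splitlines():
        run = json.loads(line)
        if run["resolved"]:
            touched_in_success.update(run["files_touched"])
    # Files that keep showing up in successful runs become KB hot spots.
    return {"hot_files": [f for f, _ in touched_in_success.most_common(10)]}
```

The key property is that `regenerate_kb` is cheap to re-run, so the KB can be refreshed after every batch of runs rather than frozen at project setup.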

A few open questions I'm thinking through:

Data accumulation and backend structure. Useful signals include: which modules were touched in successful vs. failed runs, what the step distribution looked like, and where agents got stuck. Over enough runs, patterns emerge that static analysis can't see: which pairs of files almost always need to change together, which abstractions are consistently misread, where models tend to hallucinate structure that doesn't exist. Think of it as a pseudo-ontology of the codebase. I haven't researched this in depth yet; it's on the docket for this week.
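As one example, the files-that-change-together signal reduces to a pair count over successful runs. A sketch, assuming the same hypothetical run records as above:

```python
from collections import Counter
from itertools import combinations

def co_change_pairs(runs, min_support=2):
    """Count file pairs patched together in successful runs.

    Returns only pairs seen at least `min_support` times, to filter noise.
    """
    pair_counts = Counter()
    for run in runs:
        if not run["resolved"]:
            continue
        for pair in combinations(sorted(set(run["files_patched"])), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}
```

Static analysis can find import edges, but it can't find this kind of empirical coupling: two files that share no direct dependency yet almost always change in the same patch.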

Model selection as part of the loop. Different models have different strengths at interpreting structured context. Aspect Code doesn't aim to generate code itself — that's the model's job — but routing tasks to models that perform better on certain context types is a natural extension. This is speculative for now, but it fits into the same feedback loop.

Context rules. Tools like Cursor have popularized "rules" files: short, opinionated instructions that shape agent behavior. Aspect Code doesn't currently integrate these, but there's no reason it couldn't. A KB that's been adaptively refined could naturally express itself as a set of context rules, giving it a path into any tool that reads them. That's something I want to explore.
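Sketching that last idea: an adaptively refined KB could be rendered into a plain-text rules file. The format below is illustrative, not Cursor's actual rules spec, and the KB keys are the hypothetical ones from the earlier sketches:

```python
def kb_to_rules(kb: dict) -> str:
    """Render KB hints as a plain-text rules file an agent tool can read."""
    lines = ["# Generated from the adaptive KB -- regenerate, don't hand-edit."]
    for f in kb.get("hot_files", []):
        lines.append(f"- Changes in this repo usually involve `{f}`; check it first.")
    for a, b in kb.get("co_change", []):
        lines.append(f"- `{a}` and `{b}` almost always change together.")
    return "\n".join(lines)
```

Because the output is just text, the same KB could target any tool's rules format with a different renderer, which is what makes this an attractive interop path.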


A Personal Note

On the tooling side: I've switched to Claude Code as my primary coding environment, and the workflow has been solid. The file-based context model maps cleanly onto what I'm building.

Today I also started using Wispr Flow for voice input, and it's been a genuinely pleasant experience so far — low friction, surprisingly accurate. Worth trying if you do a lot of writing alongside coding.

More results soon!