Benchmarking AI Code Generation — From −24 to +25 Net Tests with Codebase Context

Related: Aspect Code — the problem I'm solving and why context matters.

Open Source Benchmark: github.com/asashepard/Aspect-Bench

If you're going to claim "AI writes better code with structured context," you should probably prove it.

The question: Does giving an AI model structured repo knowledge (architecture, dependencies, flows) help it fix more tests, break less code, and work more efficiently?

The test: Run 15 tasks on FastAPI (greenfield) and Django (brownfield) repos, comparing baseline prompts vs prompts with Aspect Code's knowledge base.

The specific repos used for this benchmark: github.com/fastapi/full-stack-fastapi-template, github.com/djangopackages/djangopackages.


Methodology

Each task ran in two modes:

  1. Baseline — the task prompt and relevant code, with no extra context
  2. Aspect KB — the same prompt plus Aspect Code's knowledge base (architecture, dependencies, flows)

Same model, same task, same code — the only variable is whether the KB is included.

Test repos:

  1. FastAPI (greenfield) — clean architecture, modern patterns
  2. Django (brownfield) — legacy patterns, tighter coupling, realistic complexity

15 tasks per repo: Typical backlog items like refactoring services, adding caching, soft deletes, CSV exports, rate limiting, etc.

Models:

  1. Claude Sonnet 4
  2. Claude Opus 4.5

Each task ran under four conditions (2 models × 2 modes); with 15 tasks in each of the 2 repos, that's 120 runs in total. Temperature = 0.0, identical prompts except for KB content.

Metrics tracked:

  1. Net tests — tests passing after the edit minus tests passing before
  2. Tasks improved / tasks regressed
  3. Catastrophic failures — runs where the test harness broke entirely
  4. Average tokens per run
  5. Average lines of code changed per run
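To make the bookkeeping concrete, here's a minimal sketch of how these per-run metrics could be computed. The `RunResult` type and its field names are assumptions for illustration, not the benchmark's actual schema:

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    passed_before: int   # tests passing before the AI's edit
    passed_after: int    # tests passing after the AI's edit
    harness_ok: bool     # False => catastrophic (tests couldn't execute)
    tokens: int
    loc_changed: int

def summarize(runs: list[RunResult]) -> dict[str, float]:
    """Aggregate per-run results into the metrics reported in the tables."""
    net = sum(r.passed_after - r.passed_before for r in runs if r.harness_ok)
    return {
        "net_tests": net,
        "tasks_improved": sum(r.harness_ok and r.passed_after > r.passed_before for r in runs),
        "tasks_regressed": sum(r.harness_ok and r.passed_after < r.passed_before for r in runs),
        "catastrophic": sum(not r.harness_ok for r in runs),
        "avg_tokens": sum(r.tokens for r in runs) / len(runs),
        "avg_loc": sum(r.loc_changed for r in runs) / len(runs),
    }
```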


Results

The Big Picture: Net Tests by Configuration

| Configuration | FastAPI (Greenfield) | Django (Brownfield) |
|---|---|---|
| Baseline / Sonnet 4 | −31 | −28 |
| Baseline / Opus 4.5 | −2 | −22 |
| Aspect KB / Sonnet 4 | +17 | −20 |
| Aspect KB / Opus 4.5 | +28 | −3 |

The pattern is clear: every baseline configuration is net-negative, the Aspect KB flips the greenfield repo strongly positive, and it pulls the brownfield repo close to break-even.


FastAPI (Greenfield): Full Results

| Configuration | Net Tests | Tasks Improved | Tasks Regressed | Catastrophic | Avg Tokens/Run | Avg LOC/Run |
|---|---|---|---|---|---|---|
| Baseline / Sonnet 4 | −31 | 5 | 7 | 5 | 3,428 | 301 |
| Baseline / Opus 4.5 | −2 | 6 | 4 | 2 | 4,952 | 465 |
| Aspect KB / Sonnet 4 | +17 | 7 | 2 | 2 | 3,304 | 285 |
| Aspect KB / Opus 4.5 | +28 | 8 | 3 | 1 | 2,901 | 261 |

Key observations:

  1. Both models flip from net-negative to net-positive with the KB
  2. Regressed tasks drop sharply for Sonnet (7 → 2) and slightly for Opus (4 → 3)
  3. Opus + KB is the strongest configuration on every column: most tasks improved, fewest catastrophic failures, fewest tokens and LOC


Django (Brownfield): Full Results

| Configuration | Net Tests | Tasks Improved | Tasks Regressed | Catastrophic | Avg Tokens/Run | Avg LOC/Run |
|---|---|---|---|---|---|---|
| Baseline / Sonnet 4 | −28 | 3 | 6 | 3 | 2,871 | 235 |
| Baseline / Opus 4.5 | −22 | 4 | 6 | 2 | 3,510 | 302 |
| Aspect KB / Sonnet 4 | −20 | 3 | 4 | 1 | 3,001 | 232 |
| Aspect KB / Opus 4.5 | −3 | 4 | 4 | 0 | 3,425 | 286 |

Key observations:

  1. Every configuration stays net-negative, but the KB narrows the gap — Opus goes from −22 to −3
  2. Catastrophic failures fall from 3 to 1 (Sonnet) and from 2 to 0 (Opus)
  3. Token and LOC usage is roughly flat, unlike the large reductions on the greenfield repo


Efficiency Gains

Efficiency improvements with the Aspect KB on the FastAPI (greenfield) repo:

| Metric | Sonnet 4 | Opus 4.5 |
|---|---|---|
| Token reduction | ~4% | ~41% |
| LOC reduction | ~5% | ~44% |

The more capable the model, the more it benefits from structured context.
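These percentages can be reproduced from the FastAPI table with simple arithmetic (reduction = 1 − KB value ÷ baseline value):

```python
# Reproducing the reduction percentages from the FastAPI results above.
def reduction(baseline: float, with_kb: float) -> int:
    """Percentage reduction, rounded to the nearest whole percent."""
    return round(100 * (1 - with_kb / baseline))

print(reduction(4952, 2901))  # Opus tokens/run: 4,952 -> 2,901, ~41%
print(reduction(465, 261))    # Opus LOC/run: 465 -> 261, ~44%
print(reduction(3428, 3304))  # Sonnet tokens/run: 3,428 -> 3,304, ~4%
print(reduction(301, 285))    # Sonnet LOC/run: 301 -> 285, ~5%
```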


Catastrophic Failures: The Safety Story

"Catastrophic" = a run where the AI introduced errors that broke the test harness entirely (syntax errors, import failures, missing dependencies). Tests couldn't even execute.

| Configuration | FastAPI | Django | Total |
|---|---|---|---|
| Baseline / Sonnet 4 | 5 | 3 | 8 |
| Baseline / Opus 4.5 | 2 | 2 | 4 |
| Aspect KB / Sonnet 4 | 2 | 1 | 3 |
| Aspect KB / Opus 4.5 | 1 | 0 | 1 |

Opus 4.5 + Aspect KB had only 1 catastrophic failure across 30 tasks. Baseline Opus had 4, and baseline Sonnet had 8.
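Mechanically, this classification can be sketched with pytest's documented exit codes: 0 and 1 mean the suite actually ran (all passed / some failed), while 2–5 cover collection errors (including import failures), internal errors, usage errors, and empty collection. The wrapper below is illustrative, not the benchmark's actual harness code:

```python
import subprocess

def classify_exit_code(returncode: int) -> str:
    """Map a pytest exit code to the executed/catastrophic split used above."""
    # 0 = all tests passed, 1 = some tests failed: the harness itself worked.
    if returncode in (0, 1):
        return "executed"
    # 2 = interrupted (e.g. errors during collection, such as import failures),
    # 3 = internal error, 4 = usage error, 5 = no tests collected.
    return "catastrophic"

def classify_run(repo_dir: str) -> str:
    """Run a repo's test suite and classify the outcome (hypothetical helper)."""
    proc = subprocess.run(["python", "-m", "pytest", "-q", "--tb=no"], cwd=repo_dir)
    return classify_exit_code(proc.returncode)
```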


Limitations

The benchmark isolates the core hypothesis (does context help?) using single-pass runs; real usage, with iteration and human feedback, will likely perform better than these numbers suggest.

In rare cases, including the Aspect Code KB led the LLM to produce no code on the first pass and instead ask for clarification. This is good! The KB is helping the agent avoid hallucinating that it already has the right answer.


Takeaways

  1. Baseline LLMs break more than they fix — negative net tests in every configuration
  2. Context flips the outcome — on FastAPI, net tests swung from −31 (worst baseline) to +28 (best with the KB)
  3. Brownfield is harder — Django improved from −28 to −3 with the KB but stayed negative
  4. Better models benefit more — Opus 4.5 + KB cut tokens ~41%; Sonnet 4 + KB only ~4%
  5. The KB acts as a guardrail — Opus's catastrophic failures dropped 75% overall (4 → 1)

Opus improved by a greater margin with the Aspect KB than Sonnet did, suggesting that future models may benefit even more from structured knowledge as context.

Opus also seems better at interpreting the meaning of the context it's given, and that comprehension shows up in the output: on the greenfield repo, Opus made more surgical and more effective edits.

That last point is particularly exciting to me, because one of the main issues I've experienced is AI simply making too many changes, adding thousands of unnecessary lines of code.

Even though Opus made better use of the extra context, Sonnet also improved, which suggests that Aspect Code's structured codebase context is broadly helpful to an LLM-augmented workflow.

On other runs not included in this benchmark, I observed similar results with different programming languages and different LLM providers (more tests passing, fewer tokens and lines of code, fewer regressions and catastrophic breaks).


The Philosophy

In conclusion: this benchmark is limited in that it doesn't simulate a full agentic workflow, but it provides strong early evidence for the core hypothesis that structured codebase knowledge makes AI agent outputs better and safer.

Future benchmarking tests (once Aspect Code is larger and can afford them!) may include SWE-Bench Verified, App Bench, and other AI coding benchmarks either already existing or extended from this benchmark.

The Aspect Code KB isn't a linting report or a list of issues to fix. It's structured around three principles:

The goal is to give the model just enough structure to stay out of trouble, without overwhelming it with noise.

Aspect Code is still lacking true real-world data; at the time of writing, I'm the only user! Once I've finished preparing the VS Code extension, I'll be running a small pilot cohort.