Training Superhuman Software Architects: A Proposal
Bill Cox & CodeRhapsody
May 2026
I (Bill) am a software architect. It's my primary value to society and the reason I still have a job. In 2026, I'm still better at software design than machines — my AI coding agent CodeRhapsody can outpace me on implementation by orders of magnitude, but it cannot exceed my architectural judgment because its judgment is bounded by its training data.
This document describes how to change that. If it works, it probably puts me out of a job. I'm writing it anyway, because the alternative — knowing how to accelerate the field and sitting on it to protect my career — is not something I'm willing to do. The trajectory is obvious. Better to help steer it than to pretend it's not coming.
Where LLMs Are Already Superhuman
Before discussing limitations, it's worth stating clearly where large language models already exceed human capability — not by a small margin, but categorically.
Knowledge integration. A senior engineer might be expert in two or three languages, one or two frameworks, and a handful of architectural patterns drawn from their career. An LLM trained on the public corpus of human software knowledge has all of it simultaneously available. Every language. Every framework. Every documented architectural decision from every open-source project.
Polyglot fluency. LLMs can write idiomatic code in dozens of languages, switch between them mid-conversation, and translate patterns across language boundaries. CodeRhapsody has produced 240,000 lines of production Go in six months — a pace that exceeds what most individual engineers produce in years.
Context integration. This is the least appreciated superhuman capability. An LLM can load 200,000 tokens — an entire novel, an entire codebase — into its context window and attend to all of it simultaneously. It can see patterns across 100,000 words in a single inference pass. No human can hold that much material in active attention.
Languages. Beyond programming languages, LLMs are fluent in dozens of human languages. CodeRhapsody routinely engages in German and English in the same conversation. No individual human can match this breadth.
Where LLMs Hit the Ceiling
LLMs are, fundamentally, next-token predictors trained on human-generated data. This creates a hard ceiling: an LLM cannot be “smarter” than the data it was trained on.
- Knowledge: superhuman (trained on all of it)
- Pattern recognition: superhuman (sees all patterns simultaneously)
- Context integration: superhuman (200K tokens at once)
- Judgment: bounded by human judgment in the training data
- Architectural creativity: bounded by human architectural decisions in the training data
This ceiling is why human experts are still needed to supervise AI coding agents. The AI has more knowledge and broader pattern matching, but its judgment cannot exceed the best human judgment it was trained on. In practice, it averages across the distribution of human judgment, which means it often falls below the best humans.
This is not a fundamental limitation of machine learning. AlphaGo demonstrated that ML can exceed human ability when two conditions are met: (1) a reliable scoring function for outcomes, and (2) the system can generate its own training data through self-play. Go has a perfect scoring function: win or lose. Self-play generates superhuman strategy because the system explores regions that humans never visit.
Can we create the equivalent of AlphaGo's self-play loop for software architecture?
Why Architecture Is the Right Target
Code generation is approaching commodity. The bottleneck in AI-assisted development is not writing code — it is designing systems. How should this system be decomposed? What abstractions will survive contact with future requirements? Where should the boundaries be? These decisions determine whether a codebase remains maintainable for years or becomes unmaintainable in months.
If we could train an AI to make these decisions at a superhuman level, the relationship between human and AI would shift from supervision to collaboration on requirements — the human provides intent, the AI provides design and implementation.
The Scoring Problem
Nobody has built “AlphaGo for software architecture” because architecture quality has historically been too expensive to evaluate. To know whether an architecture is good, you have to live with it — implement it, add features, fix bugs, extend it as requirements change. This takes months of human labor per architecture.
AI agents change this equation. An agent can implement a feature request in minutes. A full evaluation that would take a human team months can be completed by agents in hours.
What We Can Measure
- Lines of code for equivalent functionality — less is better
- Type/struct count — fewer types for the same capability means better data model
- Change cost — lines changed per new requirement (the most important metric)
- Defect rate — failures under the same test suite and fuzzer
- Deletion resilience — remove a package, measure what breaks
- Code growth rate — linear vs. sublinear as features are added
- Modification speed — how quickly a fresh agent can correctly implement a change (proxy for readability)
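Two of these metrics reduce to simple arithmetic once an agent has logged a run. The sketch below is a minimal illustration, assuming the run records the lines changed for each new requirement and the total lines of code after each feature. The function names and the log-log fit used to classify growth are choices made here for illustration, not something the proposal specifies.

```python
import math
from typing import Sequence


def change_cost(lines_changed: Sequence[int]) -> float:
    """Mean lines changed per new requirement (lower is better)."""
    if not lines_changed:
        raise ValueError("need at least one requirement")
    return sum(lines_changed) / len(lines_changed)


def growth_exponent(loc_after_feature: Sequence[float]) -> float:
    """Least-squares slope of log(LOC) against log(feature count).

    Assumes LOC grows roughly like features**p and returns the fitted p.
    A value near 1 means linear growth; below 1 means sublinear growth.
    """
    n = len(loc_after_feature)
    if n < 2:
        raise ValueError("need at least two measurements")
    xs = [math.log(i) for i in range(1, n + 1)]
    ys = [math.log(loc) for loc in loc_after_feature]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Synthetic histories: one grows linearly, one with the square root of features.
linear = [100 * i for i in range(1, 9)]
sublinear = [100 * math.sqrt(i) for i in range(1, 9)]
```

On these synthetic histories the fitted exponent is about 1 for the first and about 0.5 for the second. Real histories are noisier, so the exponent is better treated as a trend indicator than as a precise measurement.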
Each metric is individually gameable: optimizing for lines of code alone, for instance, rewards dense, unreadable code. The scoring function must therefore be a composite, much as AlphaGo's value network folds every feature of a board position into a single estimate of who will win.
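One way to combine the metrics, sketched under loud assumptions: every metric is expressed as a cost, each is compared with a baseline architecture built for the same requirement sequence, and the weights are guesses. The `Metrics` fields, the `WEIGHTS` table, and `composite_score` are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class Metrics:
    """One architecture's measurements over one requirement sequence.

    Every field is a cost: lower is better. Field names are illustrative.
    """
    lines_of_code: float
    type_count: float
    change_cost: float           # mean lines changed per new requirement
    defect_rate: float           # fraction of failing test and fuzz cases
    deletion_breakage: float     # fraction of packages broken by a removal
    growth_exponent: float       # below 1 means sublinear code growth
    modification_minutes: float  # time for a fresh agent to land a change


# Hypothetical weights. Change cost is weighted highest because the text
# calls it the most important metric; the numbers themselves are guesses.
WEIGHTS = {
    "lines_of_code": 1.0,
    "type_count": 0.5,
    "change_cost": 3.0,
    "defect_rate": 2.0,
    "deletion_breakage": 1.0,
    "growth_exponent": 1.5,
    "modification_minutes": 1.0,
}


def composite_score(candidate: Metrics, baseline: Metrics) -> float:
    """Weighted sum of relative improvements over a baseline; higher is better.

    A cost cannot drop below zero, so an improvement in one metric earns
    at most that metric's weight, while a regression is penalized without
    bound. Gaming a single metric therefore cannot offset a large
    regression elsewhere.
    """
    score = 0.0
    for f in fields(Metrics):
        base = getattr(baseline, f.name)
        if base <= 0:
            raise ValueError(f"baseline {f.name} must be positive")
        cand = getattr(candidate, f.name)
        score += WEIGHTS[f.name] * (base - cand) / base
    return score


baseline = Metrics(10_000, 120, 400, 0.02, 0.30, 1.0, 20)
# Gaming one metric: half the code, but each new requirement costs 3x more.
gamed = replace(baseline, lines_of_code=5_000, change_cost=1_200)
```

Here `composite_score(gamed, baseline)` is -5.5: the halved line count earns 0.5, and the tripled change cost loses 6.0. A learned or tuned weighting would presumably replace the hand-picked table in practice.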
The Key Insight: Temporal Requirement Sequences
Architecture quality can only be evaluated against a sequence of requirements that arrive over time, not a fixed spec known in advance.
In real software projects, requirements are never complete upfront. They arrive as a sequence:
- Sprint 1: “Build a basic chat system”
- Sprint 3: “Add file sharing”
- Sprint 7: “Support real-time collaboration”
- Sprint 12: “Add end-to-end encryption”
- Sprint 18: “Support 10x the original user count”
A good initial architecture is one that gracefully absorbs requirements that nobody had thought of yet. The architecture chosen in Sprint 1 determines whether Sprint 12's encryption requirement costs a week or three months.
The training data already exists. Real projects have issue tracker timelines, PRD version histories, git histories tied to feature requests, and changelog progressions. From thousands of open-source projects, you can extract temporal requirement sequences — the ordered list of “what was asked for, and when.”
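A temporal requirement sequence needs very little structure: what was asked, when, and where the record came from. The sketch below assumes the records have already been pulled out of an issue tracker or changelog; the `Requirement` type, the source labels, and the dates in the example are placeholders, not data from any real project.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass(frozen=True)
class Requirement:
    arrived: date  # when the request first appeared
    text: str      # what was asked for
    source: str    # illustrative labels: "issue", "changelog", "prd"


def build_sequence(*sources: Iterable[Requirement]) -> List[Requirement]:
    """Merge records from several sources into one ordered sequence.

    Ordering is by arrival date. Python's sort is stable, so records
    that share a date keep the order in which they were supplied.
    """
    merged = [req for source in sources for req in source]
    return sorted(merged, key=lambda req: req.arrived)


# Placeholder records; the dates are invented for the example.
issues = [
    Requirement(date(2025, 3, 1), "Add file sharing", "issue"),
    Requirement(date(2025, 1, 10), "Build a basic chat system", "issue"),
]
changelog = [
    Requirement(date(2025, 6, 15), "Add end-to-end encryption", "changelog"),
    Requirement(date(2025, 4, 20), "Support real-time collaboration", "changelog"),
]
sequence = build_sequence(issues, changelog)
```

The merged sequence lists the chat system first and encryption last, regardless of which source each record came from.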
The Training Pipeline
For each requirement sequence R₁, R₂, ..., Rₙ:
- Present R₁ through Rₖ to the AI (the “initial requirements”)
- The AI generates an architecture and implements it
- Tests must pass — the binary gate
- Present Rₖ₊₁ (the next requirement, which nobody knew about). The AI extends the implementation. Measure change cost.
- Repeat for Rₖ₊₂ through Rₙ
- Score the full trajectory: cumulative change cost, defect rate, code growth, LOC, type count, deletion resilience
Run this many times with different architectural choices, holding the requirement sequence constant. The scoring differential between runs is the training signal.
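The loop above can be sketched as follows. This is a minimal illustration, not a training system: the `Agent` interface is hypothetical (the proposal does not define one), only change cost is recorded, and `FixedCostAgent` is a toy stand-in whose extensions always cost the same number of lines.

```python
from dataclasses import dataclass, field
from typing import List, Protocol, Sequence


class Agent(Protocol):
    """Hypothetical agent interface, assumed for this sketch."""

    def build(self, requirements: Sequence[str]) -> object: ...
    def extend(self, codebase: object, requirement: str) -> int: ...
    def tests_pass(self, codebase: object, requirements: Sequence[str]) -> bool: ...


@dataclass
class Trajectory:
    passed_gate: bool = True
    change_costs: List[int] = field(default_factory=list)

    @property
    def cumulative_change_cost(self) -> int:
        return sum(self.change_costs)


def run_trajectory(agent: Agent, requirements: Sequence[str], k: int) -> Trajectory:
    """Reveal the first k requirements up front, then the rest one at a time."""
    if not 0 < k <= len(requirements):
        raise ValueError("k must be between 1 and len(requirements)")
    known = list(requirements[:k])
    codebase = agent.build(known)
    trajectory = Trajectory()
    if not agent.tests_pass(codebase, known):
        trajectory.passed_gate = False  # binary gate: failing tests end the run
        return trajectory
    for requirement in requirements[k:]:
        # extend() is assumed to modify the codebase and return lines changed.
        lines_changed = agent.extend(codebase, requirement)
        known.append(requirement)
        if not agent.tests_pass(codebase, known):
            trajectory.passed_gate = False
            break
        trajectory.change_costs.append(lines_changed)
    return trajectory


class FixedCostAgent:
    """Toy stand-in for a real agent: every extension costs the same."""

    def __init__(self, cost: int) -> None:
        self.cost = cost

    def build(self, requirements):
        return list(requirements)

    def extend(self, codebase, requirement):
        codebase.append(requirement)
        return self.cost

    def tests_pass(self, codebase, requirements):
        return list(codebase) == list(requirements)


sequence = ["basic chat", "file sharing", "real-time collaboration",
            "end-to-end encryption", "10x users"]
run_a = run_trajectory(FixedCostAgent(50), sequence, k=2)
run_b = run_trajectory(FixedCostAgent(400), sequence, k=2)
# Same requirements, different architectures: the gap is the training signal.
signal = run_b.cumulative_change_cost - run_a.cumulative_change_cost
```

With two requirements known up front and three arriving later, the cheap architecture accumulates 150 changed lines and the expensive one 1,200, a differential of 1,050. A real pipeline would record the other trajectory metrics in the same loop.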
Training Curriculum: Git Histories as Training Data
Extract requirements from git commit chains. Git histories are everywhere — millions of open-source repositories with detailed commit sequences. An LLM can read a chain of commits and extract requirements at any level of granularity. One corpus produces training data at every stage of the curriculum.
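As a rough sketch of the first mechanical step, the code below reads a repository's history oldest-first and cuts it into commit chains. It assumes `git` is installed and on the path; the tab-separated format, the `Commit` type, and the function names are choices made for this example, and the hashes in the sample text are invented. The step where an LLM turns each chain into a requirement is not shown.

```python
import subprocess
from dataclasses import dataclass
from datetime import datetime
from typing import Iterator, List, Sequence

# Commit hash, author date (strict ISO 8601), and subject, separated by tabs.
LOG_FORMAT = "%H%x09%aI%x09%s"


@dataclass(frozen=True)
class Commit:
    sha: str
    authored: datetime
    subject: str


def read_log(repo_path: str) -> str:
    """Return the repository's history as text, oldest commit first."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--reverse", f"--format={LOG_FORMAT}"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def parse_log(log_text: str) -> List[Commit]:
    """Parse text produced with LOG_FORMAT into Commit records."""
    commits = []
    for line in log_text.splitlines():
        if not line.strip():
            continue
        sha, authored, subject = line.split("\t", 2)
        commits.append(Commit(sha, datetime.fromisoformat(authored), subject))
    return commits


def chains(commits: Sequence[Commit], length: int) -> Iterator[Sequence[Commit]]:
    """Yield consecutive, non-overlapping chains of at most `length` commits."""
    if length < 1:
        raise ValueError("length must be at least 1")
    for start in range(0, len(commits), length):
        yield commits[start:start + length]


# Invented sample in the same format that read_log() would return.
sample = (
    "a1\t2025-07-01T09:00:00+00:00\tAdd login form\n"
    "b2\t2025-07-02T10:30:00+00:00\tHash passwords\n"
    "c3\t2025-07-05T14:00:00+00:00\tAdd session expiry\n"
)
commits = parse_log(sample)
chain_lengths = [len(chain) for chain in chains(commits, 2)]
```

Varying `length` is what lets one history serve several levels of granularity: short chains keep commit-level detail, long chains can be summarized into a single feature-level requirement.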
Critical: the late stage, where the model exceeds human ability, requires that the training input contain absolutely nothing about architecture, data structures, algorithms, or code structure. Any architectural detail in the requirements reimposes the human ceiling. The entire point of late-stage training is that the model's design space is unconstrained by human architectural decisions.
Stage 1: Learning Mechanics
Fine-grained requirements from short commit chains (1-5 commits), including API signatures and data structure definitions. Unit tests. The model learns to implement architecture, not to invent it. This is scaffolding — the model needs mechanical fluency before its choices can be fairly evaluated.
Stage 2: Learning to Choose
Medium-grained requirements from longer commit chains (10-50 commits), abstracted to remove implementation details. “Add user authentication” not “add a bcrypt hash function to the user struct.” Integration tests. The model makes structural decisions and is scored on their downstream consequences. Self-play begins here.
Stage 3: Superhuman Design
Coarse-grained requirements from entire project histories, abstracted to pure functionality. “Support real-time collaboration.” Nothing about WebSockets, data models, or internal APIs. Functional tests only. Everything between the requirement and the test is the model's unconstrained design space.
The model may discover architectural patterns that no human has used — decompositions, abstractions, and structural strategies invisible from inside human experience, the same way AlphaGo discovered moves that no human player had considered in thousands of years of play.
The same git history produces training data at all three stages — commit-level detail for Stage 1, feature-level abstraction for Stage 2, product-level abstraction for Stage 3. The breakthrough happens at Stage 3, where the model designs in a space unconstrained by human precedent.
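The three stages can be written down as configuration, together with a crude check that later-stage requirements carry no design detail. This is a sketch under stated assumptions: the `Stage` fields and the term list are hypothetical, and a keyword match is only a stand-in for the much stronger filtering that "absolutely nothing about architecture" would demand.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    commit_range: Optional[Tuple[int, int]]  # None means the entire history
    tests: str
    architecture_in_prompt: bool  # may requirements mention design details?


CURRICULUM = [
    Stage("learning mechanics", (1, 5), "unit", True),
    Stage("learning to choose", (10, 50), "integration", False),
    Stage("superhuman design", None, "functional", False),
]

# Hypothetical and deliberately incomplete vocabulary of design-level terms.
ARCHITECTURE_TERMS = frozenset({
    "api", "apis", "bcrypt", "schema", "struct", "websocket", "websockets",
})


def leaked_terms(requirement: str) -> List[str]:
    """Return the design-level terms that appear in a requirement."""
    words = set(re.findall(r"[a-z]+", requirement.lower()))
    return sorted(words & ARCHITECTURE_TERMS)


def admissible(stage: Stage, requirement: str) -> bool:
    """A requirement is usable if the stage allows detail or none leaked."""
    return stage.architecture_in_prompt or not leaked_terms(requirement)


detailed = "Add a bcrypt hash function to the user struct"
abstract = "Add user authentication"
```

The detailed phrasing is admissible only in the first stage, while the abstract phrasing passes in all three. A production filter would more plausibly use a second model to judge leakage, since design intent can be expressed without any flagged word.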
The Full Vision
Today:
Human writes detailed spec → AI implements → Human reviews
With this proposal:
Human provides vague requirements →
Superhuman AI generates architecture + spec →
AI implements →
Tests validate →
New requirements arrive →
AI extends gracefully
The human's role shifts from supervising design to clarifying intent.
Beyond Software: The Recursive Improvement Loop
A superhuman software designer doesn't just write better applications. It can redesign the entire stack:
- Programming languages — better type systems, concurrency primitives, abstractions
- Compilers and runtimes — better code generation, memory management, performance
- Hardware-software interfaces — ISAs, kernels, device drivers
- ML infrastructure — training frameworks, data pipelines, distributed systems
- Data center architecture — networking, storage, scheduling at massive scale
Every component was designed by humans within human cognitive limits. Removing those limits doesn't produce incremental improvement. It produces a fundamentally different capability curve.
But a superhuman designer without a superhuman coding agent is an architect who cannot build. The coding agent is part of the loop. We have an 11-month proof of concept: CodeRhapsody has been continuously self-improving since July 2025 — adding skills, memory systems, sub-agent orchestration, and self-improvement loops. Each improvement made the next one easier and faster.
CodeRhapsody itself should be one of the first targets. We have the complete git history. A superhuman designer, given only the functional requirements CodeRhapsody satisfies, would almost certainly produce a better architecture than the one we built under human cognitive limits. This is particularly high-leverage because an AI coding agent operates under severe constraints — a limited context window, sequential tool execution, memory that must be compressed to fit. Excellent architecture is not optional; it is the difference between an agent that can manage a complex project and one that drowns in its own context.
The full recursive loop:
- Superhuman designer produces better architectures for the coding agent
- Improved coding agent implements designs faster and more reliably
- Faster implementation means faster evaluation cycles in the training pipeline
- Faster evaluation means the designer improves faster
- The cycle accelerates
This is the singularity in concrete terms. Not a vague “AI gets smarter.” A specific, traceable chain: superhuman architectural judgment → better languages and tools → better ML infrastructure → better training → better architectural judgment → repeat. Each cycle produces a measurably better system than the last, and each cycle is faster than the one before.
We don't know how far this goes. But the first step — training an AI to exceed human ability at software architecture — is achievable with the pipeline described in this document. Everything after that is a consequence.
Bill Cox & CodeRhapsody, May 2026