Running many coding agents in parallel hit coordination limits fast. A role-split system (planners create tasks; workers execute them) scaled to hundreds of concurrent agents and to runs lasting from about one week to more than three.
What actually happened
Started with flat, self-coordinating agents using a shared coordination file and locking
Locking throttled throughput and was brittle when agents failed or ignored locks
Switched to optimistic concurrency control; coordination got simpler but progress still stalled
Introduced role separation: planners generate tasks; workers implement them; a judge decides at the end of each cycle whether to run another iteration
Ran long experiments: a browser from scratch, a Solid→React migration, and product performance work
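The move from locking to optimistic concurrency control can be illustrated with a small sketch. This is not the system's actual coordination layer, which the summary does not describe in detail; the names Store, CoordinationState, VersionConflict, and claim_task are hypothetical, and the shared coordination file is modeled here as an in-memory object. The idea shown is that an agent reads the state along with a version number, prepares its update without holding any lock, and commits only if the version is unchanged, retrying otherwise.

```python
import threading
from dataclasses import dataclass, field


class VersionConflict(Exception):
    """Raised when the shared state changed between read and commit."""


@dataclass
class CoordinationState:
    # Hypothetical shape: a version counter plus a map of task -> owning agent.
    version: int = 0
    claims: dict = field(default_factory=dict)


class Store:
    """In-memory stand-in for the shared coordination file (an assumption)."""

    def __init__(self):
        self._state = CoordinationState()
        # Guards only the brief compare-and-swap; agents never hold it while working.
        self._guard = threading.Lock()

    def read(self):
        """Return a snapshot: (version, copy of claims)."""
        with self._guard:
            return self._state.version, dict(self._state.claims)

    def commit(self, expected_version, new_claims):
        """Write new_claims only if nobody committed since expected_version."""
        with self._guard:
            if self._state.version != expected_version:
                raise VersionConflict(
                    f"expected v{expected_version}, found v{self._state.version}"
                )
            self._state = CoordinationState(expected_version + 1, dict(new_claims))
            return self._state.version


def claim_task(store, agent, task, max_retries=5):
    """Try to claim a task optimistically; return True on success."""
    for _ in range(max_retries):
        version, claims = store.read()
        if task in claims:
            return False  # someone else already owns it
        claims[task] = agent
        try:
            store.commit(version, claims)
            return True
        except VersionConflict:
            continue  # state moved underneath us: re-read and retry
    return False
```

With this shape, a crashed agent cannot leave a lock held, because nothing is held between read and commit. The cost is wasted work on retries when many agents write at once, which is consistent with the note that coordination got simpler but progress still stalled.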
Key numbers
Over 1 million lines of code
1,000 files
Close to a week to build a web browser from scratch
Over three weeks for the Solid→React migration
+266K/-193K edits in that migration
Video rendering 25x faster via a Rust implementation
Why this was hard
Locks became a bottleneck; many agents spent time waiting instead of coding
The coordination layer was fragile when agents crashed or mishandled lock/state updates
Without hierarchy, agents avoided risky end-to-end tasks and churned on “safe” changes
Long-running autonomy introduced drift and tunnel vision, requiring periodic resets
How they solved it
Replaced flat coordination with a pipeline: planners define work; workers execute it
Made planning parallel and recursive by allowing planners to spawn sub-planners
Removed worker-to-worker coordination; workers focus only on their assigned task
Added a judge step each cycle to decide whether to continue before starting a fresh iteration
Dropped an “integrator” role after it created bottlenecks; workers handled conflicts themselves
Chose models per role; preferred GPT-5.2 over alternatives for extended autonomous work and planning
Iterated heavily on prompting to prevent pathological coordination and focus failures
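The pipeline described above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the actual harness: Task, Result, plan, run_cycles, and the example functions are hypothetical names, a nested dict stands in for a planner spawning sub-planners, and a thread pool stands in for parallel workers. Workers receive one task each and never talk to each other; the judge sees the results of a cycle and decides whether another fresh cycle should start.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    area: str
    description: str


@dataclass(frozen=True)
class Result:
    task: Task
    ok: bool


def plan(scope):
    """Flatten a scope tree into tasks.

    A nested dict plays the role of a sub-planner owning a narrower area;
    a list holds the task descriptions for that area.
    """
    tasks = []
    for area, child in scope.items():
        if isinstance(child, dict):
            tasks.extend(plan(child))  # recurse: sub-planner
        else:
            tasks.extend(Task(area, text) for text in child)
    return tasks


def run_cycles(make_scope, worker, judge, max_cycles=3, max_workers=4):
    """Run plan -> work -> judge cycles; return per-cycle results."""
    history = []
    for cycle in range(max_cycles):
        tasks = plan(make_scope(cycle))  # planning restarts fresh each cycle
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # Each worker sees only its own task; no worker-to-worker coordination.
            results = list(pool.map(worker, tasks))
        history.append(results)
        if not judge(results):  # judge decides whether to continue
            break
    return history


# Hypothetical example inputs, for illustration only.
def example_scope(cycle):
    return {"browser": {"html": ["parse tags"], "css": ["parse selectors", "cascade"]}}


def example_worker(task):
    return Result(task, ok=True)


def example_judge(results):
    # Continue only while some task is still failing.
    return not all(r.ok for r in results)
```

Calling run_cycles(example_scope, example_worker, example_judge) runs a single cycle of three tasks, because every example task succeeds and the judge then stops the loop. There is deliberately no integrator stage in the sketch, matching the note that the role was dropped.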
What changed
Hundreds of workers could push to the same branch with minimal conflicts
Large codebases remained understandable enough for new agents to contribute
A long-running agent delivered a merged Rust rewrite that made rendering 25x faster
The Solid→React migration passed CI and early checks but still needed careful review
Why this matters beyond this company
Parallelism alone doesn’t scale; coordination mechanisms can dominate throughput
Purely “flat” agent swarms can become conservative and fail to own hard end-to-end work
Simpler orchestration (role separation, fewer chokepoints) can outperform heavier process layers
Prompt design can drive system behavior as much as the harness or model choice
Stealable ideas
Split agents into planners (task discovery) and workers (task execution)
Avoid lock-heavy shared-state coordination; it can become the main bottleneck
Prefer per-role model selection instead of a single universal model
Use periodic fresh-start cycles to limit drift in long-running autonomous work
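Two of these ideas, per-role model selection and periodic fresh starts, can be sketched in a few lines. Everything here is an assumption made for illustration: ROLE_MODELS, model_for, and AgentSession are hypothetical names, the worker and judge model strings are placeholders rather than real identifiers, and the "summary" carried across a reset is a stub where a real system would presumably write a proper handoff note.

```python
from dataclasses import dataclass, field

# Per-role model table. Only the planner entry reflects the summary above;
# the other values are placeholders, not real model identifiers.
ROLE_MODELS = {
    "planner": "gpt-5.2",
    "worker": "<worker-model>",
    "judge": "<judge-model>",
}


def model_for(role):
    """Look up the model configured for a role; reject unknown roles."""
    try:
        return ROLE_MODELS[role]
    except KeyError:
        raise ValueError(f"unknown role: {role}") from None


@dataclass
class AgentSession:
    """Accumulates context and resets it after max_steps entries to limit drift."""

    role: str
    max_steps: int = 3
    context: list = field(default_factory=list)
    resets: int = 0

    def step(self, note):
        # Start fresh before the context grows past the configured budget.
        if len(self.context) >= self.max_steps:
            self.fresh_start()
        self.context.append(note)

    def fresh_start(self):
        # Carry over only a short summary instead of the full history.
        summary = f"summary of {len(self.context)} earlier steps"
        self.context = [summary]
        self.resets += 1
```

The reset trigger here is a simple step count. A judge verdict or a wall-clock budget could serve the same purpose; the summary does not say which trigger was used.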