Cursor built Bugbot to catch logic, performance, and security issues in PRs, then optimized it against an AI-scored “resolution rate.” Across 40 major experiments, the team raised that rate from 52% to over 70% while also flagging more bugs per run.
What actually happened
As coding agents improved, Cursor’s engineers were spending more of their time reviewing code, so they built a PR review agent.
Early work relied on internal qualitative polling to reduce false positives.
They introduced an AI metric (“resolution rate”) to measure whether authors fixed reported bugs by merge time.
They used the metric to run online/offline experiments and iterate through 11 shipped versions.
A shift to a fully agentic architecture delivered the largest gains.
Key numbers
40 major experiments since launch
Resolution rate: 52% to over 70%
Average bugs flagged per run: 0.4 to 0.7
Resolved bugs per PR: ~0.2 to ~0.5 (see the consistency note after this list)
Eight parallel passes in an early production flow
Version 1: July 2025; Version 11: January 2026
More than two million PRs reviewed per month
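A quick consistency check, assuming roughly one Bugbot run per PR: 0.4 bugs flagged × a 52% resolution rate ≈ 0.2 resolved per PR, and 0.7 × 70% ≈ 0.5, matching the before/after figures above.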
Why this was hard
Model improvements made reviews possible, but quality hinged on controlling false positives.
GitHub constraints required rate-limit monitoring and proxy-based infrastructure.
Each codebase had its own invariants that needed enforcing without hardcoding checks into the agent.
Many proposed improvements regressed metrics, contradicting intuition from qualitative review.
How they solved it
Ran eight parallel bug-finding passes with randomized diff order.
Bucketed similar findings, then used majority voting to filter weak signals.
Added a validator model step to catch false positives (the sketch after this list combines the parallel passes, voting, and validation).
Filtered unwanted categories and deduped against prior runs.
Rebuilt the Git integration in Rust; minimized fetched data; added batching and rate-limit monitoring (rate-limit sketch below).
Added “Bugbot rules” so teams can encode codebase-specific invariants (hypothetical example below).
Defined “resolution rate” via AI classification at merge time and spot-checked it with PR authors (measurement sketch below).
Evaluated changes online via real resolution rates and offline via BugBench.
Switched to an agentic loop that reasons, calls tools, and chooses where to dig deeper.
Moved from static to dynamic context discovery; tuned tool interfaces to shape behavior (agent-loop sketch below).
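To make the voting pipeline concrete, here is a minimal sketch under stated assumptions: `find_bugs` and `is_real_bug` stand in for the LLM calls, and exact-string matching stands in for Bugbot's bucketing of semantically similar findings. All names are illustrative, not Cursor's actual interfaces.

```python
import random
from collections import defaultdict
from typing import Callable

Finding = str  # e.g. "off-by-one in pagination loop (src/list.py:42)"

def review_pr(
    diff_hunks: list[str],
    find_bugs: Callable[[list[str]], list[Finding]],   # one bug-finding pass
    is_real_bug: Callable[[Finding], bool],            # validator model
    n_passes: int = 8,
    min_votes: int = 3,
) -> list[Finding]:
    """Run independent passes over the diff in randomized order, keep
    findings reported by several passes, then validate what survives."""
    votes: dict[Finding, int] = defaultdict(int)
    for _ in range(n_passes):
        # Randomized hunk order keeps the passes diverse.
        shuffled = random.sample(diff_hunks, k=len(diff_hunks))
        for finding in set(find_bugs(shuffled)):       # dedupe within one pass
            votes[finding] += 1
    # Majority voting: drop weak signals seen by too few passes.
    candidates = [f for f, v in votes.items() if v >= min_votes]
    # Validator step: a second model call tuned to suppress false positives.
    return [f for f in candidates if is_real_bug(f)]
```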
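For rate-limit monitoring, a sketch against GitHub's documented `/rate_limit` REST endpoint; the `floor` threshold and sleep policy are illustrative assumptions, not Cursor's infrastructure.

```python
import time
import requests

API = "https://api.github.com"

def remaining_core_calls(token: str) -> tuple[int, int]:
    """Return (remaining, reset_epoch) for the core REST rate limit."""
    resp = requests.get(
        f"{API}/rate_limit",
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    resp.raise_for_status()
    core = resp.json()["resources"]["core"]
    return core["remaining"], core["reset"]

def wait_if_near_limit(token: str, floor: int = 100) -> None:
    """Pause until the window resets once the remaining budget is too low."""
    remaining, reset = remaining_core_calls(token)
    if remaining < floor:
        time.sleep(max(0.0, reset - time.time()) + 1)
```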
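As an illustration of what a repo-specific rule might express, here is a hypothetical rules file; the format and every rule below are invented, not taken from Cursor's docs.

```
# hypothetical Bugbot rules file

- All access to the `orders` table goes through `OrderRepo`; flag raw SQL
  that touches `orders` anywhere else.
- Handlers that write to `payments` must hold `payment_lock`; flag writes
  outside the lock.
- Timestamps are stored in UTC; flag naive `datetime.now()` calls.
```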
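The metric itself fits in a few lines. `was_fixed` stands in for the AI classifier that judges, at merge time, whether the merged code addresses a reported bug; the signature is an assumption.

```python
from typing import Callable

def resolution_rate(
    reports: list[tuple[str, str]],          # (bug_description, merged_diff)
    was_fixed: Callable[[str, str], bool],   # AI classifier, run at merge time
) -> float:
    """Share of reported bugs the author actually fixed by merge."""
    if not reports:
        return 0.0
    resolved = sum(was_fixed(bug, diff) for bug, diff in reports)
    return resolved / len(reports)
```

Spot-checking the classifier against PR authors, as Cursor did, is what makes a metric like this trustworthy enough to optimize against.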
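Finally, the agentic loop: rather than a fixed pipeline over statically assembled context, the model chooses which tool to call next and decides when it has seen enough. A minimal sketch, with every name invented:

```python
from typing import Any, Callable

def agentic_review(
    pr_diff: str,
    llm_step: Callable[[list[dict]], dict[str, Any]],  # one model turn
    tools: dict[str, Callable[..., str]],              # e.g. read_file, grep_repo
    max_steps: int = 20,
) -> list[str]:
    """Loop until the model returns findings or the step budget runs out."""
    messages: list[dict] = [
        {"role": "user", "content": f"Review this diff for bugs:\n{pr_diff}"}
    ]
    for _ in range(max_steps):
        action = llm_step(messages)
        if "findings" in action:          # the model decided it has seen enough
            return action["findings"]
        # Dynamic context discovery: fetch only what the model asked for.
        result = tools[action["tool"]](**action["args"])
        messages.append({"role": "tool", "content": result})
    return []                             # step budget exhausted
```

Tuning what each tool returns and how results are truncated is how you shape the agent's behavior in this design.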
What changed
Resolution rate increased from 52% to over 70%.
Bugs flagged per run rose from 0.4 to 0.7 without a comparable rise in false positives.
Resolved bugs per PR more than doubled, from ~0.2 to ~0.5.
Bugbot now reviews more than two million PRs per month and runs on Cursor’s internal code.
Why this matters beyond this company
If you can’t measure “was this actually fixed,” qualitative review loops hit a ceiling.
Majority voting across diverse passes can raise confidence in LLM-found issues.
Agentic designs shift the bottleneck to prompting, tool design, and context retrieval.
Stealable ideas
Track a “resolution rate” at merge time, not just comment reactions.
Use parallel passes with randomized diff ordering plus majority voting.
Add a validator model step specifically for false-positive suppression.
Provide repo-specific rules to encode invariants without hardcoding logic.