Cursor built Bugbot to catch logic, performance, and security issues in PRs, then optimized it against an AI-scored “resolution rate.” Across 40 major experiments, the team raised that rate from 52% to over 70% while also flagging more bugs per run.
What actually happened
As coding agents improved, Cursor’s engineers were spending more of their time reviewing code, so they built a PR review agent.
Early work relied on internal qualitative polling to reduce false positives.
They introduced an AI metric (“resolution rate”) to measure whether authors fixed reported bugs by merge time.
They used the metric to run online/offline experiments and iterate through 11 shipped versions.
A shift to a fully agentic architecture delivered the largest gains.
Key numbers
40 major experiments since launch
Resolution rate: 52% to over 70%
Average bugs flagged per run: 0.4 to 0.7
Resolved bugs per PR: ~0.2 to ~0.5 (see the consistency note after this list)
Eight parallel passes in an early production flow
Version 1: July 2025; Version 11: January 2026
More than two million PRs reviewed per month
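A quick consistency check, assuming roughly one Bugbot run per PR: 0.4 bugs flagged × a 52% resolution rate ≈ 0.2 resolved per PR, and 0.7 × 70% ≈ 0.5, matching the before/after figures above.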
Why this was hard
Model improvements made reviews possible, but quality hinged on controlling false positives.
GitHub constraints required rate-limit monitoring and proxy-based infrastructure.
Each codebase had its own invariants that needed enforcing without hardcoding checks into the agent.
Many proposed improvements regressed metrics, contradicting intuition from qualitative review.
How they solved it
Ran eight parallel bug-finding passes with randomized diff order.
Bucketed similar findings, then used majority voting to filter weak signals.
Added a validator model step to catch false positives (the sketch after this list combines the parallel passes, voting, and validation).
Filtered unwanted categories and deduped against prior runs.
Rebuilt the Git integration in Rust; minimized fetched data; added batching and rate-limit monitoring (rate-limit sketch below).
Added “Bugbot rules” so teams can encode codebase-specific invariants (hypothetical example below).
Defined “resolution rate” via AI classification at merge time and spot-checked it with PR authors (measurement sketch below).
Evaluated changes online via real resolution rates and offline via BugBench.
Switched to an agentic loop that reasons, calls tools, and chooses where to dig deeper.
Moved from static to dynamic context discovery; tuned tool interfaces to shape behavior (agent-loop sketch below).
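To make the voting pipeline concrete, here is a minimal sketch under stated assumptions: `find_bugs` and `is_real_bug` stand in for the LLM calls, and exact-string matching stands in for Bugbot's bucketing of semantically similar findings. All names are illustrative, not Cursor's actual interfaces.

```python
import random
from collections import defaultdict
from typing import Callable

Finding = str  # e.g. "off-by-one in pagination loop (src/list.py:42)"

def review_pr(
    diff_hunks: list[str],
    find_bugs: Callable[[list[str]], list[Finding]],   # one bug-finding pass
    is_real_bug: Callable[[Finding], bool],            # validator model
    n_passes: int = 8,
    min_votes: int = 3,
) -> list[Finding]:
    """Run independent passes over the diff in randomized order, keep
    findings reported by several passes, then validate what survives."""
    votes: dict[Finding, int] = defaultdict(int)
    for _ in range(n_passes):
        # Randomized hunk order keeps the passes diverse.
        shuffled = random.sample(diff_hunks, k=len(diff_hunks))
        for finding in set(find_bugs(shuffled)):       # dedupe within one pass
            votes[finding] += 1
    # Majority voting: drop weak signals seen by too few passes.
    candidates = [f for f, v in votes.items() if v >= min_votes]
    # Validator step: a second model call tuned to suppress false positives.
    return [f for f in candidates if is_real_bug(f)]
```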
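For rate-limit monitoring, a sketch against GitHub's documented `/rate_limit` REST endpoint; the `floor` threshold and sleep policy are illustrative assumptions, not Cursor's infrastructure.

```python
import time
import requests

API = "https://api.github.com"

def remaining_core_calls(token: str) -> tuple[int, int]:
    """Return (remaining, reset_epoch) for the core REST rate limit."""
    resp = requests.get(
        f"{API}/rate_limit",
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    resp.raise_for_status()
    core = resp.json()["resources"]["core"]
    return core["remaining"], core["reset"]

def wait_if_near_limit(token: str, floor: int = 100) -> None:
    """Pause until the window resets once the remaining budget is too low."""
    remaining, reset = remaining_core_calls(token)
    if remaining < floor:
        time.sleep(max(0.0, reset - time.time()) + 1)
```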
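As an illustration of what a repo-specific rule might express, here is a hypothetical rules file; the format and every rule below are invented, not taken from Cursor's docs.

```
# hypothetical Bugbot rules file

- All access to the `orders` table goes through `OrderRepo`; flag raw SQL
  that touches `orders` anywhere else.
- Handlers that write to `payments` must hold `payment_lock`; flag writes
  outside the lock.
- Timestamps are stored in UTC; flag naive `datetime.now()` calls.
```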
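The metric itself fits in a few lines. `was_fixed` stands in for the AI classifier that judges, at merge time, whether the merged code addresses a reported bug; the signature is an assumption.

```python
from typing import Callable

def resolution_rate(
    reports: list[tuple[str, str]],          # (bug_description, merged_diff)
    was_fixed: Callable[[str, str], bool],   # AI classifier, run at merge time
) -> float:
    """Share of reported bugs the author actually fixed by merge."""
    if not reports:
        return 0.0
    resolved = sum(was_fixed(bug, diff) for bug, diff in reports)
    return resolved / len(reports)
```

Spot-checking the classifier against PR authors, as Cursor did, is what makes a metric like this trustworthy enough to optimize against.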
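Finally, the agentic loop: rather than a fixed pipeline over statically assembled context, the model chooses which tool to call next and decides when it has seen enough. A minimal sketch, with every name invented:

```python
from typing import Any, Callable

def agentic_review(
    pr_diff: str,
    llm_step: Callable[[list[dict]], dict[str, Any]],  # one model turn
    tools: dict[str, Callable[..., str]],              # e.g. read_file, grep_repo
    max_steps: int = 20,
) -> list[str]:
    """Loop until the model returns findings or the step budget runs out."""
    messages: list[dict] = [
        {"role": "user", "content": f"Review this diff for bugs:\n{pr_diff}"}
    ]
    for _ in range(max_steps):
        action = llm_step(messages)
        if "findings" in action:          # the model decided it has seen enough
            return action["findings"]
        # Dynamic context discovery: fetch only what the model asked for.
        result = tools[action["tool"]](**action["args"])
        messages.append({"role": "tool", "content": result})
    return []                             # step budget exhausted
```

Tuning what each tool returns and how results are truncated is how you shape the agent's behavior in this design.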
What changed
Resolution rate increased from 52% to over 70%.
Bugs flagged per run rose from 0.4 to 0.7 without a comparable rise in false positives.
Resolved bugs per PR more than doubled, from ~0.2 to ~0.5.
Bugbot now reviews more than two million PRs per month and runs on Cursor’s internal code.
Why this matters beyond this company
If you can’t measure “was this actually fixed,” qualitative review loops hit a ceiling.
Majority voting across diverse passes can raise confidence in LLM-found issues.
Agentic designs shift the bottleneck to prompting, tool design, and context retrieval.
Stealable ideas
Track a “resolution rate” at merge time, not just comment reactions.
Use parallel passes with randomized diff ordering plus majority voting.
Add a validator model step specifically for false-positive suppression.
Provide repo-specific rules to encode invariants without hardcoding logic.