Scale AI Tests Real Coding Agents: Cursor Tops at 61%

The first benchmark to rank agent-plus-harness combos reveals surprising winners beyond raw LLM performance.

James Roycroft-Davis · 3 min read · For operators

Scale AI just shipped the final piece of SWE Atlas, completing the first benchmark that measures actual coding agents, not just bare language models. The Refactoring Leaderboard joins Test Writing and Codebase Q&A to form a suite that tests agents across real engineering workflows.

Here's what matters: Cursor CLI plus Claude 3.5 Sonnet scored 61% on the combined benchmark. That's the top spot. But the second-place finisher tells the more interesting story.

The Harness Makes the Agent

The rankings show something counterintuitive. Raw model capability alone doesn't predict agent performance. A weaker model in a better harness can beat a stronger model with minimal scaffolding.

SWE-bench Pro already hinted at this pattern. Teams obsessed over GPT-4 versus Claude while ignoring the scaffolding around the models. Scale's new data points the same way: for most engineering tasks, the agent framework matters more than the underlying LLM.

The benchmark spans 284 tasks total. Codebase Q&A gets 124 tasks, Test Writing gets 90, and Refactoring gets 70. Each workflow tests different agent capabilities. Q&A measures code comprehension. Test Writing checks whether agents understand edge cases. Refactoring demands structural changes while maintaining functionality.
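How three workflow results roll up into one combined number depends on the weighting. The sketch below is an illustration, not Scale's published method: it assumes a task-weighted average and compares it with a plain average of the three workflows. The task counts come from the article; the per-workflow pass rates are hypothetical.

```python
# Task counts per workflow, as reported in the article.
TASK_COUNTS = {"codebase_qa": 124, "test_writing": 90, "refactoring": 70}


def task_weighted_score(pass_rates: dict) -> float:
    """Combined score where every task counts equally (assumed weighting)."""
    total_tasks = sum(TASK_COUNTS.values())
    solved = sum(TASK_COUNTS[name] * pass_rates[name] for name in TASK_COUNTS)
    return solved / total_tasks


def workflow_average_score(pass_rates: dict) -> float:
    """Combined score where every workflow counts equally (alternative weighting)."""
    return sum(pass_rates[name] for name in TASK_COUNTS) / len(TASK_COUNTS)


# Hypothetical pass rates, for illustration only. Not real leaderboard data.
example_rates = {"codebase_qa": 0.70, "test_writing": 0.60, "refactoring": 0.50}

if __name__ == "__main__":
    print(f"total tasks:      {sum(TASK_COUNTS.values())}")
    print(f"task-weighted:    {task_weighted_score(example_rates):.3f}")
    print(f"workflow average: {workflow_average_score(example_rates):.3f}")
```

Because Codebase Q&A holds the most tasks, an agent that is strongest there looks better under task weighting than under a per-workflow average. When comparing leaderboard numbers, check which one is being reported.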

Cursor's Edge

Cursor CLI combines Claude 3.5 Sonnet with custom tooling for code navigation and editing. The 61% score reflects consistent performance across all three task types. Most competitors excel at one workflow and stumble on others.

The second-place finisher (Scale hasn't yet disclosed which agent) scored 58%, close enough to suggest the race remains open. Third place drops to 52%, which suggests a clear top tier is emerging.

What separates the winners? Tool use. The top agents don't just generate code. They navigate codebases, run tests, and iterate on failures. Cursor's harness includes file search, syntax-aware editing, and test execution. These tools turn a language model into something approaching a junior engineer.
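The generate, test, iterate pattern is easy to sketch. The loop below is a minimal illustration of what a harness adds around a model, not Cursor's actual implementation. Every name in it (run_agent_loop, propose_patch, run_tests, the toy stand-ins) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class TestResult:
    passed: bool
    log: str  # failure output that is fed back to the model


def run_agent_loop(
    propose_patch: Callable[[str, str], str],
    run_tests: Callable[[str], TestResult],
    task: str,
    max_iterations: int = 3,
) -> Tuple[Optional[str], int]:
    """Ask for a patch, test it, and retry with the failure log as feedback.

    Returns (accepted patch or None, number of attempts used).
    """
    feedback = ""
    for attempt in range(1, max_iterations + 1):
        patch = propose_patch(task, feedback)
        result = run_tests(patch)
        if result.passed:
            return patch, attempt
        feedback = result.log  # the harness closes the loop here
    return None, max_iterations


# Toy stand-ins so the sketch runs without a real model or test suite.
def fake_model(task: str, feedback: str) -> str:
    return "fixed" if "off-by-one" in feedback else "buggy"


def fake_tests(patch: str) -> TestResult:
    if patch == "fixed":
        return TestResult(True, "")
    return TestResult(False, "AssertionError: off-by-one in range bound")


if __name__ == "__main__":
    print(run_agent_loop(fake_model, fake_tests, "fix pagination"))
```

In this toy run the first patch fails, the failure log goes back to the model, and the second attempt passes. A bare model call stops after the first attempt, which is the difference the harness makes.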

Beyond Issue Resolution

Previous benchmarks like SWE-bench focused on fixing GitHub issues. Real engineering involves more than bug fixes. Scale designed SWE Atlas to capture the full development cycle.

Codebase Q&A tasks ask agents to explain complex systems, with questions like "What does this authentication flow do?" or "How does the caching layer work?" Test Writing requires understanding both happy paths and edge cases. Refactoring challenges agents to improve code structure without breaking functionality.

The results suggest current agents handle isolated tasks better than integrated workflows. Even Cursor at 61% leaves substantial room for improvement. No agent yet matches a competent human engineer across all three domains.

What This Means for Operators

If you're evaluating coding agents, stop comparing raw LLMs. The harness determines real-world performance more than the model. Cursor's win demonstrates this clearly.

Look for agents that include robust tooling. File navigation, test execution, and iterative debugging separate toys from tools. The 61% top score also sets expectations: even the best agent fails a large share of tasks, so these agents augment engineers but don't replace them.

Scale plans quarterly updates to SWE Atlas. As agents improve, the benchmark will add harder tasks. The current leaders might not stay on top. But one pattern seems clear: the age of bare LLM comparison is over. The harness is the product.

Sources

  1. SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution | Scale Labs
  2. SWE Atlas is Complete: Measuring Coding Agents Across the Engineering Loop | Scale AI
  3. SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution | Cool Papers - Immersive Paper Discovery
  4. Scale Labs debuts new Refactoring Leaderboard for AI
  5. What Does SWE-bench Pro Reveal About Agent Scaffold Performance? | BSWEN
