
Every AI Model Just Failed This New Coding Test

Meta and Stanford researchers released ProgramBench, where frontier models scored 0% at rebuilding programs from nothing but a compiled binary and its documentation.

James Roycroft-Davis · 3 min read · For operators

Meta's Superintelligence Labs just dropped a reality check on the AI industry. Their new benchmark, ProgramBench, asks language models to do something deceptively simple: look at a compiled program and its documentation, then rebuild it from scratch. Every frontier model tested scored exactly zero.

The test strips away everything modern AI coding assistants rely on. No source code. No internet access. No decompilation tools. Just a binary file with run-only permissions and whatever documentation ships with the program. It's the coding equivalent of asking someone to rebuild a car engine with the hood welded shut: they can drive the car and read the owner's manual, but they can never look inside.

The Setup That Broke Every Model

ProgramBench comes from the same team that created SWE-bench, the previous gold standard for testing AI coding abilities. But where SWE-bench lets models fix bugs in existing codebases, ProgramBench demands something harder: reverse-engineering entire programs without seeing the original code.

The benchmark includes 161 programs across multiple languages. Some are simple command-line utilities. Others are full applications with graphical interfaces. The agent gets the compiled binary, any included documentation, and nothing else. Success means writing new source code that builds into a functionally equivalent program, one that passes the same test suite as the original.
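
The pass criterion described above amounts to differential testing: run the reference binary and the rebuilt program on the same inputs and compare what each one does. Here is a minimal Python sketch, assuming a command-line program whose observable behavior is its exit code and standard output. The names `run` and `behavioral_match_rate` and the stand-in commands are hypothetical and are not taken from ProgramBench's actual harness.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Observable behavior of one invocation."""
    returncode: int
    stdout: str
    stderr: str


def run(cmd, stdin_text="", timeout=10.0):
    """Run a command as a black box and capture its exit code and output."""
    proc = subprocess.run(
        cmd, input=stdin_text, capture_output=True, text=True, timeout=timeout
    )
    return RunResult(proc.returncode, proc.stdout, proc.stderr)


def behavioral_match_rate(reference_cmd, candidate_cmd, cases):
    """Return the fraction of cases where the candidate behaves like the reference.

    Each case is a pair (extra_args, stdin_text). Assumption: only the exit
    code and stdout are compared; stderr wording is ignored.
    """
    if not cases:
        return 0.0
    matches = 0
    for extra_args, stdin_text in cases:
        ref = run(reference_cmd + extra_args, stdin_text)
        cand = run(candidate_cmd + extra_args, stdin_text)
        if (ref.returncode, ref.stdout) == (cand.returncode, cand.stdout):
            matches += 1
    return matches / len(cases)


if __name__ == "__main__":
    # Stand-ins for a real binary and a rebuilt program (hypothetical):
    # both are tiny Python one-liners that upper-case their input.
    reference = [sys.executable, "-c",
                 "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    candidate = [sys.executable, "-c",
                 "import sys; print(sys.stdin.read().upper(), end='')"]
    cases = [([], "hello\n"), ([], "Mixed Case\n"), ([], "")]
    print(behavioral_match_rate(reference, candidate, cases))
```

Under a strict criterion like the one the article describes, a rebuild would count as a success only when this rate is 1.0 across the whole test suite. A real harness would also need to handle files, graphical interfaces, and nondeterministic output, which this sketch leaves out.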

Claude 3.5 Sonnet, GPT-4o, Gemini 1.5 Pro, and every other tested model failed to correctly rebuild even one program. Not a single success across thousands of attempts.

Why This Matters for AI Hiring

For the past year, vendors have pitched "autonomous software engineers" to enterprises. The sales decks promise AI agents that can build entire applications from natural language descriptions. ProgramBench suggests these claims need serious recalibration.

"The results show current models lack fundamental program understanding," the researchers write. Models that ace coding interviews and fix bugs still can't grasp what a program actually does without seeing its source code.

This has immediate implications for how companies evaluate AI engineering talent. If you're hiring someone to lead AI-assisted development, they need to understand these limitations. The person who promises their team can ship faster by letting AI handle the architecture might not grasp what current models can and can't do.

The Cleanroom Problem

ProgramBench tests what researchers call "cleanroom implementation": building software that matches existing functionality without access to the original code. It's a common scenario in enterprise software development, especially when replacing legacy systems or ensuring compatibility with proprietary formats.

Human programmers do this regularly. They study how a program behaves, read its documentation, test edge cases, and gradually piece together how it must work internally. On ProgramBench, no current AI model carried this process through to a working rebuild of even one program.
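
The human workflow described above can be sketched as a loop: probe the program with inputs, record what it does, guess at its rule, and look for inputs that break the guess. The Python below is an illustrative sketch only. The helper names `observe` and `counterexamples` and the line-reversing stand-in program are assumptions made for the example, not part of the benchmark.

```python
import subprocess
import sys


def observe(cmd, inputs, timeout=10.0):
    """Feed each input to the black-box program; record (exit code, stdout)."""
    observations = {}
    for text in inputs:
        proc = subprocess.run(
            cmd, input=text, capture_output=True, text=True, timeout=timeout
        )
        observations[text] = (proc.returncode, proc.stdout)
    return observations


def counterexamples(observations, hypothesis):
    """Return the inputs where the guessed rule disagrees with observed stdout."""
    return [
        text
        for text, (_code, stdout) in observations.items()
        if hypothesis(text) != stdout
    ]


def reverse_whole_input(text):
    """First guess: the program reverses its entire input."""
    return text[::-1]


def reverse_each_line(text):
    """Refined guess: it reverses each line and always ends lines with a newline."""
    return "".join(line[::-1] + "\n" for line in text.splitlines())


if __name__ == "__main__":
    # Hypothetical stand-in for an opaque binary: reverses every input line.
    black_box = [
        sys.executable, "-c",
        "import sys\nfor line in sys.stdin:\n    print(line.rstrip('\\n')[::-1])",
    ]
    # Edge cases: empty input, missing trailing newline, multiple lines.
    probes = ["", "abc\n", "abc", "a\nb\n"]
    seen = observe(black_box, probes)
    print(counterexamples(seen, reverse_whole_input))
    print(counterexamples(seen, reverse_each_line))
```

The first guess is contradicted as soon as an input contains a newline, which is the kind of edge case a careful engineer probes early. The refined guess survives every probe in this small set, though surviving a handful of probes is evidence about behavior, not proof of equivalence.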

The failure mode is telling. Models either produce syntactically invalid code or create programs that compile but do nothing resembling the original functionality. They can't infer program structure from behavior, can't deduce algorithms from outputs, and can't reconstruct logic from documentation.

What Zero Percent Actually Means

These aren't near misses. According to BenchLM's analysis, the models aren't getting 90% of the way there and failing on edge cases. They're producing completely unrelated programs or crashing entirely.

The gap between human and AI performance has rarely been this stark. Where a competent human programmer might successfully reverse-engineer 30-40% of these programs given enough time, every AI model sits at zero.

For AI operators, this benchmark reframes the conversation about autonomous coding. Instead of asking when AI will replace programmers, the question becomes: what fundamental capabilities do these models still lack? ProgramBench suggests the list is longer than many vendors want to admit.

The next time someone pitches you an "AI software engineer," ask them about ProgramBench. Their response will tell you whether they understand the current state of AI coding capabilities or whether they're selling science fiction.

Sources

  1. ProgramBench Benchmark Review: Why Top AI Models Score 0%
  2. ProgramBench: Can Language Models Rebuild Programs From Scratch?
  3. ProgramBench Benchmark Explained: Can LLMs Rebuild Programs From Binaries? | BenchLM.ai
  4. ProgramBench: Every Frontier LLM Scores 0% | aiHola
  5. ProgramBench: Can Language Models Rebuild Programs From Scratch? | alphaXiv
