
The Judgment Labs Model: Building a 'Continuous Improvement Layer' Team

Judgment Labs' $32M raise introduces a new team archetype focused on turning production failures into agent performance gains without retraining models.

Nick Lebesis · 3 min read · For builders

The Production Data Problem

Judgment Labs just raised $32M to solve what every AI builder discovers six months after launch: your agents get dumber in production, not smarter. The company, which closed seed and Series A rounds led by Lightspeed Venture Partners, is betting that the answer isn't better models — it's better teams.

The pitch is deceptively simple. Instead of waiting for GPT-5 or Claude 4, build a "continuous improvement layer" that turns your production data into performance gains. Judgment Labs reports that its platform is already running at "agent-native companies," though it has not named those customers.

This isn't about evaluation. It's about what happens after evaluation.

The Harness Loop Pattern

The best AI teams already do this. The AI Runtime documented how Harvey, Hippocratic, Anterior, and Azure SRE turn production failures into skills without touching their base models. They call it the "harness loop" — a continuous cycle where agents learn from their mistakes in real time.

Here's how it works at Harvey (the legal AI startup): When their agent drafts a contract clause that a lawyer rejects, that rejection becomes training data. Not for the model — for the routing logic, the prompt templates, the retrieval system. The agent gets better at that specific type of clause without a single GPU hour of retraining.
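To make the idea concrete, here is a minimal Python sketch of one pass through such a loop, assuming the simplest possible mechanism: reviewer rejections are stored per clause type and folded back into the prompt. The names (Rejection, Harness, record_rejection, build_prompt) are hypothetical and are not taken from Harvey or Judgment Labs; a real system would also adjust routing and retrieval.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Rejection:
    clause_type: str  # e.g. "indemnification" (hypothetical label)
    draft: str        # what the agent produced
    reason: str       # the reviewer's stated reason for rejecting it


@dataclass
class Harness:
    base_prompt: str
    # Lessons per clause type. This is the only state that changes;
    # the underlying model is never retrained.
    lessons: dict = field(default_factory=lambda: defaultdict(list))

    def record_rejection(self, rejection: Rejection) -> None:
        """Store the reviewer's reason as a lesson, skipping exact duplicates."""
        bucket = self.lessons[rejection.clause_type]
        if rejection.reason not in bucket:
            bucket.append(rejection.reason)

    def build_prompt(self, clause_type: str, request: str) -> str:
        """Assemble the prompt that would be sent to the unchanged base model."""
        parts = [self.base_prompt]
        notes = self.lessons.get(clause_type, [])
        if notes:
            parts.append("Avoid these previously rejected patterns:")
            parts.extend(f"- {note}" for note in notes)
        parts.append(f"Task: {request}")
        return "\n".join(parts)
```

After a call such as harness.record_rejection(Rejection("indemnification", draft, "Cap on liability was missing.")), the next prompt for that clause type carries the lesson, while prompts for other clause types are unaffected.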

The architecture has evolved. Knowlee's analysis shows that serious agent platforms now have seven distinct layers, not the single "LLM + tools" box from 2023. The continuous improvement layer sits between the execution layer and the evaluation layer, constantly adjusting based on production outcomes.

Building the Team

So what does a continuous improvement team actually look like? Based on patterns from teams already doing this:

The core trio: A production ML engineer who understands distributed systems, a data engineer who can build real-time pipelines, and what Harvey calls a "domain specialist" — someone who deeply understands the work the agent is supposed to do.

At Hippocratic (healthcare AI), this specialist is an MD. At Anterior (insurance), it's someone with a decade of claims processing experience. They're not annotating data — they're designing the feedback loops that matter.

The evaluation architect: Writing on Medium, Micheal Lanham argues that "LLM-as-judge is no longer enough on its own." Teams need someone who can build multi-layer evaluation systems: outcome evaluation (did the task succeed?), step evaluation (was each action correct?), and meta-evaluation (is our evaluation itself accurate?).
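The three layers can be sketched as three small metrics over the same set of runs. This is an illustrative Python sketch, not Lanham's implementation: the Step and Run types are hypothetical, and meta-evaluation is approximated here as the rate at which the automated outcome verdict agrees with a human label.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    action: str
    ok: bool  # verdict from a step-level check (hypothetical)


@dataclass
class Run:
    steps: List[Step]
    succeeded: bool                       # automated outcome verdict
    human_verdict: Optional[bool] = None  # human label, when one exists


def outcome_rate(runs: List[Run]) -> Optional[float]:
    """Outcome evaluation: share of runs the judge marked as successful."""
    if not runs:
        return None
    return sum(r.succeeded for r in runs) / len(runs)


def step_accuracy(runs: List[Run]) -> Optional[float]:
    """Step evaluation: share of individual actions marked correct."""
    steps = [s for r in runs for s in r.steps]
    if not steps:
        return None
    return sum(s.ok for s in steps) / len(steps)


def judge_agreement(runs: List[Run]) -> Optional[float]:
    """Meta-evaluation: how often the judge matches the human label."""
    labeled = [r for r in runs if r.human_verdict is not None]
    if not labeled:
        return None
    return sum(r.succeeded == r.human_verdict for r in labeled) / len(labeled)
```

A low judge_agreement value is a signal to fix the evaluator before trusting the other two numbers.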

The feedback engineer: This is the newest role. Composo describes building "an agent evaluation engine that gets better the more you use it." Someone needs to own this learning system — to ensure that every production interaction makes the next interaction slightly better.

The magic is in the mundane: logging every decision, tracking every correction, building pipelines that turn failures into improvements within hours, not months.
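As a rough illustration of that mundane plumbing, the Python sketch below writes decisions and corrections to an append-only JSON Lines log and then pairs each correction with the decision it overrides. The event shape and the function names (log_event, corrected_decisions) are assumptions made for this example, not a documented format from any of the companies mentioned.

```python
import io
import json
from datetime import datetime, timezone


def log_event(stream, kind, decision_id, payload):
    """Append one event ("decision" or "correction") as a JSON line."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "kind": kind,
        "decision_id": decision_id,
        "payload": payload,
    }
    stream.write(json.dumps(record) + "\n")


def corrected_decisions(lines):
    """Return (decision, correction) payload pairs for corrected decisions."""
    decisions, corrections = {}, {}
    for line in lines:
        rec = json.loads(line)
        if rec["kind"] == "decision":
            decisions[rec["decision_id"]] = rec["payload"]
        elif rec["kind"] == "correction":
            corrections[rec["decision_id"]] = rec["payload"]
    return [(decisions[i], corrections[i]) for i in corrections if i in decisions]


# Usage: an in-memory buffer stands in for a real log file or stream.
log = io.StringIO()
log_event(log, "decision", "d1", {"route": "retrieval_v1"})
log_event(log, "decision", "d2", {"route": "retrieval_v1"})
log_event(log, "correction", "d2", {"route": "retrieval_v2"})
pairs = corrected_decisions(log.getvalue().splitlines())
```

The resulting pairs are the raw material for the improvement pipeline: each one records what the agent chose and what a human said it should have chosen.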

Judgment Labs' $32M raise is a bet that most teams can't build this infrastructure alone. That bet is probably right. But the real insight is the pattern itself: turning production data into performance gains without retraining.

The best AI agents don't ship once and hope. They ship, fail, learn, and improve — all without their creators touching the base model.

Sources

  1. Judgment Labs Closes $32M in Seed and Series A Funding to Build the Continuous Improvement Layer for AI Agents
  2. How Vertical Agents Self-Improve in Production
  3. AI Agent Platform Architecture 2026: Reference Patterns + Layer Decomposition (Knowlee Blog)
  4. Why Your Agent Evaluation Stack is About to Get Weirder (and Better), by Micheal Lanham (Medium, Apr 2026)
  5. From One Judge to a Learning System (Composo Blog)
