Agent Harnesses: The AI Control Plane
The harness - not the model - determines production reliability. A deep look at what agent harnesses are, what they provide, and the top frameworks for general use and for coding.

The 98.4% Nobody Talks About
When a developer uses Claude Code to autonomously fix a bug, most people credit "the AI" - meaning the underlying language model. But a 2026 VILA-Lab research analysis of Claude Code's source code found something striking: only 1.6% of the codebase constitutes actual AI decision logic. The remaining 98.4% is infrastructure.
That infrastructure has a name: the agent harness.
A harness is the runtime control plane that wraps raw LLM inference and turns a stateless token predictor into a system that can autonomously act over time. The same underlying model - Claude, GPT-5, Gemini - performs radically differently depending on which harness surrounds it. Understanding harnesses is, therefore, understanding where the real engineering leverage in AI systems lives.
What a Harness Actually Provides
Raw inference gives you: send text in, get text out.
A harness adds the following on top of that:
| Layer | What It Does |
|---|---|
| Context Management | Assembles the context window each turn - what history to include, which tool results to feed back, which files are relevant, and when to compress or summarize when the token limit approaches. |
| Tool Dispatch | Catches tool calls emitted by the model, routes them to the correct executor (bash, browser, database, API), runs them in a sandboxed environment, and injects the result back into the next model call. This is the Reason → Act → Observe loop. |
| Memory Systems | Manages working memory (current task state), short-term episodic memory (conversation buffer), and long-term storage (vector stores for semantic recall, key-value stores for fast fact lookup). |
| Skill / Prompt Loading | Loads system prompts, persona definitions, few-shot examples, and capability files at session start. Claude Code reads CLAUDE.md; Cursor reads .cursorrules; Cline reads .clinerules. |
| Session & State Control | Checkpointing, resume after failure, conversation threading, and durable execution across long-running tasks. |
| Permission & Safety Gates | Evaluates whether an action requested by the model (file write, shell command, API call) is permitted before the harness executes it. |
| Multi-Agent Coordination | Spawns sub-agents, manages handoffs, aggregates results, and prevents context pollution between specialist agents. |
The architectural consequence: the four teams studied in the VILA-Lab paper, working independently, converged on nearly identical harness structures when building production coding agents. The patterns are converging because the problems they solve are universal.
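To make the Reason → Act → Observe loop from the table concrete, here is a minimal sketch of tool dispatch with a simple permission check. Everything in it is illustrative: `fake_model`, the `TOOLS` registry, and the `ALLOWED` set are hypothetical stand-ins, and no real harness is this small.

```python
from typing import Callable

# Hypothetical tool registry: tool name -> Python callable.
TOOLS: dict[str, Callable[[str], str]] = {
    "echo": lambda arg: arg,
    "upper": lambda arg: arg.upper(),
}
ALLOWED = {"echo", "upper"}  # permission gate: tools the harness may run


def fake_model(context: list[str]) -> dict:
    """Stand-in for an LLM call: requests one tool, then gives a final answer."""
    if not any(line.startswith("observation:") for line in context):
        return {"tool": "upper", "arg": "hello"}
    return {"final": context[-1].removeprefix("observation: ")}


def run_agent(task: str, model=fake_model, max_turns: int = 5) -> str:
    context = [f"task: {task}"]
    for _ in range(max_turns):
        step = model(context)                         # Reason
        if "final" in step:
            return step["final"]
        name = step["tool"]
        if name not in ALLOWED or name not in TOOLS:  # permission gate
            context.append(f"observation: tool {name!r} denied")
            continue
        result = TOOLS[name](step["arg"])             # Act
        context.append(f"observation: {result}")      # Observe
    return "stopped: turn limit reached"
```

Calling `run_agent("shout hello")` walks one full cycle: the stand-in model requests `upper`, the harness runs it, and the observation is fed back for the final answer.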
Part I - Top 10 General-Purpose Harnesses
These harnesses are model-agnostic and designed to orchestrate agents for any domain - research, automation, data pipelines, customer service, and beyond.
Overview
| Framework | Builder | Stars | License | Core Idea |
|---|---|---|---|---|
| LangGraph | LangChain AI | ~55K | MIT / SaaS | Explicit directed graphs with durable state |
| AutoGen / AG2 / MAF | Microsoft / Community | 50K+ | MIT / Apache 2.0 | GroupChat patterns → typed graph workflows |
| CrewAI | CrewAI Inc. | ~48K | MIT | Role-based Crews + deterministic Flows |
| Letta | Letta Inc. | - | Apache 2.0 | OS-style virtual memory paging for context |
| Semantic Kernel / MAF | Microsoft | ~28K | MIT | Enterprise plugin model, .NET / Python / Java |
| LlamaIndex | LlamaIndex Inc. | ~50K | MIT | Data-native agents with best-in-class retrieval |
| Haystack | deepset | ~25K | Apache 2.0 | Type-safe, auditable pipeline components |
| PydanticAI | Pydantic team | ~17K | MIT | Typed agents with dependency injection |
| DSPy | Stanford NLP | ~35K | MIT | Compiled prompts - optimization over hand-writing |
| Instructor | Jason Liu | ~13K | MIT | Structured output layer with auto-retry |
LangGraph
LangGraph's core bet - agents should be explicit directed graphs rather than implicit while-loops - proved highly influential. Nodes are LLM calls or Python functions; edges are conditional control flow. The framework persists agent state through failures and resumes from the exact node where execution stopped, making long-horizon autonomous runs practical without babysitting. Human-in-the-loop is a first-class primitive: you can pause, inspect, and modify agent state at any graph node. A March 2026 release added distributed runtime and deep agent templates. The commercial layer (LangGraph Platform) deploys these graphs to production with monitoring built in.
Best for: Enterprise multi-agent workflows where production reliability and explicit control flow are non-negotiable.
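As a rough illustration of the explicit-graph idea, the sketch below models nodes as functions, edges as functions that pick the next node, and persistence as a JSON checkpoint written after every step. It uses only the standard library and is not LangGraph's API; the node names, the `run` function, and the checkpoint format are assumptions made for this example.

```python
import json
from pathlib import Path
from typing import Callable

State = dict
Node = Callable[[State], State]


def draft(state: State) -> State:
    return {**state, "text": "draft", "attempts": state.get("attempts", 0) + 1}


def review(state: State) -> State:
    return {**state, "approved": state["attempts"] >= 2}


NODES: dict[str, Node] = {"draft": draft, "review": review}
# Conditional edges: each function inspects the state and names the next node.
EDGES: dict[str, Callable[[State], str]] = {
    "draft": lambda s: "review",
    "review": lambda s: "END" if s["approved"] else "draft",
}


def run(checkpoint: Path, start: str = "draft") -> State:
    # Resume from the saved node if a checkpoint exists, else start fresh.
    if checkpoint.exists():
        saved = json.loads(checkpoint.read_text())
        node, state = saved["next"], saved["state"]
    else:
        node, state = start, {}
    while node != "END":
        state = NODES[node](state)
        node = EDGES[node](state)
        # Persist after every node so a crash loses at most one step.
        checkpoint.write_text(json.dumps({"next": node, "state": state}))
    return state
```

Because the checkpoint records the next node together with the state, a process that dies mid-run can call `run` again with the same path and continue from where it stopped.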
AutoGen / AG2 / Microsoft Agent Framework (MAF)
AutoGen introduced GroupChat - a multi-agent conversation pattern where a selector determines which agent speaks next, enabling emergent collaborative reasoning. In 2026, Microsoft deprecated AutoGen in favour of the Microsoft Agent Framework (MAF), a production successor combining AutoGen's agent abstractions with Semantic Kernel's enterprise features: full async support, typed graph workflows, Azure AI Foundry integration, and a stable versioned API. The community preserved backward-compatible GroupChat semantics in the AG2 fork (ag2ai/ag2). AG2 is the research and prototyping choice; MAF is the enterprise production path.
Best for: Microsoft-centric engineering teams and conversational multi-agent scenarios at scale.
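The selector-driven conversation pattern can be sketched in a few lines. This is a toy under stated assumptions, not AutoGen's or AG2's API: the three agents return canned strings, and `select_speaker` is a hypothetical rule-based stand-in for what would normally be an LLM-driven choice.

```python
from typing import Callable

Agent = Callable[[list[str]], str]


def planner(history: list[str]) -> str:
    return "planner: split the task into steps"


def coder(history: list[str]) -> str:
    return "coder: implemented step 1"


def reviewer(history: list[str]) -> str:
    return "reviewer: DONE"


AGENTS: dict[str, Agent] = {"planner": planner, "coder": coder, "reviewer": reviewer}


def select_speaker(history: list[str]) -> str:
    """Hypothetical selector: a rule-based stand-in for an LLM-driven choice."""
    if not history:
        return "planner"
    last = history[-1]
    return "coder" if last.startswith("planner") else "reviewer"


def group_chat(max_rounds: int = 6) -> list[str]:
    history: list[str] = []
    for _ in range(max_rounds):
        speaker = select_speaker(history)     # who talks next?
        message = AGENTS[speaker](history)    # that agent sees the shared history
        history.append(message)
        if message.endswith("DONE"):          # simple termination signal
            break
    return history
```

The shared `history` list is the key design point: every agent reads the same transcript, and only the selector decides turn order.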
CrewAI
CrewAI is a lean, standalone Python framework - explicitly built without LangChain as a dependency - where agents receive a role, goal, and backstory and tasks are assigned within a Crew. Its dual architecture is the key differentiator: Crews optimize for collaborative autonomous intelligence (sequential, parallel, or hierarchical topologies), while Flows provide deterministic, event-driven workflow control suited to production reliability requirements. As of mid-2026: 47.8K stars, 27M downloads, 2+ billion agent executions, and 150+ enterprise customers. Native MCP and A2A (Agent-to-Agent protocol) support make it a strong choice for interoperable multi-agent ecosystems.
Best for: Production multi-agent systems where structured role assignments and predictable task delegation matter.
Letta (formerly MemGPT)
Letta emerged from the MemGPT research paper (UC Berkeley, 2023) with a fundamental architectural insight: LLM context should work like operating system virtual memory, actively paging information in and out based on what the agent needs right now. The implementation uses a two-tier memory model - main context (in-window memories accessed via function calling) plus external stores for recall (recent conversation with vector search) and archival (long-term facts with semantic retrieval). This enables agents that genuinely remember users across sessions. The Conversations API (January 2026) extended memory sharing across parallel agent interactions. Letta Code reached #1 on Terminal-Bench among model-agnostic open-source agents in December 2025.
Best for: Long-running stateful assistants where cross-session continuity and persistent memory are mission-critical.
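The paging idea can be illustrated with a toy two-tier store. This is a sketch of the concept only, not Letta's implementation: the `PagedMemory` class is hypothetical, and keyword overlap stands in for the vector and semantic search the paragraph describes.

```python
from collections import deque


class PagedMemory:
    """Toy two-tier memory: a bounded main context plus an archival store."""

    def __init__(self, main_capacity: int = 3):
        self.main: deque[str] = deque()
        self.archive: list[str] = []
        self.capacity = main_capacity

    def remember(self, fact: str) -> None:
        self.main.append(fact)
        while len(self.main) > self.capacity:
            # Page the oldest item out of the context window.
            self.archive.append(self.main.popleft())

    def recall(self, query: str) -> list[str]:
        # Keyword overlap stands in for semantic search (assumption).
        words = set(query.lower().split())
        hits = [f for f in self.archive if words & set(f.lower().split())]
        for fact in hits:
            # Page matching facts back into the main context.
            self.archive.remove(fact)
            self.remember(fact)
        return hits
```

In a real agent the model itself would trigger `recall` through a function call; here the caller does it directly to keep the example self-contained.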
Semantic Kernel / Microsoft Agent Framework
Semantic Kernel introduced the plugin as a first-class primitive in AI orchestration: any decorated Python or C# function becomes automatically callable by the LLM. The planning subsystem decomposes tasks and selects appropriate plugin combinations. In late 2025, Microsoft announced SK and AutoGen would converge into MAF - effectively making MAF "Semantic Kernel v2.0." SK v1.x continues with security patches for enterprise teams who cannot migrate immediately. Its distinctive feature remains its enterprise-first design: stable versioned APIs, strong .NET support, built-in telemetry and middleware, and tight Azure integration.
Best for: Enterprise .NET and Python teams building Azure-native workflows where compliance, observability, and long-term API stability are required.
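The "decorated function becomes callable by the LLM" idea can be sketched with a plain registry. This is not Semantic Kernel's API: the `plugin` decorator, `REGISTRY`, and `invoke` are hypothetical names, and the call format is an assumption chosen for the example.

```python
import inspect
from typing import Callable

REGISTRY: dict[str, dict] = {}


def plugin(func: Callable) -> Callable:
    """Hypothetical decorator: records a function and its parameters as a tool."""
    params = {
        name: getattr(p.annotation, "__name__", str(p.annotation))
        for name, p in inspect.signature(func).parameters.items()
    }
    REGISTRY[func.__name__] = {
        "description": (func.__doc__ or "").strip(),
        "parameters": params,   # what a harness would advertise to the model
        "callable": func,
    }
    return func


@plugin
def add(a: int, b: int) -> int:
    """Add two integers."""
    return a + b


def invoke(call: dict):
    """Dispatch a model-emitted call of the form {"name": ..., "arguments": {...}}."""
    entry = REGISTRY[call["name"]]
    return entry["callable"](**call["arguments"])
```

The description and parameter types captured at decoration time are the metadata a planner would need to choose between plugins.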
LlamaIndex
LlamaIndex began as "GPT Index" - a framework for connecting LLMs to external data - and evolved into both a leading document intelligence platform and a full agent orchestration framework. The Workflows 1.0 release (2025) marked its agent transition: async-first, event-driven, step-based execution that can route between capabilities, parallelise work, and maintain state across steps. Its strongest differentiator remains data lineage: unmatched connectors for documents, databases, and APIs, best-in-class retrieval and RAG pipelines, and a data-aware architecture where retrieval is a first-class operation. ACP integration enables interoperability with other frameworks.
Best for: Document-heavy applications - legal tech, knowledge management, enterprise search - where retrieval quality is the critical path.
Haystack
Haystack, built by German AI company deepset, prioritises production correctness above all: its modular pipeline architecture gives engineers explicit, auditable control over every step, and typed component schemas catch invalid configurations at build time rather than at runtime. The built-in Agent component manages the full tool-calling loop with streaming, human-in-the-loop tool-call interception, and state management. A 2025 update lets Haystack agents be exposed as MCP servers via Hayhooks, enabling other agents to call Haystack pipelines as tools. Haystack's heritage in NLP and semantic search informs its production discipline.
Best for: Enterprise engineering teams prioritising observability, type safety, and pipeline auditability over developer-experience shortcuts.
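Catching invalid configurations at build time can be shown with a small typed pipeline. The `Component` and `Pipeline` classes below are hypothetical and far simpler than Haystack's own; the point is only that the type mismatch is raised when components are connected, before any data flows.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Component:
    name: str
    input_type: type
    output_type: type
    run: Callable


class Pipeline:
    def __init__(self):
        self.steps: list[Component] = []

    def add(self, component: Component) -> "Pipeline":
        # Validate the connection when the pipeline is built, not when it runs.
        if self.steps and self.steps[-1].output_type is not component.input_type:
            prev = self.steps[-1]
            raise TypeError(
                f"{prev.name} outputs {prev.output_type.__name__}, "
                f"but {component.name} expects {component.input_type.__name__}"
            )
        self.steps.append(component)
        return self

    def run(self, value):
        for step in self.steps:
            value = step.run(value)
        return value
```

For example, chaining a `str -> list` splitter into a `list -> int` counter builds cleanly, while chaining two counters fails in `add` with a `TypeError` naming both components.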
PydanticAI
PydanticAI is a ground-up agent framework from the Pydantic team (Samuel Colvin et al.) - the creators of the de facto Python data validation standard. Released stable in September 2025, it applies the "explicit types everywhere" philosophy to agents: (1) agents are typed Python objects; (2) tools use formal dependency injection via RunContext - typed, testable in isolation, swappable between production and test; (3) all outputs run through Pydantic validation with automatic retry on failure. The result is arguably the most testable agent framework available - individual tools can be unit-tested by injecting mock dependencies without spinning up an LLM. Provider-neutral: OpenAI, Anthropic, Google, Mistral, Ollama, and more.
Best for: Python engineering teams who already depend on Pydantic for data validation and want the same rigor in their agent layer.
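The testability claim rests on dependency injection, which can be sketched without the library. Note that `RunContext` below is a local stand-in class defined for this example, not an import from PydanticAI, and `WeatherClient`, `Deps`, and `weather_tool` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Generic, Protocol, TypeVar


class WeatherClient(Protocol):
    def temperature(self, city: str) -> float: ...


DepsT = TypeVar("DepsT")


@dataclass
class RunContext(Generic[DepsT]):
    """Illustrative stand-in for a run context that carries typed dependencies."""
    deps: DepsT


@dataclass
class Deps:
    weather: WeatherClient


def weather_tool(ctx: RunContext[Deps], city: str) -> str:
    # The tool reaches external services only through injected dependencies.
    return f"{city}: {ctx.deps.weather.temperature(city):.1f} C"


class FakeWeather:
    """Test double: no network and no LLM required."""

    def temperature(self, city: str) -> float:
        return 21.5
```

A unit test can then call `weather_tool(RunContext(Deps(weather=FakeWeather())), "Oslo")` directly, swapping in a real client only in production.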
DSPy
DSPy (Stanford NLP) takes the most radical departure from prompt-centric design: instead of writing prompts, you write typed signatures - input and output field names with descriptions - and DSPy's compiler finds the optimal prompts, few-shot examples, and fine-tuning targets automatically against a provided metric. The core insight is that prompts are hyperparameters, not source code, and should be found by optimization rather than hand-written. The optimizer suite (MIPROv2, GEPA - ICLR 2026 Oral) treats LLM pipelines as compilable programs. Built-in modules include Chain-of-Thought, ReAct, Retrieve, and multi-hop reasoning patterns. v2.5 added multi-module optimization and 3–5× faster compilation.
Best for: NLP research teams and ML engineers who want rigorous, measurement-driven optimization of complex multi-step pipelines.
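The "prompts are hyperparameters" idea reduces to scoring candidate instructions against a metric and keeping the best one. The sketch below does an exhaustive search over a fixed list, which is a deliberately simpler technique than MIPROv2 or GEPA and does not use DSPy's API; `fake_lm`, `CANDIDATES`, and `compile_prompt` are hypothetical.

```python
from typing import Callable

# Candidate instructions to search over; a real optimizer would propose these.
CANDIDATES = [
    "Answer the question.",
    "Answer with a single number.",
    "Think step by step, then answer.",
]

TRAINSET = [("2+2", "4"), ("3*3", "9")]
ANSWERS = dict(TRAINSET)  # lookup table used by the stand-in model


def fake_lm(instruction: str, question: str) -> str:
    """Stand-in for an LLM: only the 'single number' instruction yields bare answers."""
    answer = ANSWERS[question]
    return answer if "single number" in instruction else f"The answer is {answer}."


def exact_match(prediction: str, gold: str) -> float:
    return float(prediction.strip() == gold)


def compile_prompt(candidates, trainset, lm: Callable, metric: Callable) -> tuple[str, float]:
    """Pick the instruction with the best average metric on the training set."""
    def score(instruction: str) -> float:
        return sum(metric(lm(instruction, q), a) for q, a in trainset) / len(trainset)

    best = max(candidates, key=score)
    return best, score(best)
```

The developer supplies only the data and the metric; which wording wins is an outcome of the search rather than a design decision.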
Instructor
Instructor is deliberately not a full agent framework - it is the structured output layer that sits under your agents and ensures that when the LLM is asked to return JSON, it returns valid, schema-conforming JSON. It wraps any LLM provider, adds Pydantic model validation on top, and automatically re-asks the model on validation failure. Available in Python, TypeScript, PHP, Ruby, Elixir, and Go with a consistent API. Streaming partial results and iterables are supported. OpenAI cited Instructor as inspiration for their native structured output feature. At 6M+ monthly PyPI downloads, it is the most widely used structured output library in the ecosystem.
Best for: Any production LLM application that needs reliable structured data extraction - typically used as a component within larger harnesses.
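The validate-and-re-ask loop can be sketched with the standard library alone. This is not Instructor's API: a dataclass with hand-written checks stands in for Pydantic validation, and `extract_with_retry` and the `llm` callable are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class User:
    name: str
    age: int


def parse_user(raw: str) -> User:
    data = json.loads(raw)
    if (
        not isinstance(data, dict)
        or not isinstance(data.get("name"), str)
        or not isinstance(data.get("age"), int)
    ):
        raise ValueError("expected an object with 'name': str and 'age': int")
    return User(name=data["name"], age=data["age"])


def extract_with_retry(llm, prompt: str, max_retries: int = 2) -> User:
    message = prompt
    for _ in range(max_retries + 1):
        raw = llm(message)
        try:
            return parse_user(raw)
        except ValueError as err:  # also covers json.JSONDecodeError
            # Re-ask, feeding the validation error back to the model.
            message = f"{prompt}\nPrevious output was invalid: {err}. Return valid JSON."
    raise RuntimeError("model never produced valid output")
```

Feeding the validation error back into the next request is what distinguishes this from a blind retry: the model is told what was wrong with its previous output.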
Part II - Top Coding-Specific Harnesses
Coding harnesses add deeper IDE integration, code-aware context indexing, sandboxed execution, and git-centric workflows on top of the general patterns above.
Overview
| Harness | Builder | Stars | License | Core Idea |
|---|---|---|---|---|
| Claude Code | Anthropic | - | Proprietary | CLAUDE.md skills, 3-tier permissions, MCP-first |
| Aider | Paul Gauthier | 44K+ | Apache 2.0 | Git-first, repo map via Tree-sitter, Architect Mode |
| Devin | Cognition | - | Commercial | Full Linux sandbox + Windsurf IDE |
| SWE-Agent | Princeton NLP | ~19K | MIT | Agent-Computer Interface, SWE-bench |
| OpenHands | All Hands AI | 65K+ | MIT | Event stream architecture, Docker sandbox |
| GitHub Copilot Agent | GitHub/Microsoft | - | Commercial | Native GitHub integration, Actions-powered VM |
| Windsurf | Codeium / Cognition | - | Commercial | Codemaps, SWE-1.5 at 950 tok/s |
| Cursor | Anysphere | - | Commercial | Full IDE fork, 8 parallel cloud agents |
| Goose | Block / AAIF | 44.7K+ | Apache 2.0 | Rust, MCP-first, 70+ extensions, vendor-neutral |
| Cline | Open source | 61.2K | Apache 2.0 | VS Code native, 30+ providers, cost transparency |
| OpenCode | opencode-ai | 160K+ | Apache 2.0 | Go TUI, 75+ providers, LSP-native, AGENTS.md |
Claude Code
Anthropic's CLI-based coding harness wraps Claude models (Sonnet 4.6 and Opus 4.5) with the infrastructure that turns a language model into an autonomous software engineer. Its permission model has three tiers: auto-approved read-only operations, state-modifying actions requiring confirmation, and blocked dangerous operations. "Auto mode" uses a background Sonnet 4.6 classifier to evaluate whether state-modifying actions can proceed without prompting. The skill system is file-based: CLAUDE.md files at the project root or ~/.claude/ directory are loaded at session start and define persona, rules, tool permissions, and custom skill sets. A settings.json hooks system enables automated behaviors (pre/post-commit, tool interception). Sub-agent delegation spawns specialist Claude instances for parallel tasks. Dynamic workflows entered research preview in May 2026.
Best for: Terminal-native developers working with Claude who want the deepest Anthropic model integration and a configurable file-based skill system.
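A three-tier permission model of the kind described above can be approximated as a small classifier plus a gate. The rule sets and function names here are hypothetical and much cruder than anything Claude Code actually ships; the sketch only shows how the three outcomes differ.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"  # read-only: auto-approved
    ASK = "ask"      # state-modifying: needs confirmation
    DENY = "deny"    # dangerous: always blocked


# Hypothetical rule sets; a real harness would use far richer matching.
READ_ONLY = {"ls", "cat", "grep", "git status"}
BLOCKED_PREFIXES = ("rm -rf /", "sudo ", "mkfs")


def classify(command: str) -> Decision:
    cmd = command.strip()
    if cmd.startswith(BLOCKED_PREFIXES):
        return Decision.DENY
    if any(cmd == ro or cmd.startswith(ro + " ") for ro in READ_ONLY):
        return Decision.ALLOW
    return Decision.ASK


def gate(command: str, confirm=lambda c: False) -> bool:
    """Return True if the command may run; `confirm` models the user prompt."""
    decision = classify(command)
    if decision is Decision.ASK:
        return confirm(command)
    return decision is Decision.ALLOW
```

An "auto mode" would replace the `confirm` callback with a classifier call, while blocked commands stay blocked regardless of what `confirm` returns.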
Aider
Aider is the most downloaded open-source coding agent, with 44K+ GitHub stars and 6.6M+ PyPI installs. Its architectural signature is the repo map: a compressed structural index of the entire codebase built using Tree-sitter (class names, function signatures, imports, file relationships) that gives the LLM codebase-wide awareness without overwhelming the context window. The modular Coder system offers surgical line-level edits (EditBlockCoder), full file rewrites (WholeFileCoder), and multi-step refactors (ArchitectCoder). Architect Mode splits work between a planning model (typically a frontier model) and a faster execution model - improving quality on complex tasks while managing cost. Every change is automatically committed with a descriptive git message, creating a recoverable history. Genuinely model-agnostic: 40+ LLM providers, including local models via Ollama, across 100+ programming languages.
Best for: Terminal-native developers and teams that want maximum model flexibility, git-centric safety, and no vendor dependency.
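The repo-map idea, keeping signatures and dropping bodies, can be demonstrated for Python files using the standard `ast` module. This swaps in `ast` for Tree-sitter and handles a single language, so it is an illustration of the concept rather than Aider's implementation; `repo_map` is a hypothetical function.

```python
import ast


def repo_map(sources: dict[str, str]) -> str:
    """Summarise each file as its imports, classes, and function signatures."""
    lines = []
    for path, code in sorted(sources.items()):
        lines.append(f"{path}:")
        for node in ast.parse(code).body:
            if isinstance(node, (ast.Import, ast.ImportFrom)):
                lines.append(f"  {ast.unparse(node)}")
            elif isinstance(node, ast.ClassDef):
                lines.append(f"  class {node.name}")
                for item in node.body:
                    if isinstance(item, ast.FunctionDef):
                        lines.append(f"    def {item.name}({ast.unparse(item.args)})")
            elif isinstance(node, ast.FunctionDef):
                # Keep the signature, drop the body: this is the compression.
                lines.append(f"  def {node.name}({ast.unparse(node.args)})")
    return "\n".join(lines)
```

The output grows with the number of definitions rather than the number of lines of code, which is why a map like this can cover a whole repository inside a context window.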
Devin (Cognition)
The AI coding agent that defined the "autonomous software engineer" category in 2024. Devin accepts a task - a Jira ticket, a Slack message, a natural-language description - boots a sandboxed Linux VM with its own browser, terminal, and code editor, executes the full engineering workflow autonomously, and returns a pull request for human review. In late 2025, Cognition acquired Windsurf (formerly Codeium) for approximately $250M, integrating Devin's proprietary SWE-1.5 and SWE-grep models into the Windsurf IDE environment. Cognition's May 2026 funding round valued the company at $26B. Enterprise customers include Goldman Sachs, Citi, Ramp, and Palantir.
Best for: Enterprise engineering teams who want to delegate complete software tasks asynchronously and receive pull requests rather than inline suggestions.
SWE-Agent
Developed at Princeton NLP and Stanford, SWE-Agent introduced the Agent-Computer Interface (ACI) - a purpose-built OS-like layer designed for language models to interact with code systematically (browsing structure, reading documentation, examining tests) rather than through ad-hoc bash commands. Published at NeurIPS 2024. The most significant 2025 development is mini-SWE-agent: a 100-line implementation achieving 74%+ on SWE-bench Verified, demonstrating that the core agent loop can be radically simplified without sacrificing competitive performance. Citations from IBM, NVIDIA, Meta, and Anyscale reflect its research influence. Model-agnostic and MIT-licensed.
Best for: AI researchers, automated QA teams, and developers exploring self-healing CI pipelines.
OpenHands (formerly OpenDevin)
One of the most-starred open-source coding agent platforms, with 65K+ GitHub stars, OpenHands operates on an event stream architecture: all agent actions and observations are logged chronologically, giving the agent a complete, auditable history of everything it did. The agent can use a bash terminal, Jupyter notebook, and browser, executing everything inside a securely isolated Docker container spawned per session. The CodeAct pattern - having the agent generate and execute Python code as its primary action space - enables richer actions than fixed tool sets. Published as a conference paper at ICLR 2025. Claimed resolution rate: 87% of bug tickets on the same day. The MIT-licensed core can be self-hosted; a commercial cloud platform is also available.
Best for: Engineering teams wanting open-source autonomous coding infrastructure they can run, audit, and extend - and researchers studying agent behaviour at scale.
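An event stream of this kind is essentially an append-only log of actions and observations. The sketch below shows that shape with hypothetical `Event` and `EventStream` classes; it is not OpenHands' actual data model.

```python
from dataclasses import dataclass, field
from typing import Literal


@dataclass(frozen=True)
class Event:
    seq: int
    kind: Literal["action", "observation"]
    source: str
    payload: str


@dataclass
class EventStream:
    events: list[Event] = field(default_factory=list)

    def append(self, kind: str, source: str, payload: str) -> Event:
        event = Event(len(self.events), kind, source, payload)
        self.events.append(event)  # append-only: events are never edited
        return event

    def history(self) -> list[str]:
        """Chronological view the agent (or an auditor) can replay."""
        return [f"{e.seq} {e.kind} [{e.source}] {e.payload}" for e in self.events]
```

Because events are immutable and ordered by sequence number, the same log serves as the agent's working history and as an audit trail.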
GitHub Copilot Coding Agent
The only major coding harness where the agent lives natively inside the software collaboration platform itself. When assigned a task (via GitHub issue or Copilot Chat), the agent boots a VM, clones the repository, analyzes the codebase with RAG-powered GitHub code search, and executes in a GitHub Actions-powered environment - returning commits to a dedicated branch for human review. MCP integration is first-class: GitHub MCP and Playwright MCP are enabled by default without additional configuration. Copilot Spaces provide persistent context containers combining repositories, issues, documentation, and custom instructions. The model picker supports GPT-5.1-Codex-Max, Claude Opus 4.5, Gemini 2.0 Flash, and others. GA'd September 2025.
Best for: GitHub-native engineering teams who want zero-configuration agentic automation woven into their existing issue and PR workflow.
Windsurf (Codeium / Cognition)
An AI-native IDE - a VS Code fork - with Cascade as its core agentic engine. After Cognition's acquisition, Windsurf 2.0 added several differentiating features: Codemaps (live, semantically-aware diagrams of the codebase showing modules, relationships, and data flows with AI annotations); SWE-1.5 running at 950 tokens/second (13× faster than Claude Sonnet 4.5); an Agent Command Center (Kanban for tracking local and cloud agents); and Devin integration for fully autonomous cloud execution. The Memories feature learns individual coding patterns over ~48 hours of use. $82M ARR and 350+ enterprise customers at the time of acquisition.
Best for: Professional developers wanting deep codebase understanding combined with autonomous execution, unified in one IDE.
Cursor
Cursor is a full VS Code fork by Anysphere Inc. (four MIT co-founders). Its advantage over plugin-based tools is being the IDE itself: file access, terminal, multi-file diffs, and browser interactions operate with lower latency than any extension layer. The indexing system breaks files into semantically meaningful chunks, converts them to vector embeddings, and tracks changes via hashed file structures for efficient incremental updates. Cursor 3 (April 2026) introduced an agent management console supporting up to 8 parallel agents on isolated Git branches. Cursor 3.5 (May 2026) added Cloud Agents - isolated VMs that accept a task, run autonomously, and report progress back to the IDE. Anysphere, the fastest SaaS company to reach $100M ARR, raised a $2.3B Series D at a $29.3B valuation in November 2025.
Best for: Developers who want the most tightly integrated IDE experience with multi-agent parallel execution.
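Hash-based change tracking for incremental indexing can be sketched as a comparison of two snapshots. This uses a flat per-file SHA-256 map rather than a hashed tree, and the function names are hypothetical, so treat it as a simplified illustration of the idea rather than Cursor's indexing code.

```python
import hashlib


def file_hashes(files: dict[str, str]) -> dict[str, str]:
    """Map each path to a content hash."""
    return {path: hashlib.sha256(text.encode()).hexdigest() for path, text in files.items()}


def changed_paths(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Compare two hash snapshots to find what needs re-embedding."""
    return {
        # New or modified files must be chunked and embedded again.
        "reindex": sorted(p for p in new if old.get(p) != new[p]),
        # Deleted files must have their vectors dropped.
        "remove": sorted(p for p in old if p not in new),
    }
```

Only paths in `reindex` need new embeddings, so the cost of an update tracks the size of the change rather than the size of the repository.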
Goose (Block / AAIF)
Built originally by Block Inc. (Square, Cash App) and donated to the Agentic AI Foundation (AAIF) under the Linux Foundation umbrella, Goose is built in Rust for performance and ships as a native desktop app (macOS, Linux, Windows), a CLI, and an embeddable API. The architecture is MCP-first: a core agent loop, a provider abstraction layer, and an extension system built entirely on the Model Context Protocol, supporting 70+ extensions and 15+ LLM providers with no vendor lock-in. Recipes are reusable workflow templates. 44.7K+ stars with 350+ contributors and 100+ releases in its first year. Apache 2.0 licensed - fully free for commercial use, no usage telemetry by default.
Best for: Developers and organizations who need a genuinely open, vendor-neutral, locally-running coding agent with no dependency on any single LLM provider or cloud platform.
Cline
The most-starred open-source VS Code coding agent extension - 61.2K stars, 8M+ developers, 5M+ Marketplace installs - Cline reads project structure, makes coordinated changes across the codebase, and monitors linter and compiler output in real time, fixing import errors and type mismatches before the developer sees them. Terminal commands run directly in VS Code's integrated terminal; for long-running processes, Cline monitors output continuously and reacts as new output arrives. 30+ LLM providers (Anthropic, OpenAI, Google, AWS Bedrock, Azure, Ollama, Cerebras, DeepSeek, and more). Cline displays the exact token count and estimated dollar cost for every interaction - a transparency-first design that lets developers make informed cost decisions per session (a typical feature implementation: $0.50–$2.00). The Cline SDK enables custom multi-agent systems with Kanban boards and scheduled automations.
Best for: VS Code developers who want open-source autonomous coding with full model choice, cost visibility, and no vendor dependency.
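The per-interaction cost display comes down to simple arithmetic over token counts. The prices below are hypothetical placeholders, not any provider's actual rates, and `interaction_cost` is an illustrative function rather than part of Cline.

```python
from decimal import Decimal

# Hypothetical prices in USD per million tokens; real prices vary by provider and model.
PRICE_PER_MTOK = {"input": Decimal("3.00"), "output": Decimal("15.00")}


def interaction_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimated dollar cost of one interaction, rounded to four decimal places."""
    cost = (
        PRICE_PER_MTOK["input"] * input_tokens
        + PRICE_PER_MTOK["output"] * output_tokens
    ) / Decimal(1_000_000)
    return cost.quantize(Decimal("0.0001"))
```

Under these assumed prices, an interaction with 120,000 input tokens and 8,000 output tokens costs (3.00 × 120,000 + 15.00 × 8,000) / 1,000,000 = $0.48.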
OpenCode
The most-starred open-source coding agent in existence - 160K+ GitHub stars, 900+ contributors - OpenCode is a terminal-native agent built in Go with a polished TUI powered by Bubble Tea. Its architectural signature is a client-server daemon: opencode serve runs a headless OpenAPI server, decoupling the agent loop from the interface and enabling async, remote, and multi-session use across the same project simultaneously. AGENTS.md files (OpenCode's equivalent of CLAUDE.md) assemble context through a layered system - project-level, directory-level, and user-level files - giving the agent persistent codebase knowledge without re-explanation each session. LSP integration is a first-class runtime feature, not a bolted-on extra: OpenCode connects to your project's language server (gopls, typescript-language-server, etc.) and feeds type information, import paths, and live diagnostics directly to the model after every edit. Git-based snapshots with /undo and /redo make destructive edits recoverable. 75+ LLM providers supported including Anthropic, OpenAI, Google, AWS Bedrock, Azure, Groq, and local models via Ollama. Apache 2.0 licensed with no telemetry by default.
Best for: Terminal developers who want the fastest-growing open-source coding agent with native LSP diagnostics, provider freedom, and a daemon architecture suited to automation and remote use.
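Layered context assembly can be sketched as collecting AGENTS.md files from the most general location to the most specific. The precedence order, the location of the user-level file, and both function names are assumptions made for this example; OpenCode's actual resolution rules may differ.

```python
from pathlib import Path


def collect_agents_files(project_root: Path, working_dir: Path, user_dir: Path) -> list[Path]:
    """Return AGENTS.md files from most general to most specific (assumed order)."""
    layers = [user_dir / "AGENTS.md", project_root / "AGENTS.md"]
    current = project_root
    # Walk down from the project root to the working directory.
    for part in working_dir.relative_to(project_root).parts:
        current = current / part
        layers.append(current / "AGENTS.md")
    return [p for p in layers if p.is_file()]


def assemble_context(paths: list[Path]) -> str:
    """Concatenate the layers, labelling each with its source path."""
    return "\n\n".join(f"# {p}\n{p.read_text()}" for p in paths)
```

Placing the most specific file last means directory-level guidance appears closest to the task in the assembled prompt, which is one plausible way to let it refine project-wide rules.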
Key Trends (Mid-2026)
The harness is the product. Model capability is table stakes; harness reliability is the competitive moat. The VILA-Lab finding - 1.6% AI logic, 98.4% infrastructure - crystallizes what experienced practitioners already knew.
MCP is winning the tool integration war. LangGraph, CrewAI, Letta, Haystack, Claude Code, Cline, Goose, and GitHub Copilot all support the Model Context Protocol. The era of proprietary tool integrations is ending; MCP is becoming the standard plugin surface.
Memory remains the unsolved layer. Context management and cross-session memory are the most critical and least mature components. Letta's OS-paging model, Mem0's multi-hop retrieval, and LangMem's session management represent competing approaches with no clear winner yet.
Consolidation at the top of the coding stack. Cognition's acquisition of Windsurf and Microsoft's merger of AutoGen and Semantic Kernel into MAF signal that the agentic coding space is consolidating around vertically integrated stacks - autonomous cloud execution + IDE + proprietary models - rather than modular, interoperable components.
Open source holds its own. OpenCode (160K stars), Cline (61K stars), OpenHands (65K stars), Goose (44.7K stars), and Aider (44K stars) show that the open-source coding agent tier is healthy, growing, and increasingly MCP-native. The open-source moat is provider neutrality, cost transparency, and self-hosting - structural advantages proprietary tools cannot replicate.
Summary: Key Takeaways
- An agent harness is the runtime control plane that wraps LLM inference to provide context management, tool dispatch, memory, skill loading, session control, safety gates, and multi-agent coordination.
- A 2026 analysis of Claude Code found only 1.6% of the codebase is AI decision logic; the rest is harness infrastructure, and the convergence of independent teams on similar harness structures suggests this split is typical of production agent systems.
- General-purpose harnesses (LangGraph, CrewAI, Letta, PydanticAI, DSPy) differ primarily in their memory model, graph topology, and plugin system - not in the underlying reasoning loop.
- Coding harnesses (Claude Code, Aider, OpenHands, Cline, Goose, OpenCode) differ primarily in IDE integration depth, sandbox isolation, git-safety model, and provider neutrality.
- MCP is emerging as the universal plugin standard across both categories, with the majority of active harnesses now supporting it natively.
- The harness stack - not the model alone - is where AI product differentiation is being won and lost in 2026.
