CLI Agents 2026: Claude Code × Codex CLI and the Terminal-Bench Race

A June 2026 field guide to CLI coding agents, why Terminal-Bench 2.1 became the accepted benchmark, and how each major terminal-native agent fits into the stack.

The CLI agent category is exploding because coding agents are moving from chat sidebars into the terminal itself.

Agent | Company | Notes
Claude Code | Anthropic | Set the reference pattern: reads repo, edits files, runs commands, iterates on failures. Full plan→patch→test→explain loop.
Codex | OpenAI | Fast code synthesis + command execution. Best when the repo has strong tests and clear package scripts.
Grok | xAI | Direct xAI model access from the shell. Practical value depends on how well the harness manages context and local files.
Cursor | Cursor | Best known as AI editor; CLI adds the same agentic workflow for scripts, automation, and remote environments.
GitHub Copilot | GitHub / Microsoft | Natural Copilot extension into shell workflows. Especially useful for command explanation, Git help, and GitHub-native context.
OpenCode | OpenCode | Open, terminal-native, model-agnostic. For developers who want inspectability and no editor lock-in.
MiMo Code | Xiaomi | Xiaomi's entry into the terminal developer loop: understand repo, edit code, validate through commands.
Amp | Sourcegraph | Lightweight interface: ask for a change, let the agent inspect, review the diff, keep moving.
OpenClaude | Community | Another Claude-style terminal pathway. Part of the long tail proving the category is bigger than any single vendor.
Antigravity | Google | Google's agentic CLI direction. Judged by how reliably the harness manages files, commands, and verification.
Pi | Pi | Direct command-line interaction. Works in the same environment as local builds, package managers, and deployment scripts.
oh-my-pi | oh-my-pi | Extends Pi with shell-native ergonomics. Adoption depends on interface speed as much as benchmark scores.
Hermes Agent | Nous Research | Key for teams exploring open and research-driven model stacks rather than closed commercial agents.
Devin | Cognition | Part of a broader autonomous engineering workflow. Terminal interface for repo, commands, failures, and patches.
Goose | Block | Block's open-source local agent. Strong example of the harness becoming as important as the underlying model.
Auggie | Augment Code | Codebase-aware assistance in the terminal. Valuable for large repos where context retrieval is the hard part.
Autohand Code | Autohand | Autonomous code changes from the CLI. Small, reviewable patches without a separate development environment.
Charm | Charm | Terminal-native AI with Charm UX focus. Best CLI agents must feel natural in the shell, not like a web app squeezed in.
Cline | Cline | Started as editor-integrated; CLI extends plan→act→observe→repair cycles to terminal-first usage.
Codebuff | Codebuff | Fast repository edits. Competing on low-friction setup, fast context gathering, and clean diffs.
Command Code | Command Code | Natural language to terminal and code actions. Judged by whether it keeps execution safe while making progress.
Continue | Continue | Emphasizes open customization: configurable models, reusable context, workflows spanning editor and terminal.
Droid | Factory | Targets the software-engineering loop directly: understand task, act in repo, run commands, report a validated result.
Kilocode | Kilo | Extends the agentic coding ecosystem with CLI access. Terminal-first coding is the common denominator across tools.
Kimi | Moonshot | Moonshot's Kimi models for developer workflows. Key for teams tracking non-US ecosystems and long-context coding.
Kiro | Kiro | Spec-driven workflow: agents operate against requirements, not just ad hoc prompts.
Mistral Vibe | Mistral | Mistral models in a CLI coding loop. Part of the European and open-model push into practical software agents.
Qwen Code | Alibaba | Alibaba's Qwen capabilities via terminal. Most important entry for evaluating open-weight and China-origin model stacks.
Rovo Dev | Atlassian | Atlassian's advantage is ecosystem context: Jira, Confluence, Bitbucket, and the enterprise dev graph.

CLI coding agents are no longer a novelty layer on top of the editor. They are becoming the operational interface for software work: reading repositories, installing dependencies, editing files, running tests, debugging build failures, configuring services, and shipping patches from the same terminal where engineers already live.

The reason this category suddenly feels measurable is Terminal-Bench 2.1. As of June 2026, it is the benchmark the community points to when comparing terminal-native agents because it tests the thing that actually matters: can the agent use the shell to finish real work?

Why Terminal-Bench 2.1 Matters

Terminal-Bench 2.1 evaluates agents inside realistic terminal environments. Instead of asking a model to answer coding questions in isolation, it asks the full agent loop to complete tasks that look like day-to-day engineering work:

  • Package management: installing, upgrading, pinning, and repairing dependencies.
  • Build debugging: interpreting compiler output, fixing configuration, and rerunning checks.
  • Server configuration: editing runtime settings, validating ports, and confirming services boot.
  • Git operations: inspecting diffs, managing branches, applying patches, and keeping history clean.
  • Filesystem navigation: finding the right files, understanding project layout, and avoiding unrelated changes.
  • Test-driven repair: running targeted tests, reading failures, and validating fixes.

That is why Terminal-Bench 2.1 became the accepted anchor point. It rewards agents that can operate in a messy terminal, not just agents that can write a beautiful function in a chat box.

The June 2026 Race

The Terminal-Bench 2.1 leaderboard shows a tight race at the top. The exact ranks change as vendors update models, harnesses, tools, and safety policies, but the pattern is clear: the best CLI agents are converging around a shared operating model.

They all need five capabilities:

  1. Shell competence: understand command output, exit codes, processes, permissions, and environment state.
  2. Repository awareness: map files, symbols, tests, package scripts, and framework conventions before editing.
  3. Tool discipline: choose the smallest useful command, avoid destructive operations, and recover gracefully.
  4. Patch quality: make focused changes that match project style and do not create collateral damage.
  5. Verification loop: run the right checks, explain remaining risk, and stop only when the task is actually done.
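The first and last capabilities in the list above (reading exit codes, and running checks until the task is actually done) can be sketched in a few lines. This is a minimal illustration, not how any particular agent implements its loop: the names CheckResult, run_check, and verify are hypothetical, and the commands in the demo are stand-ins for a project's real test and lint commands.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CheckResult:
    command: list[str]
    exit_code: int
    output: str

    @property
    def passed(self) -> bool:
        # By shell convention, exit code 0 means success.
        return self.exit_code == 0


def run_check(command: list[str], timeout: float = 60.0) -> CheckResult:
    """Run one verification command and capture its exit code and output."""
    proc = subprocess.run(
        command,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return CheckResult(command, proc.returncode, proc.stdout + proc.stderr)


def verify(checks: list[list[str]]) -> list[CheckResult]:
    """Run checks in order and stop at the first failure."""
    results = []
    for command in checks:
        result = run_check(command)
        results.append(result)
        if not result.passed:
            break  # Later checks are skipped until this one is fixed.
    return results


if __name__ == "__main__":
    # Stand-in checks (assumption): replace with real test/lint commands.
    demo_checks = [
        [sys.executable, "-c", "print('tests ok')"],
        [sys.executable, "-c", "import sys; sys.exit(1)"],
        [sys.executable, "-c", "print('never reached')"],
    ]
    for r in verify(demo_checks):
        print("PASS" if r.passed else "FAIL", r.exit_code, r.output.strip())
```

Stopping at the first failing check keeps the feedback focused on one problem at a time, which is one reasonable reading of "stop only when the task is actually done".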

The benchmark matters because these are not vibes. They are observable behaviors.

Reading the Leaderboard: Harness vs. Model

Each entry on the Terminal-Bench leaderboard has two parts: the Agent (the orchestration harness) and the Model (the underlying LLM).[¹] That distinction is what makes the results actionable rather than just interesting.

Separating the harness from the brain

When Anthropic releases a new model, the leaderboard answers a question that a single number cannot: is the model inherently better at terminal tasks, or is Claude Code's proprietary CLI wrapper doing the heavy lifting through hidden prompt engineering, auto-retries, and custom tool definitions?[²]

Terminus 2 as the control group

Terminus 2 is the open-source, bare-bones reference agent built by the Terminal-Bench researchers. It acts as a neutral control group — a standard, unopinionated bridge between any model and a terminal.[²] Running the same model through both its proprietary harness and the neutral Terminus 2 harness directly measures how much value the wrapper adds:

Model | Harness | Score
Claude 5 Fable | Claude Code | 83.1%
Claude 5 Fable | Terminus 2 | 80.4%
GPT-5.5 | Codex CLI | 83.4%
GPT-5.5 | Terminus 2 | 78.2%

Anthropic's Claude Code orchestration layer squeezes an extra 2.7 percentage points out of the same underlying model compared to a bare-bones implementation. OpenAI's Codex CLI gains 5.2 points over the Terminus 2 baseline on GPT-5.5.
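The harness gap is simple subtraction, and the same calculation works for any model that appears on the leaderboard under both its vendor harness and Terminus 2. A small sketch using only the four scores listed above; the function name harness_uplift and the data layout are illustrative choices, not part of Terminal-Bench.

```python
from decimal import Decimal

# Scores copied from the four leaderboard rows above (percent of tasks solved).
# Decimal avoids float rounding noise when subtracting one-decimal percentages.
SCORES = {
    ("Claude 5 Fable", "Claude Code"): Decimal("83.1"),
    ("Claude 5 Fable", "Terminus 2"): Decimal("80.4"),
    ("GPT-5.5", "Codex CLI"): Decimal("83.4"),
    ("GPT-5.5", "Terminus 2"): Decimal("78.2"),
}

BASELINE = "Terminus 2"


def harness_uplift(scores, baseline=BASELINE):
    """Return {(model, harness): points gained over the baseline harness}."""
    uplift = {}
    for (model, harness), score in scores.items():
        if harness == baseline:
            continue
        uplift[(model, harness)] = score - scores[(model, baseline)]
    return uplift


for (model, harness), points in harness_uplift(SCORES).items():
    print(f"{model} via {harness}: +{points} points over {BASELINE}")
# Claude 5 Fable via Claude Code: +2.7 points over Terminus 2
# GPT-5.5 via Codex CLI: +5.2 points over Terminus 2
```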

Why the harness gap matters

Historically, many proprietary harnesses actually performed worse than Terminus 2 — heavy abstractions confused the models and increased failure rates.[²] The current leaderboard shows that both OpenAI and Anthropic have crossed that inflection point: their CLI wrappers now genuinely enhance foundation model performance rather than getting in the way.

This means the "best CLI agent" question has two answers. If you control the model choice, the official harness from the model's own vendor will likely outperform a custom wrapper. If you need model-agnostic flexibility, Terminus 2's results show how much raw capability each model brings on its own.

What the Leaderboard Does Not Tell You

Terminal-Bench 2.1 is the right anchor, but it is not the only buying criterion. A CLI agent can score well and still be wrong for a team if it fails your local constraints.

Evaluate every agent against your own workflow:

  • Repo fit: does it understand your language, framework, monorepo layout, and test commands?
  • Permission model: can it ask before risky commands and avoid destructive changes?
  • Model control: can you choose the model, provider, context size, and cost profile?
  • Tooling support: does it work with Git, package managers, containers, cloud CLIs, and MCP servers?
  • Reviewability: does it produce small diffs and clear explanations?
  • Enterprise needs: does it support logging, policy, secrets handling, and audit trails?
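One way to make the checklist above comparable across agents is a weighted scorecard. The sketch below is only an illustration: the weights are hypothetical and should reflect a team's own priorities, and the 0 to 5 ratings are assumed to come from a hands-on trial in the team's own repository.

```python
# Hypothetical weights in percent (assumption); they must sum to 100.
WEIGHTS = {
    "repo_fit": 25,
    "permission_model": 20,
    "model_control": 10,
    "tooling_support": 15,
    "reviewability": 15,
    "enterprise_needs": 15,
}


def weighted_score(ratings: dict[str, int], weights: dict[str, int] = WEIGHTS) -> float:
    """Combine per-criterion ratings (0 to 5) into one score on the same scale."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings: {sorted(missing)}")
    for name in weights:
        if not 0 <= ratings[name] <= 5:
            raise ValueError(f"rating for {name} must be between 0 and 5")
    return sum(weights[name] * ratings[name] for name in weights) / 100


# Example ratings for an imaginary agent after a one-week trial.
trial = {
    "repo_fit": 4,
    "permission_model": 5,
    "model_control": 2,
    "tooling_support": 3,
    "reviewability": 4,
    "enterprise_needs": 1,
}
print(weighted_score(trial))  # 3.4
```

A single number hides trade-offs, so it is worth keeping the per-criterion ratings next to the total; a hard failure on one criterion (for example, no way to gate risky commands) may rule an agent out regardless of its score.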

The Hybrid Workflow

Top engineering teams are increasingly avoiding the single-agent bet. Instead they route tasks by type, letting each tool do what it is best at.

A practical split for teams running parallel agents across isolated worktrees:

  1. Codex CLI for rapid issue generation, test generation, single-file scripts, and terminal automation — where speed and token cost are paramount.
  2. Claude Code for the heavy lifting: complex PRs, broad system refactoring, and tasks where human-in-the-loop oversight is required to ensure architectural integrity.
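The split above amounts to a small routing table. The sketch below shows one possible shape for it; the task categories mirror the two list items, while the agent labels, the route function, and the worktree_command helper are hypothetical names chosen for illustration. The helper only builds a `git worktree add` argument list and does not run it.

```python
from enum import Enum


class TaskType(Enum):
    ISSUE_GENERATION = "issue_generation"
    TEST_GENERATION = "test_generation"
    SINGLE_FILE_SCRIPT = "single_file_script"
    TERMINAL_AUTOMATION = "terminal_automation"
    COMPLEX_PR = "complex_pr"
    SYSTEM_REFACTOR = "system_refactor"


# Agent labels are hypothetical identifiers, not real CLI invocations.
FAST_AGENT = "codex-cli"
DEEP_AGENT = "claude-code"

# Tasks where speed and token cost matter most, per the first list item.
FAST_TASKS = {
    TaskType.ISSUE_GENERATION,
    TaskType.TEST_GENERATION,
    TaskType.SINGLE_FILE_SCRIPT,
    TaskType.TERMINAL_AUTOMATION,
}


def route(task_type: TaskType, needs_human_review: bool = False) -> str:
    """Pick an agent label for a task, following the split described above."""
    if needs_human_review:
        # Oversight-heavy work goes to the agent used for complex changes.
        return DEEP_AGENT
    return FAST_AGENT if task_type in FAST_TASKS else DEEP_AGENT


def worktree_command(branch: str, path: str) -> list[str]:
    """Build (but do not run) the git command that gives an agent its own worktree."""
    return ["git", "worktree", "add", "-b", branch, path]


print(route(TaskType.TEST_GENERATION))            # codex-cli
print(route(TaskType.SYSTEM_REFACTOR))            # claude-code
print(worktree_command("agent/fix-1", "../wt/fix-1"))
```

Giving each agent its own worktree and branch keeps parallel runs from overwriting each other's files, which is the point of the isolated-worktree setup mentioned above.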

The benchmarks are consistent with this split. Codex CLI leads narrowly on raw Terminal-Bench accuracy with GPT-5.5 (83.4%, against 83.1% for Claude Code on Claude 5 Fable), while Claude Code's harness stays closer to its underlying model's baseline (a 2.7-point gap over Terminus 2, against 5.2 points for Codex CLI). Neither is universally better; they are optimized for different parts of the engineering loop.

The Takeaway

The CLI agent market is now real because the work is measurable. Terminal-Bench 2.1 gave the community a shared way to compare terminal mastery, and the agents above show how quickly the category is expanding.

The winner will not be the tool with the flashiest demo. It will be the agent that can sit in a real repo, use the terminal safely, understand the project, make the smallest correct change, run the right verification, and hand you a clean diff.

That is the standard now: if it runs in a terminal, it has to prove itself in the terminal.


[¹]: Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks. https://arxiv.org/html/2601.11868v1

[²]: Codex CLI Stands Out as Only Agent Beating Default Terminus 2 Benchmark. https://aigazette.com/llms/codex-cli-stands-out-as-only-agent-beating-default-terminus-2-benchmark--a