Codex CLI’s “harness” runs a loop of Responses API inference + tool calls, while optimizing for prompt caching and context-window limits. It keeps requests stateless for ZDR compatibility and uses compaction to keep conversations usable.
What actually happened
OpenAI detailed the agent loop behind Codex CLI and how it orchestrates user, model, and tools.
Codex builds a structured Responses API request (instructions, tools, input) rather than a raw prompt.
Each tool-call result is appended to the input, so the prior prompt is an exact prefix of the next request (see the loop sketch below).
Codex avoids previous_response_id to keep calls stateless and support Zero Data Retention.
When token growth threatens the context window, Codex compacts conversation state.
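A minimal sketch of that loop, assuming the OpenAI Python SDK; the model name, the single run_shell tool, and its handler are illustrative stand-ins rather than Codex's actual configuration:

```python
import json
import subprocess

from openai import OpenAI

client = OpenAI()

# Static parts of the request stay identical across turns so the prompt prefix
# never changes.
INSTRUCTIONS = "You are a coding agent. Use tools to inspect and edit the repo."
TOOLS = [{
    "type": "function",
    "name": "run_shell",  # hypothetical tool, for illustration only
    "description": "Run a shell command and return its combined output.",
    "parameters": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}]

def run_shell(command: str) -> str:
    # Codex sandboxes and approves commands; this sketch just runs them.
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

# The conversation is only ever appended to, so each request's prompt is an
# exact prefix of the next one -- which is what prompt caching rewards.
input_items = [{"role": "user", "content": "Fix the failing unit test."}]

while True:
    response = client.responses.create(
        model="gpt-5",        # illustrative model choice
        instructions=INSTRUCTIONS,
        tools=TOOLS,
        input=input_items,
        store=False,          # stateless: no previous_response_id, ZDR-compatible
    )
    input_items += response.output  # append the model's output items verbatim
    calls = [item for item in response.output if item.type == "function_call"]
    if not calls:
        break  # no tool calls left: the turn is complete
    for call in calls:
        args = json.loads(call.arguments)
        input_items.append({
            "type": "function_call_output",
            "call_id": call.call_id,
            "output": run_shell(**args),
        })
```

The property that matters is that nothing already in input_items is ever rewritten; every turn only appends.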
Key numbers
User-instruction aggregation is limited to 32 KiB by default.
--oss mode supports ollama 0.13.4+.
--oss mode supports LM Studio 0.3.39+.
Why this was hard
Conversation history and tool outputs make prompts grow every turn, stressing context windows.
Prompt growth can make cumulative request payloads roughly quadratic over long conversations (see the arithmetic sketch below).
Prompt caching requires exact prefix matches; small mid-thread changes can cause cache misses.
Tool ecosystems (e.g., MCP) can change available tools dynamically, breaking caching assumptions.
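The "quadratic" claim is just arithmetic; a toy calculation, with the per-turn token count invented for illustration:

```python
# If each turn appends roughly t new tokens, turn n resends all n*t tokens of
# history, so a conversation of n turns transmits about t*n*(n+1)/2 tokens in
# total -- quadratic in n, even though each individual turn grows linearly.
def total_tokens_sent(n_turns: int, tokens_per_turn: int) -> int:
    return sum(tokens_per_turn * turn for turn in range(1, n_turns + 1))

print(total_tokens_sent(50, 2_000))   # 2,550,000 tokens sent across 50 turns
print(total_tokens_sent(100, 2_000))  # 10,100,000 -- double the turns, ~4x the tokens
```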
How they solved it
Send inference via the Responses API over HTTP and consume results as an SSE event stream.
Reuse the prior prompt as an exact prefix by appending new events/tool results to input.
Optimize for caching by avoiding edits to earlier prompt items during a conversation.
Treat configuration changes as new messages (e.g., new developer/user message) rather than mutations.
Keep requests stateless (no previous_response_id), aligning with ZDR constraints.
Automatically compact state via the /responses/compact endpoint once auto_compact_limit is exceeded (trigger logic sketched below).
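A rough sketch of that trigger. The endpoint name and the auto_compact_limit setting come from the post; the threshold value, the token estimate, and the request/response shapes are assumptions for illustration:

```python
import os
import httpx

AUTO_COMPACT_LIMIT = 200_000  # illustrative threshold, not Codex's default

def estimate_tokens(items) -> int:
    # Crude proxy; a real loop would use the token usage reported by the API.
    return sum(len(str(item)) for item in items) // 4

def maybe_compact(input_items):
    """If the conversation is nearing the context window, swap it for a
    compacted form. Codex does this via the /responses/compact endpoint;
    the payload and response shapes below are assumed, not documented."""
    if estimate_tokens(input_items) < AUTO_COMPACT_LIMIT:
        return input_items
    resp = httpx.post(
        "https://api.openai.com/v1/responses/compact",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={"input": input_items},      # assumed request shape
        timeout=60,
    )
    resp.raise_for_status()
    # Per the post, the compacted output includes encrypted_content, which
    # preserves the model's latent understanding of the conversation while
    # shrinking the prompt that later requests must resend.
    return resp.json()["output"]          # assumed response shape
```

Compaction replaces the prefix, so the next request takes a one-time cache miss in exchange for a much smaller payload on every turn after it.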
What changed
Compaction moved from a manual /compact command to automatic use of /responses/compact.
Codex preserves latent conversation understanding via compaction output that includes encrypted_content.
Why this matters beyond this company
If you want prompt caching, structure your agent loop so earlier prompt segments never change.
“Stateless by default” simplifies infrastructure and aligns with retention constraints, but grows payloads.
Plan for tool lists to be unstable; ordering and mid-thread tool changes can have real performance cost.
Stealable ideas
Append new context messages for config changes instead of rewriting earlier prompt items (see the sketch after this list).
Keep static instructions/tools at the start; push variable content to the end.
Use compaction once token thresholds are exceeded rather than letting threads hit hard limits.
Treat SSE events as first-class inputs for the next inference call.
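Continuing the loop sketch above, the append-don't-mutate pattern for a mid-conversation configuration change; the message wording and the use of the developer role are illustrative:

```python
# Cache-breaking: editing an item already in the prompt changes the shared
# prefix, so cached tokens after the edit point are recomputed on the next call.
# input_items[0]["content"] = "Prefer TypeScript for new files."   # don't do this

# Cache-friendly: express the change as a new message appended at the end,
# leaving every earlier prompt item byte-for-byte identical.
input_items.append({
    "role": "developer",
    "content": "Configuration update: prefer TypeScript for new files.",
})
```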