Real-Time AI Agents: OpenAI's WebSocket Shortcut to Speed
If you’ve built anything with an LLM agent in the last two years, you’ve lived this problem: your clever multi-step AI loop spends more time on network round-trips and re-uploading context than it does on actual reasoning. OpenAI’s latest engineering post on WebSockets in the Responses API is a direct attack on that friction — and it’s more interesting than the quietly technical headline suggests.
This isn’t a model release. There’s no new benchmark to argue about. What OpenAI published is a detailed look at how they rebuilt the plumbing underneath Codex, their cloud coding agent, to make the whole thing dramatically less wasteful. The payoff is real, the approach is smart, and it signals where the serious infrastructure competition in AI is heading.
What Actually Changed
The old model for talking to a language model API is pure HTTP: you send a request, you get a response, connection closes. For a chatbot, this is fine. For an agent running dozens of sequential tool calls — read file, edit file, run tests, check output, repeat — it’s a disaster. Every single loop iteration means:
- Re-establishing an HTTP connection (unless keep-alive happens to reuse one)
- Re-transmitting your full system prompt
- Re-transmitting the accumulated conversation history
- Waiting for the server to re-process all of that before generating even the first token
With the Responses API’s WebSocket mode, the connection stays open across the entire agent session. That alone helps with connection overhead, but the more important piece is connection-scoped caching. OpenAI can now keep the KV cache (the internal representation of already-processed tokens) alive on the server side for the duration of your WebSocket connection. When your next agent step fires, the model doesn’t start from scratch — it picks up where it left off.
The result for Codex: meaningfully faster time-to-first-token on each iteration, and a substantial reduction in the tokens-processed-per-step cost since repeated context doesn’t need re-computation. The engineering post is careful not to splash specific numbers everywhere, but the directional claim is that this materially changes the economics and responsiveness of long-running agentic sessions.
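To make the re-processing cost concrete, here is a back-of-the-envelope sketch in Python. It is a toy model, not OpenAI's accounting: the function names and the token counts (a 2,000-token system prompt, 500 new tokens per step, ten steps) are made up for illustration, and it ignores output tokens, cache eviction, and how either mode is actually billed.

```python
def tokens_processed_stateless(system_tokens: int, tokens_per_step: int, steps: int) -> int:
    """Stateless HTTP loop: every step re-processes the whole context so far."""
    total = 0
    context = system_tokens
    for _ in range(steps):
        context += tokens_per_step  # new tool output joins the context
        total += context            # the full context is processed again
    return total


def tokens_processed_cached(system_tokens: int, tokens_per_step: int, steps: int) -> int:
    """Idealized cached session: each token is processed once, then reused."""
    return system_tokens + tokens_per_step * steps


# Hypothetical workload, chosen only to show the shape of the curve.
print(tokens_processed_stateless(2000, 500, 10))  # 47500
print(tokens_processed_cached(2000, 500, 10))     # 7000
```

The stateless total grows roughly with the square of the step count, while the cached total grows linearly, which is why the gap widens as agent loops get longer.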
Why the Timing Matters
OpenAI didn’t build this because it sounded cool. They built it because Codex needs it to not feel sluggish.
A coding agent that pauses 2.5 seconds between each tool call, where half a second is network and two seconds is re-processing a 50,000-token context window, feels broken. Users abandon it. That latency compounds fast: a ten-step debugging loop with 2.5 seconds of overhead per step is 25 seconds of dead time just from infrastructure inefficiency. Add actual model thinking time and you’ve burned a minute on something that should feel instant.
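The compounding is simple arithmetic, sketched below in Python. The ten steps and the 2.5 seconds of overhead per step come from the paragraph above. The 3.5 seconds of model thinking time per step is an assumption, picked only so that the total lands on the "minute" mentioned there. The function name is hypothetical.

```python
def loop_time(steps: int, overhead_s: float, thinking_s: float) -> tuple[float, float]:
    """Return (infrastructure dead time, total wall time) in seconds for an agent loop."""
    dead_time = steps * overhead_s
    total = steps * (overhead_s + thinking_s)
    return dead_time, total


# thinking_s=3.5 is an assumed value, not a figure from OpenAI's post.
dead, total = loop_time(steps=10, overhead_s=2.5, thinking_s=3.5)
print(dead)   # 25.0
print(total)  # 60.0
```

Under these assumed numbers, more than a third of the wall time is pure overhead, and it scales with every additional step in the loop.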
WebSockets with connection-scoped caching collapse most of that overhead. This is table stakes for agents that people will actually use in production, and OpenAI is shipping it now because they’re betting that coding agents — not just coding assistants — are the next real battleground.
The Developer Angle
If you’re building on the Responses API, this changes your architectural calculus. Previously, the smart move was to carefully manage context window size to keep costs and latency under control. You’d prune conversation history aggressively, batch operations where possible, and try to minimize round-trips. Some of that still applies, but the cost of a long persistent context within a single session just got significantly cheaper.
What this enables practically:
- Longer agent loops without death-by-latency — agents can maintain richer context about what they’ve already tried without making each iteration painfully slow
- More responsive multi-step tool use — if your agent needs to call 15 tools in sequence, the fixed overhead per call drops dramatically
- Better economics for agentic products — server-side caching means OpenAI isn’t charging you full price to re-process the same system prompt on every call within a session
The catch: you’re now tied to a persistent connection, which means your infrastructure needs to handle WebSocket sessions instead of stateless HTTP. For most serious agentic frameworks, this is a minor plumbing change. For anyone who deployed a bunch of simple serverless functions to call the API, it’s more work. You need something that can hold a connection open for the whole agent session: a persistent process, not a short-lived function that is torn down as soon as its invocation finishes or times out.
How This Stacks Up Against the Competition
Anthropic has had explicit prompt caching in the Claude API for a while now — you annotate specific parts of your context as cacheable, and Anthropic stores them for up to five minutes (or longer with extended caching). It works, developers use it, and it meaningfully cuts costs on repeated system prompts. But it’s manual: you have to think about what to cache, mark it correctly, and manage TTLs.
OpenAI’s connection-scoped approach is more automatic. Within a WebSocket session, the caching is implicit — the connection itself is the cache scope. You don’t have to do anything clever. For developers who want to just plug in and go, that’s a better experience. For developers who want fine-grained control over exactly what gets cached and for how long, Anthropic’s model is more explicit and predictable.
Google has context caching in the Gemini API too, aimed at exactly this problem. Their approach also requires explicit cache creation with a defined TTL, similar to Anthropic’s model.
What makes OpenAI’s framing interesting is that they’re tying this directly to the agent session lifecycle. The WebSocket connection is the session. That’s a cleaner mental model for agent builders than thinking in terms of “I should cache these tokens for four hours.”
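The difference between the two mental models can be sketched as a toy in Python. Neither class is a vendor API: `TTLPrefixCache` and `ConnectionScopedCache` are hypothetical names, the 300-second TTL is just the five-minute figure mentioned above, and real systems cache token prefixes rather than whole strings. The sketch only contrasts where the cache lifetime comes from: an explicit timer versus the life of the connection.

```python
class TTLPrefixCache:
    """Explicit model: the caller marks a prefix as cacheable and it expires after ttl_s."""

    def __init__(self, ttl_s: float) -> None:
        self.ttl_s = ttl_s
        self._expires_at: dict[str, float] = {}

    def put(self, prefix: str, now: float) -> None:
        self._expires_at[prefix] = now + self.ttl_s

    def hit(self, prefix: str, now: float) -> bool:
        return self._expires_at.get(prefix, float("-inf")) > now


class ConnectionScopedCache:
    """Implicit model: anything seen on the connection stays cached until close()."""

    def __init__(self) -> None:
        self._seen: set[str] = set()
        self._open = True

    def process(self, chunk: str) -> bool:
        """Return True on a cache hit; otherwise cache the chunk and return False."""
        if not self._open:
            raise RuntimeError("connection closed")
        hit = chunk in self._seen
        self._seen.add(chunk)
        return hit

    def close(self) -> None:
        self._seen.clear()  # the cache lives and dies with the connection
        self._open = False


ttl = TTLPrefixCache(ttl_s=300)
ttl.put("system prompt", now=0)
print(ttl.hit("system prompt", now=200))  # True
print(ttl.hit("system prompt", now=400))  # False

conn = ConnectionScopedCache()
print(conn.process("system prompt"))  # False
print(conn.process("system prompt"))  # True
conn.close()
```

In the explicit model the developer decides what to store and has to reason about clock time. In the connection-scoped model there is nothing to annotate, but dropping the connection drops the cache with it.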
No one has completely solved the problem — agents still re-process more than they should, pricing models don’t fully reflect computational reuse, and the cross-session case (caching your system prompt across different user conversations) remains expensive to implement well. But OpenAI is making a credible engineering push on the within-session case, which is where most of the agentic pain actually lives.
The Bigger Picture
There’s a shift happening in how AI companies compete. The frontier model race isn’t going away — GPT-5, Claude 4, Gemini Ultra, whatever comes next — but infrastructure quality is becoming an increasingly important differentiator. Developers don’t just pick the smartest model anymore; they pick the one they can actually build on reliably and economically.
OpenAI’s WebSocket work is infrastructure competition. It’s boring to write press releases about, which is probably why it got a technical blog post instead of a keynote. But for anyone building production agents, it’s more immediately useful than a 3% benchmark improvement on MMLU.
The subtext of this whole announcement is also that Codex’s architecture — the system OpenAI is using for their own flagship coding agent — is now more exposed to third-party developers. OpenAI is eating their own cooking and then publishing the recipe. That’s usually a good sign that the technology is mature enough to trust.
Verdict
This is a solid, practical engineering improvement that addresses a real pain point. It’s not a paradigm shift — WebSockets aren’t new, and caching repeated context is an obvious optimization in retrospect. But “obvious in retrospect” is how most good infrastructure improvements work. The implementation details matter, and OpenAI’s connection-scoped model is an intelligently designed abstraction.
For developers building agentic products today: this is worth migrating to if you’re running long agent loops with the Responses API. The latency and cost benefits are real. Just make sure your infrastructure can actually maintain persistent WebSocket connections before you commit.
For the broader AI industry: watch for this pattern to spread. Google and Anthropic will both respond with more automatic, session-aware caching models over the next several months. The war for developer infrastructure is heating up, and latency-in-the-agent-loop is the next hill everyone’s going to fight over.
OpenAI didn’t announce anything flashy here. They just made their agents faster and cheaper to run. In the long run, that matters more than most things that come with a press release.