<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[HelixML]]></title><description><![CDATA[Helix brings the best of open source AI to your business]]></description><link>https://blog.helix.ml</link><image><url>https://substackcdn.com/image/fetch/$s_!uVK-!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6ac6823-53fa-4485-b35d-65c2770f5cb8_1280x1280.png</url><title>HelixML</title><link>https://blog.helix.ml</link></image><generator>Substack</generator><lastBuildDate>Tue, 14 Apr 2026 19:59:31 GMT</lastBuildDate><atom:link href="https://blog.helix.ml/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Luke Marsden]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[helixml@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[helixml@substack.com]]></itunes:email><itunes:name><![CDATA[Luke Marsden]]></itunes:name></itunes:owner><itunes:author><![CDATA[Luke Marsden]]></itunes:author><googleplay:owner><![CDATA[helixml@substack.com]]></googleplay:owner><googleplay:email><![CDATA[helixml@substack.com]]></googleplay:email><googleplay:author><![CDATA[Luke Marsden]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[I benchmarked two approaches to code indexing for Kodit (which powers Helix Code Intelligence). 
The smarter one lost.]]></title><description><![CDATA[Read the full post on the Helix blog: https://helix.ml/blog/chunking-beats-slicing]]></description><link>https://blog.helix.ml/p/i-benchmarked-two-approaches-to-code</link><guid isPermaLink="false">https://blog.helix.ml/p/i-benchmarked-two-approaches-to-code</guid><dc:creator><![CDATA[Phil Winder]]></dc:creator><pubDate>Tue, 17 Mar 2026 17:26:20 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!uVK-!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6ac6823-53fa-4485-b35d-65c2770f5cb8_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Read the full post on the Helix blog: <a href="https://helix.ml/blog/chunking-beats-slicing">https://helix.ml/blog/chunking-beats-slicing</a></p><p>I benchmarked two approaches to code indexing for Kodit (which powers Helix Code Intelligence). The smarter one lost. The "smarter" approach was program slicing &#8212; using a syntax tree to extract self-contained, structurally coherent code snippets rather than just cutting text into chunks. The theory was solid: slices capture real code structure, preserve function boundaries, include relevant dependencies. A basic RAG chunk might split straight through a critical function definition. I ran both against SWE-Bench Verified using mini-SWE-agent. Three conditions: a clean baseline (no Kodit), Kodit with slicing, Kodit with chunking.</p><pre><code>----------------------------------------------------------------------
Metric                 Baseline  Kodit (slicing)  Kodit (chunking)
----------------------------------------------------------------------
Instances evaluated          25               25                25
Resolved (passed)            12               11                15
Resolve rate                 48%              46%               60% </code></pre><p>Chunking won by 14 points. Slicing came in <em>below</em> the baseline &#8212; it wasn't just not helping, it was actively getting in the way. Why? It comes down to how LLMs are actually trained. They're optimised to read files and write files. Program slices aren't files &#8212; they're synthetic constructs that don't map onto how the model processes information. Handing an LLM a syntax tree is like handing someone a book's index and expecting a book report. There's more to it than that, including caveats on sample size, what this means for Kodit's architecture going forward, and what the full 500-instance SWE-Bench run might show. <br>Full post on the Helix blog: <a href="https://helix.ml/blog/chunking-beats-slicing">https://helix.ml/blog/chunking-beats-slicing</a></p>]]></content:encoded></item><item><title><![CDATA[Why Benchmarking AI Code Tools Is Harder Than You Think]]></title><description><![CDATA[Standard AI benchmarks are not fit for purpose. 
Here's what you need to know.]]></description><link>https://blog.helix.ml/p/why-benchmarking-ai-code-tools-is</link><guid isPermaLink="false">https://blog.helix.ml/p/why-benchmarking-ai-code-tools-is</guid><dc:creator><![CDATA[Phil Winder]]></dc:creator><pubDate>Thu, 05 Mar 2026 14:02:15 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!7ecm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61104392-847e-41fc-9cd0-226da06d1d66_2872x1628.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>AI coding assistants are everywhere. Claude Code, Codex, Cursor, etc. Everyone wants to know which is &#8220;best&#8221;. You&#8217;ll find an infinite array of opinions and a thousand AI-generated &#8220;hot takes&#8221; that are neither hot nor takes; they only take (the piss).</p><p>The natural instinct is to look at leaderboards. Some poor soul, somewhere, has taken the time to attempt to robustly benchmark these tools. I sincerely thank them for the effort because I appreciate how hard it is to do this well.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.helix.ml/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading HelixML! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7ecm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61104392-847e-41fc-9cd0-226da06d1d66_2872x1628.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7ecm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61104392-847e-41fc-9cd0-226da06d1d66_2872x1628.png 424w, https://substackcdn.com/image/fetch/$s_!7ecm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61104392-847e-41fc-9cd0-226da06d1d66_2872x1628.png 848w, https://substackcdn.com/image/fetch/$s_!7ecm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61104392-847e-41fc-9cd0-226da06d1d66_2872x1628.png 1272w, https://substackcdn.com/image/fetch/$s_!7ecm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61104392-847e-41fc-9cd0-226da06d1d66_2872x1628.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7ecm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61104392-847e-41fc-9cd0-226da06d1d66_2872x1628.png" 
width="1456" height="825" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/61104392-847e-41fc-9cd0-226da06d1d66_2872x1628.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:825,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:362673,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.helix.ml/i/189260290?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61104392-847e-41fc-9cd0-226da06d1d66_2872x1628.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7ecm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61104392-847e-41fc-9cd0-226da06d1d66_2872x1628.png 424w, https://substackcdn.com/image/fetch/$s_!7ecm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61104392-847e-41fc-9cd0-226da06d1d66_2872x1628.png 848w, https://substackcdn.com/image/fetch/$s_!7ecm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61104392-847e-41fc-9cd0-226da06d1d66_2872x1628.png 1272w, https://substackcdn.com/image/fetch/$s_!7ecm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61104392-847e-41fc-9cd0-226da06d1d66_2872x1628.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h2><strong>The Problem With Traditional Benchmarks</strong></h2><p>In general, the benchmarks are created under two high-level remits. The first is an academic exercise to find the state of the art. Academics then use these results to guide their research. This is a good thing and they should continue to do that. But these benchmarks are not representative of real-world scenarios. The second is a task-specific exercise where model or algorithm developers attempt to produce directional metrics that correlate with downstream performance. Again, as a long-term ML and AI practitioner, I appreciate the need to simplify problems to metrics that we can directly optimise for. 
But, again, these benchmarks are not representative of real-world scenarios.</p><p>All RAG benchmarks are based upon one-shot retrieval tasks, evaluated in isolation for retrieval accuracy. Nearly all coding benchmarks are based upon one-shot patch generation tasks, evaluated in isolation for patch correctness.</p><p>This isn&#8217;t how AI coding agents actually work.</p><p>Coding assistants work through trial and error. Much like in reinforcement learning, they explore their environment and anticipate the goals of the developer. They often make mistakes. These are sometimes caught by automated analysis (e.g. linting, tests, etc.). Sometimes they are not and need to be manually corrected. We can include base knowledge (e.g. CLAUDE.md) or external knowledge (e.g. a web search) to help the agent. All of these permutations aren&#8217;t tested, all of the time, by any of the benchmarks. This is a problem.</p><h2><strong>Modern Coding Benchmarks</strong></h2><p>HumanEval and SWE-bench are the two most popular coding benchmarks that are touted by every vendor.</p><p><a href="https://github.com/openai/human-eval">HumanEval</a> is probably the worst. Created by OpenAI in 2021, it consists of a function signature and a docstring describing what the function should do. It also contains a hidden set of unit tests that evaluate the correctness of the function. Ignoring the fact that these examples are now in every model&#8217;s training data, the main issue is that it&#8217;s a one-shot generation test. It&#8217;s the same style of single-response evaluation as machine translation&#8217;s BLEU metric, conceived <a href="https://aclanthology.org/P02-1040.pdf">way back in 2002</a>.</p><p>Aside: this led to the best-named metric on the market, <a href="https://github.com/mjpost/sacrebleu">SacreBLEU</a>, which is independent of tokenisation.</p><p>In 2023, Princeton researchers released <a href="https://github.com/swe-bench/SWE-bench">SWE-bench</a> (OpenAI subsequently produced the human-validated SWE-bench Verified subset). 
It represented an important step up from HumanEval by drawing real-life examples from real pull requests. Each instance is codified as the commit just prior to the fix of the issue. The agent is given the issue description and access to the repository at that point in time. Test cases are again used to check the correctness of the patch. For reference, initial basic one-shot RAG approaches achieved just 2% success. (Granted, this was Claude 2 and BM25 at the time...)</p><p>You&#8217;d think that would be the end of the story, because this almost represents what agents are doing in real life. But no.</p><p>The first problem is that OpenAI found that a whopping <a href="https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/">59% of tests in a sample have &#8220;flawed&#8221; test cases</a> that reject functionally correct patches. They also note that more recent models have (in)advertently learned to overfit the benchmarks, predicting the correct patch irrespective of the prompt, akin to <a href="https://www.bbc.com/news/business-34324772">Volkswagen changing emission profiles when it detected it was being tested</a>.</p><blockquote><p>Given a short snippet from the task description, GPT&#8209;5.2 outputs the exact gold patch. In particular, it knows the exact class and method name, and the new early return condition <code>if username is None or password is None</code> that is introduced.</p></blockquote><p>The second problem, and this is less relevant to model developers like OpenAI, is that Kodit allows the coding assistant to search for relevant work from other external resources and codebases. Kodit is not restricted to <em>only</em> searching the codebase under test. It can learn from others. This is a critical advancement in the enterprise domain where developers are often working across multiple codebases at the same time. 
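</p><p>To make that concrete, here is a deliberately toy Python sketch of searching one index built across several repositories. The repo names, snippets, and keyword-overlap scoring are all invented for illustration; Kodit&#8217;s real retrieval pipeline is much richer than this.</p><pre><code># Toy cross-repo search: one index spans snippets from several repositories,
# so a query from any project can surface code written in another.
# Everything here (names, scoring) is illustrative, not Kodit's implementation.
index = [
    {"repo": "billing-service", "snippet": "def authenticate(user, token): ..."},
    {"repo": "admin-portal", "snippet": "def render_dashboard(ctx): ..."},
]

def search(query, entries, top_k=1):
    terms = set(query.lower().split())
    def score(entry):
        words = set(entry["snippet"].lower().replace("(", " ").replace(")", " ").split())
        return len(terms.intersection(words))
    return sorted(entries, key=score, reverse=True)[:top_k]

# An agent working in admin-portal still finds billing-service's auth code.
hits = search("authenticate token", index)
print(hits[0]["repo"])  # billing-service
</code></pre><p>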
An authentication implementation in one repo is likely very useful for another.</p><p>A final problem I have with nearly all benchmarks is that they are self-contained. In my experience, most coding tasks involve another library, framework, or system. None of these benchmarks ever say &#8220;add a new table to my SQLAlchemy application&#8221;, or &#8220;update the frontend to show the information in the new API&#8221;. They&#8217;re always leet-code style &#8220;implement quicksort&#8221; tasks: self-contained, using the base language only. And they&#8217;re often only in Python!</p><h2><strong>Why Kodit is Hard to Benchmark</strong></h2><p>Kodit is a multi-turn, multi-tool, multi-context assistant to a coding assistant. It&#8217;s hard enough to say, let alone benchmark! Kodit indexes external codebases to provide relevant context to any coding task. Exposed as an MCP server, it can be used by any coding assistant that supports the MCP protocol. In addition, it generates enrichments meant more for human consumption to help explain the inner workings of a codebase.</p><p>Given this flexibility, traditional information retrieval metrics don&#8217;t capture whether the context actually improved the solution. Success is measured downstream of Kodit, at the end of the coding task. So the question isn&#8217;t &#8220;did it find the right snippet&#8221; but instead should be &#8220;did this snippet lead to better code?&#8221; This means you need an end-to-end evaluation, more like SWE-bench, but with global context.</p><p>The next problem is one I&#8217;ve observed. I have seen situations where I know a quick Kodit lookup would help the assistant, but the coding assistant decided not to use it. It chose to search the web instead. Or worse, it just started writing code. In most cases I have to hack around this by telling the agent, in no uncertain terms, to use Kodit. Threats work well. But it&#8217;s tedious. 
Equally, I&#8217;ve seen coding assistants search for the wrong thing and go down a wasteful path.</p><p>So in the end, the &#8220;performance&#8221; of Kodit is often less about what it is able to do, but more about how well the agent can use it.</p><p>This realisation has led me to an important conclusion that I need to make Kodit simpler, more focussed, less smart. I am now actively working on simplifying the MCP interface and the internal search implementation.</p><h2><strong>What Does a Good Benchmark Look Like?</strong></h2><p>I am using SWE-bench verified to test and evaluate Kodit. Using the canonical SWE-bench coding agent, <a href="https://github.com/SWE-agent/mini-swe-agent">mini-swe-agent</a>, I created a wrapper that adds Kodit as an attached MCP server and compared it against an agent without Kodit. And a script that indexes the commit under test (so the agent can&#8217;t just search for the correct answer in a subsequent commit). And it works; I&#8217;ll leave the actual metrics for another day. But it&#8217;s more like an end-to-end test than an evaluation. The agent can&#8217;t take advantage of Kodit&#8217;s key selling point: leveraging information from other codebases.</p><p>If anyone fancies a bit of light torture and wants to implement a benchmark themselves, then a good one would look like this:</p><ul><li><p>End-to-end measurement of final code quality. Both functionally and non-functionally.</p></li><li><p>Multi-turn aware. Captures and evaluates the full agent trajectory, not just the final patch.</p></li><li><p>Able to compare with and without external context augmentation.</p></li><li><p>Accounts for cost or the number of tokens used.</p></li><li><p>Has realistic challenges. Not just bug fixes, but new features, framework and language migrations, version upgrades, integration with external systems, usage of popular external libraries, etc.</p></li><li><p>All the languages, not just Python!</p></li><li><p>Resistant to contamination. 
Uses private or freshly-created repos the model hasn&#8217;t seen.</p></li></ul><h2><strong>Why Now</strong></h2><p>AI coding tools have now moved on from auto-complete. We seem to have skipped merrily through auto-assist and are already smack bang in the middle of auto-management. But we have no way to know how well these tools perform.</p><p>For Kodit, it&#8217;s hard for me to explain to my users by how much Kodit improves the coding assistant. Through experience I know it&#8217;s positive. Via demos I can see it working where it failed before. But it&#8217;s still incredibly hard to quantify.</p><p>But I&#8217;m actively working on this. Future posts will share more concrete results and learnings. For now, the main point is: be wary of the leaderboards and the opinions.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.helix.ml/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading HelixML! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[What a control room for AI coding agents actually looks like]]></title><description><![CDATA[Most teams run one AI coding agent at a time on a developer&#8217;s laptop. 
Helix gives each agent its own GPU-accelerated desktop, then lets you orchestrate dozens of them in parallel]]></description><link>https://blog.helix.ml/p/what-a-control-room-for-ai-coding</link><guid isPermaLink="false">https://blog.helix.ml/p/what-a-control-room-for-ai-coding</guid><dc:creator><![CDATA[Priya Samuel]]></dc:creator><pubDate>Tue, 03 Mar 2026 14:07:55 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Fdv2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeb019f4-7184-4d92-af61-7952e3699c06_3000x2000.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Fdv2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeb019f4-7184-4d92-af61-7952e3699c06_3000x2000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Fdv2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeb019f4-7184-4d92-af61-7952e3699c06_3000x2000.png 424w, https://substackcdn.com/image/fetch/$s_!Fdv2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeb019f4-7184-4d92-af61-7952e3699c06_3000x2000.png 848w, https://substackcdn.com/image/fetch/$s_!Fdv2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeb019f4-7184-4d92-af61-7952e3699c06_3000x2000.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Fdv2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeb019f4-7184-4d92-af61-7952e3699c06_3000x2000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Fdv2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeb019f4-7184-4d92-af61-7952e3699c06_3000x2000.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/beb019f4-7184-4d92-af61-7952e3699c06_3000x2000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:9334928,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.helix.ml/i/189240244?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeb019f4-7184-4d92-af61-7952e3699c06_3000x2000.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Fdv2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeb019f4-7184-4d92-af61-7952e3699c06_3000x2000.png 424w, https://substackcdn.com/image/fetch/$s_!Fdv2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeb019f4-7184-4d92-af61-7952e3699c06_3000x2000.png 848w, 
https://substackcdn.com/image/fetch/$s_!Fdv2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeb019f4-7184-4d92-af61-7952e3699c06_3000x2000.png 1272w, https://substackcdn.com/image/fetch/$s_!Fdv2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeb019f4-7184-4d92-af61-7952e3699c06_3000x2000.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><div class="subscription-widget-wrap-editor" 
data-attrs="{&quot;url&quot;:&quot;https://blog.helix.ml/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading HelixML! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>What a control room for AI coding agents actually looks like</h2><p>Picture ten items on your engineering backlog. A new feature. A framework migration. Four security patches. A batch of logging improvements across a dozen repos. You know the shape of every one of them. You could write specs for all of them this afternoon.</p><p>You can&#8217;t build them all this afternoon. Not with one developer. Not even with one very good AI agent.</p><p>Helix changes that equation. Not by making one agent faster, but by giving you a fleet of them, each working in its own GPU-accelerated desktop, coordinated through a Kanban board you can watch in real time.</p><h3>Each agent gets its own computer</h3><p>We covered this architecture in an earlier post, but it&#8217;s worth repeating here because it&#8217;s the foundation everything else builds on.</p><p>Every agent in Helix gets its own isolated desktop environment. Not a container with a language runtime. A full GPU-accelerated Linux desktop running the Zed code editor, a terminal, a browser, and its own filesystem. When you spin up five agents to work on five tasks, they&#8217;re running on five separate desktops. 
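</p><p>As a rough sketch of what &#8220;one isolated desktop per agent&#8221; means operationally, here is an illustrative Python snippet that only builds the container invocations. The image name and volume layout are made up for this example; Helix&#8217;s actual launcher, with its Docker-in-Docker GPU setup, is considerably more involved.</p><pre><code># Illustrative only: construct one isolated "docker run" invocation per task.
# "helix/agent-desktop" is a hypothetical image name, not a real artifact.
def desktop_command(task_id):
    return [
        "docker", "run", "--detach",
        "--gpus", "all",                       # share the host GPU
        "--name", f"agent-desktop-{task_id}",  # one container per task
        "--volume", f"workspace-{task_id}:/home/agent",  # private filesystem
        "helix/agent-desktop",                 # hypothetical desktop image
    ]

# Five tasks, five separate desktops, no shared state between them.
commands = [desktop_command(t) for t in range(5)]
print(len(commands))  # 5
</code></pre><p>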
They can&#8217;t interfere with each other.</p><p>Each desktop appears as a separate machine, but underneath it&#8217;s a high-density Docker-in-Docker (or Docker-in-Kubernetes) setup sharing GPU resources. We did a lot of work on GPU virtualization with virtio-gpu and Vulkan passthrough to make multi-tenant desktops viable on a single physical machine.</p><p>The result is that you can watch your agents work. Literally watch them. You see the code editor, the terminal output, the browser window. When an agent opens Chrome to test the app it just built, you see Chrome open. When it reads an error and goes back to fix the code, you see that too.</p><h3>The Kanban board</h3><p>The orchestration layer is a Kanban board. Columns for backlog, planning, implementation, review, and done. Each card is a task. Each task gets an agent.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Pa0X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07d5eb0b-e037-4980-aa1a-5445869dac4c_2009x986.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Pa0X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07d5eb0b-e037-4980-aa1a-5445869dac4c_2009x986.png 424w, https://substackcdn.com/image/fetch/$s_!Pa0X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07d5eb0b-e037-4980-aa1a-5445869dac4c_2009x986.png 848w, https://substackcdn.com/image/fetch/$s_!Pa0X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07d5eb0b-e037-4980-aa1a-5445869dac4c_2009x986.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Pa0X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07d5eb0b-e037-4980-aa1a-5445869dac4c_2009x986.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Pa0X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07d5eb0b-e037-4980-aa1a-5445869dac4c_2009x986.png" width="1456" height="715" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/07d5eb0b-e037-4980-aa1a-5445869dac4c_2009x986.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:715,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:299724,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.helix.ml/i/189240244?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07d5eb0b-e037-4980-aa1a-5445869dac4c_2009x986.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Pa0X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07d5eb0b-e037-4980-aa1a-5445869dac4c_2009x986.png 424w, https://substackcdn.com/image/fetch/$s_!Pa0X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07d5eb0b-e037-4980-aa1a-5445869dac4c_2009x986.png 848w, https://substackcdn.com/image/fetch/$s_!Pa0X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07d5eb0b-e037-4980-aa1a-5445869dac4c_2009x986.png 
1272w, https://substackcdn.com/image/fetch/$s_!Pa0X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07d5eb0b-e037-4980-aa1a-5445869dac4c_2009x986.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>Move a card into the planning column and the agent spins up a desktop and starts writing a spec. As with Spec Driven development: requirements first, then technical design, followed by an implementation plan (spec). 
The agent writes these documents in <a href="https://zed.dev/">Zed</a>, and you can review them with inline comments, Google Docs style. Leave a comment saying &#8220;what about edge cases for deleted users?&#8221; and the agent responds to your comment and updates the design.</p><p>This is the workflow that&#8217;s changed how we build software internally. You batch up your thinking early. Leave comments on specs across five different tasks. The agents respond and iterate on the designs while you move on to the next review. When a design looks right, you approve it, and the agent shifts into implementation mode. It writes code, runs tests, and opens a pull request.</p><p>The approval of the implementation plan isn&#8217;t ceremonial. When you approve a spec, the agent receives a structured prompt telling it: &#8220;Your design has been approved. You&#8217;re now in the implementation phase.&#8221; File diffs show up in real time. The agent commits code, runs the app, and tests it. You&#8217;re reviewing finished pull requests, not babysitting the work.</p><h3>Agents don&#8217;t talk to each other (on purpose)</h3><p>The obvious question with multiple agents: if one miscommunicates something to another, how do you debug that?</p><p>Our answer is that they don&#8217;t communicate with each other. At all.</p><p>For coding tasks, where an agent needs to hold a coherent plan from spec to implementation, the overhead of multi-agent communication buys you very little and introduces failure modes that are genuinely hard to debug.</p><p>So our agents are intentionally isolated. They coordinate the same way human developers do: through git. When an agent finishes its work and opens a pull request, it merges from main first. If there&#8217;s a conflict, it resolves it. That&#8217;s the coordination mechanism. It&#8217;s boring. It works.</p><p>Maybe one day it&#8217;ll make sense to have two agents pair-programming on the same desktop. 
But right now, isolated agents working in parallel on separate tasks, coordinating through version control, give you the throughput gains without the chaos.</p><h3>Do the work once, apply it everywhere</h3><p>Some of the most valuable engineering work is also the most tedious: applying the same change across dozens of repositories.</p><p>Think about an organisation with 100 repos that share the same Python framework. Same patterns, same structure. A security patch or logging change needs to go into 30 or 50 of them. That work goes on the backlog. And it sits there. For weeks. Sometimes months.</p><p>Here&#8217;s what we built. You do the work once, in one repo, with one agent. During that process, the agent learns things you didn&#8217;t know at the beginning. The spec gets refined through actually doing the work. Then you clone that refined spec across the other 49 repos. The agents spin up in parallel, each working in its own desktop, each applying the same pattern to a different codebase.</p><p>Do one in an hour. Do 49 in ten minutes.</p><p>Not all of them land perfectly. You review a group view that shows progress across all the cloned tasks: which ones are done, which ones need attention, which ones have already been merged. But the ratio of human effort to output changes dramatically. Instead of a new hire spending a week getting through three of them, you&#8217;re reviewing pull requests across all 49 by lunchtime.</p><h3>The acceleration curve</h3><p>We&#8217;ve adopted the <a href="https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16dd04">AI acceleration curve</a> from Steve Yegge&#8217;s January 2026 post. 
It&#8217;s eight steps, from basic model inference all the way up to orchestrated agent fleets.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JqLA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03064fdf-8d5c-4721-8c85-2d7bdd84ff33_1774x986.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JqLA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03064fdf-8d5c-4721-8c85-2d7bdd84ff33_1774x986.png 424w, https://substackcdn.com/image/fetch/$s_!JqLA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03064fdf-8d5c-4721-8c85-2d7bdd84ff33_1774x986.png 848w, https://substackcdn.com/image/fetch/$s_!JqLA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03064fdf-8d5c-4721-8c85-2d7bdd84ff33_1774x986.png 1272w, https://substackcdn.com/image/fetch/$s_!JqLA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03064fdf-8d5c-4721-8c85-2d7bdd84ff33_1774x986.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JqLA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03064fdf-8d5c-4721-8c85-2d7bdd84ff33_1774x986.png" width="1456" height="809" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/03064fdf-8d5c-4721-8c85-2d7bdd84ff33_1774x986.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:809,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:152740,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.helix.ml/i/189240244?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03064fdf-8d5c-4721-8c85-2d7bdd84ff33_1774x986.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JqLA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03064fdf-8d5c-4721-8c85-2d7bdd84ff33_1774x986.png 424w, https://substackcdn.com/image/fetch/$s_!JqLA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03064fdf-8d5c-4721-8c85-2d7bdd84ff33_1774x986.png 848w, https://substackcdn.com/image/fetch/$s_!JqLA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03064fdf-8d5c-4721-8c85-2d7bdd84ff33_1774x986.png 1272w, https://substackcdn.com/image/fetch/$s_!JqLA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03064fdf-8d5c-4721-8c85-2d7bdd84ff33_1774x986.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The instinct is to skip straight to step eight. Everybody wants the fleet.</p><p>But the reality is that a team running its first inference endpoints this quarter is the team that&#8217;ll be ready for agent fleets next year. Each step builds the organisational muscle, the infrastructure, the trust, that makes the next step possible.</p><p>Helix Coding agents are built for step eight. But Helix works at every step along the way. You can start with self-hosted inference and RAG. Add single-agent coding sessions when your team is comfortable. Move to multi-agent orchestration when you&#8217;ve seen enough to trust the workflow.</p><p>That&#8217;s not a compromise. It&#8217;s how platforms actually grow. You meet teams where they are. You solve the problem they have right now. 
And when they&#8217;re ready for the next level, the infrastructure is already there.</p><h3>Try it</h3><p>If you want to see where your team falls on the curve, or you just want to watch five AI agents build five different apps at the same time, we&#8217;d love to <a href="https://helix.ml/contact">talk</a>.</p>]]></content:encoded></item><item><title><![CDATA[Porting a Code RAG system from Python to Go: What the AI got wrong]]></title><description><![CDATA[Why we rewrote Kodit from Python to Go, what broke along the way, and what the new version means for users and integrators.]]></description><link>https://blog.helix.ml/p/porting-a-code-rag-system-from-python</link><guid isPermaLink="false">https://blog.helix.ml/p/porting-a-code-rag-system-from-python</guid><dc:creator><![CDATA[Phil Winder]]></dc:creator><pubDate>Thu, 26 Feb 2026 14:58:33 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!bvDT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8703d130-7d63-4fb0-b7ae-8754afec450c_1334x743.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Kodit started as a Python project. 
An <strong><a href="http://localhost:1313/early-adopter-release-kodit-mcp-external-repositories/">MCP server and CLI</a></strong> for indexing code repositories, combining BM25 keyword search with vector embeddings and reciprocal rank fusion to give AI coding assistants the context they need. Python served well for prototyping: FastAPI, SQLAlchemy, Pydantic, and a rich ecosystem of ML libraries made it straightforward to build and iterate.</p><p>But Python added friction. Deploying Kodit into the <strong><a href="https://github.com/helixml/helix">Helix</a></strong> ecosystem meant shipping a Python runtime, managing pip dependencies, and accepting the performance overhead of an interpreted language on a search-heavy workload. Since Helix is a Go project, it was obvious that Kodit should be in Go too. The goal was feature parity with the Python version, plus something new: a clean Go client API so that Helix and other projects could import Kodit as a library, not just call it as a server.</p><p>This article is the story of that migration. What changed architecturally, what broke along the way, and what the new Go version means for users. 
For the generic methodology behind AI-assisted cross-language migrations, see the <strong><a href="https://winder.ai/python-to-go-migration-with-claude-code/">companion article on Winder.AI</a></strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bvDT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8703d130-7d63-4fb0-b7ae-8754afec450c_1334x743.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bvDT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8703d130-7d63-4fb0-b7ae-8754afec450c_1334x743.png 424w, https://substackcdn.com/image/fetch/$s_!bvDT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8703d130-7d63-4fb0-b7ae-8754afec450c_1334x743.png 848w, https://substackcdn.com/image/fetch/$s_!bvDT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8703d130-7d63-4fb0-b7ae-8754afec450c_1334x743.png 1272w, https://substackcdn.com/image/fetch/$s_!bvDT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8703d130-7d63-4fb0-b7ae-8754afec450c_1334x743.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bvDT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8703d130-7d63-4fb0-b7ae-8754afec450c_1334x743.png" width="1334" height="743" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8703d130-7d63-4fb0-b7ae-8754afec450c_1334x743.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:743,&quot;width&quot;:1334,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:117729,&quot;alt&quot;:&quot;An image of Kodit being used inside Helix as Code Intelligence.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.helix.ml/i/188300352?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8703d130-7d63-4fb0-b7ae-8754afec450c_1334x743.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="An image of Kodit being used inside Helix as Code Intelligence." title="An image of Kodit being used inside Helix as Code Intelligence." 
srcset="https://substackcdn.com/image/fetch/$s_!bvDT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8703d130-7d63-4fb0-b7ae-8754afec450c_1334x743.png 424w, https://substackcdn.com/image/fetch/$s_!bvDT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8703d130-7d63-4fb0-b7ae-8754afec450c_1334x743.png 848w, https://substackcdn.com/image/fetch/$s_!bvDT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8703d130-7d63-4fb0-b7ae-8754afec450c_1334x743.png 1272w, https://substackcdn.com/image/fetch/$s_!bvDT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8703d130-7d63-4fb0-b7ae-8754afec450c_1334x743.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>The Migration Approach</strong></h2><p>The full methodology is covered in the <strong><a href="https://winder.ai/python-to-go-migration-with-claude-code/">Winder.AI article</a></strong>, but the short version is this: I set up a monorepo with the Python source and Go target side by side, wrote two design documents (CLAUDE.md for domain context and coding standards, MIGRATION.md for an ordered task checklist), and used <strong><a href="https://docs.anthropic.com/en/docs/claude-code/overview">Claude Code</a></strong> to generate the Go implementation in an automated loop.</p><p>What was specific to Kodit was the domain modelling.</p><p><strong>Bounded contexts.</strong> Kodit has four distinct areas: repositories (the code sources being indexed), enrichments and snippets (the indexed content and its metadata), search (the query pipeline), and configuration. Each maps to a directory in the Go codebase with its own domain, application, and infrastructure layers.</p><p><strong>Ubiquitous language.</strong> Terms like <em>enrichment</em>, <em>association</em>, <em>snippet</em>, and <em>embedding model</em> have precise meanings in the Kodit domain. These were documented in a glossary in CLAUDE.md so the AI would use them consistently rather than inventing its own terminology. 
Getting this right matters: when the AI starts calling an enrichment a &#8220;document&#8221; or a snippet a &#8220;chunk&#8221;, the generated code drifts from the existing schema and APIs.</p><p><strong>Layered architecture.</strong> The Go codebase follows a DDD-inspired structure: domain types have no external dependencies, application services orchestrate use cases, and infrastructure implementations handle persistence and external APIs. Layer rules are enforced by Go&#8217;s package system. The domain package never imports infrastructure.</p><p>This structural discipline paid off during the automated generation phase. With clear boundaries, the AI could generate code for one context without accidentally coupling it to another.</p><h2><strong>Architectural Decisions</strong></h2><p>Several important design decisions were made during the migration. Some were intentional. Some were discovered by accident.</p><h3><strong>Public API vs Internal</strong></h3><p>The AI defaulted to placing everything in Go&#8217;s <code>internal/</code> directory. This is idiomatic Go: <code>internal/</code> prevents external projects from importing your packages. But the whole point of this migration was to make Kodit consumable as a Go library. I needed Helix to be able to <code>import</code> Kodit&#8217;s search client, repository types, and configuration directly.</p><p>I discovered this problem halfway through the migration. Everything compiled. Tests passed. But nothing was importable from outside the module. The refactor to extract a proper public API surface was substantial. It required deciding which types and interfaces belonged in the public package, which stayed internal, and how the public client would wrap the internal application services.</p><p>The result is a clean Go client that any project can import:</p><pre><code><code>import "github.com/helixml/kodit/client"

c, err := client.New(client.Config{
    BaseURL: "http://localhost:8080",
})

results, err := c.Search(ctx, client.SearchQuery{
    Query:      "authentication middleware",
    Repository: "myorg/myrepo",
    Limit:      10,
})
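// Note (sketch, not a complete program): check both returned err values in
// real usage, e.g. if err != nil { return err }. The exact type of `results`
// is defined by the client package, so consult the Kodit repository for the
// current field names rather than assuming them.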
</code></code></pre><p>The lesson: define your public API surface before generating any code. If I had specified this in CLAUDE.md from the start, the AI would have structured the code around the public interface rather than burying everything in <code>internal/</code>.</p><h3><strong>The Snippets Resurrection</strong></h3><p>This was the standout domain failure of the migration.</p><p>In the early Python version, Kodit stored snippets in their own database table. Later, I consolidated the design: snippets became a type of unified enrichment, stored in the enrichments table with associations linking them to repositories and other enrichments. This simplified the schema to essentially two core tables: enrichments and associations. All content, whether a code snippet, a description, an embedding, or a repository reference, was an enrichment linked by associations.</p><p>But remnants of the old design remained in the Python codebase. Type hints referencing a <code>Snippet</code> model. Comments mentioning the snippets table. Variable names like <code>snippet_results</code>. The AI saw these, recognised that &#8220;snippet&#8221; was a core domain concept (it was in the ubiquitous language glossary, after all), and rebuilt the entire deprecated table and data access layer.</p><p>I only discovered the problem when I ran a migration test: importing real data from a running Python instance into the new Go version. The data migrated successfully (enrichments landed in the enrichments table), but searches returned zero results. The Go search pipeline was querying the snippets table, which was empty.</p><p>The fix was another refactor. &#8220;Snippet&#8221; touched nearly every layer of the codebase: domain types, repository interfaces, application services, API handlers, database queries. Every reference had to be redirected to the enrichments table and its association-based data model.</p><p>The lesson is twofold. First, clean up dead references before migration. 
If deprecated code exists anywhere in the source, the AI will find it and use it. Second, migration tests are essential. Smoke tests with fresh data are not sufficient. You need to test with real data from the previous version to catch schema-level regressions.</p><h3><strong>Configuration Scattering</strong></h3><p>The AI scattered configuration defaults and overrides across multiple files. A default embedding model in one package. An overridden batch size in another. Environment variable reads in a third. The Go version had no single place where you could see what the system&#8217;s configuration was, what the defaults were, or where values were being mutated.</p><p>The principle I enforced during refactoring: configuration should be set, defaulted, logged, validated, and mutated in exactly one place. In the Go version, this is the <code>config</code> package. Application services receive their configuration at construction time and never read environment variables or apply defaults themselves.</p><h3><strong>In-Memory Pagination</strong></h3><p>The AI initially created list endpoints that loaded all records from the database and paginated in memory. An obvious and stupid error.</p><p>I caught this during code review and required proper <code>LIMIT</code>/<code>OFFSET</code> queries flowing from the API layer through the application service into the database query. The pagination parameters are defined at the API boundary and propagated down to the DB.</p><p>The broader pattern here is that AI-generated code tends to take the path of least resistance. Loading everything and slicing in Go is simpler to write because the infrastructure is already there. Doing it the right way, threading pagination parameters through three layers, touches a lot of code. 
If you care about performance at scale, you need to specify these constraints in the design.</p><h2><strong>Testing and Validation</strong></h2><p>Building confidence in the new version required multiple layers of testing. No single strategy was sufficient on its own.</p><h3><strong>Unit Tests</strong></h3><p>These tests are fast and catch regressions in individual components, but they say nothing about whether the system works end-to-end. In the first version I focussed more on representative, real-life end-to-end and smoke tests.</p><h3><strong>Smoke Tests</strong></h3><p>I created a pair of smoke test suites: one targeting the Python version, one targeting the Go version, both executing the same sequence of operations. Index a repository. Create enrichments. Run a search. Compare results.</p><p>These smoke tests caught wiring issues that unit tests could not: missing middleware, incorrect route registrations, serialisation differences between FastAPI and Go&#8217;s HTTP handlers.</p><p>After creating a Python-era postgres dump, I wrote a new smoke test to ingest it and exercise other end-to-end workflows.</p><h3><strong>API Parity via OpenAPI</strong></h3><p>I also wrote a test that compares the OpenAPI specification generated by the Go version against the one generated by the Python version. This caught missing endpoints, wrong parameter types, incorrect response schemas, and structural differences that would break existing clients.</p><p>If you are migrating a web API, this test is essential. It provides a machine-readable contract between the old and new implementations.</p><h3><strong>Ranking Comparison</strong></h3><p>The most revealing test was a direct side-by-side comparison of search results. I ran the same queries against both versions and compared the ranked output.</p><p>The results were initially wrong. Completely wrong. 
The investigation uncovered multiple issues:</p><ul><li><p><strong>Truncation error.</strong> When converting embeddings to VectorChord&#8217;s database format, the Go version was incorrectly truncating the float arrays. Dimensions were being lost.</p></li><li><p><strong>RRF indexing error.</strong> The reciprocal rank fusion implementation had an off-by-one error when combining BM25 and semantic rankings.</p></li><li><p><strong>Wrong embedding read.</strong> The AI had added unrequested functionality to read multiple embedding formats from disk. This caused it to load the wrong embedding for a given snippet, producing nonsensical similarity scores.</p></li></ul><p>Each of these passed unit tests in isolation. Only the end-to-end ranking comparison revealed the compounding effect.</p><p>There was a silver lining. During this debugging, Claude noticed that the codebase was using L2 (Euclidean) distance rather than cosine distance for vector similarity. This was likely degrading results in the Python version too. A genuine improvement discovered by accident.</p><h3><strong>Migration Test</strong></h3><p>Testing with real data migrated from the old Python database to the new Go schema. This is what caught the snippets table regression described above. If you are rewriting a system that has existing production data, migration tests are non-negotiable. They test the one thing smoke tests cannot: whether the new system correctly handles legacy data.</p><h2><strong>What the AI Got Wrong</strong></h2><p>To be specific about where the AI failed on this project:</p><p><strong>Resurrecting deprecated features.</strong> The snippets table rebuild was the most expensive failure. The AI saw domain references, inferred importance, and recreated dead functionality. The fix touched dozens of files.</p><p><strong>Dead code accumulation.</strong> After refactoring from <code>internal/</code> to a public API, orphaned packages remained. 
They appeared used because other orphaned packages imported them. Identifying dead code required understanding the full dependency graph, which the AI could not do unprompted.</p><p><strong>Excessive functionality.</strong> The AI added features not present in the Python version: multiple embedding format readers, alternative search strategies, extra configuration options. Each addition introduced potential bugs with zero user value.</p><p><strong>Missing end-to-end wiring.</strong> Individual components worked. The application as a whole did not start correctly the first time. The AI generated each piece but never ran the server. Wiring errors (missing dependency injection, incorrect initialisation order) only appeared when the full system was assembled.</p><h2><strong>The New Kodit</strong></h2><p>What users and integrators get from the Go version:</p><p><strong>Go client library.</strong> Import <code>github.com/helixml/kodit/client</code> and use Kodit programmatically. Search, index repositories, manage enrichments, all through typed Go functions. This is the foundation for the Helix integration.</p><p><strong>Same interfaces.</strong> The MCP server and CLI behave identically to the Python version. Existing users should see no difference in their workflow.</p><p><strong>Database compatibility.</strong> SQLite for local-first usage. VectorChord/PostgreSQL for enterprise scale. The Go version supports both, matching the Python version&#8217;s flexibility.</p><p><strong>Performance.</strong> The Go version benefits from compiled execution and Go&#8217;s concurrency model for parallel indexing and search. Formal benchmarks are forthcoming, but my initial testing showed roughly a 5x performance improvement, and indexing large repositories was noticeably faster.</p><h2><strong>What&#8217;s Next</strong></h2><p>The migration itself inspired new functionality. 
The dead code and orphaned package problems I encountered manually are exactly the kind of issues Kodit should detect automatically. Dead code detection and duplication analysis are on the roadmap. I also want to get back to benchmarking and indexing improvements.</p><p>The Helix integration is underway, with Kodit&#8217;s Go client providing native code search within the Helix platform. Community contributions are welcome, particularly around new enrichment strategies and search pipeline improvements.</p><p>The <strong><a href="https://github.com/helixml/kodit/">Kodit repository</a></strong> is open source. Issues, discussions, and pull requests are the best way to get involved.</p><h2><strong>Conclusion</strong></h2><p>The rewrite was worth it. The Go version is cleaner, faster to deploy, and designed for library consumption from the start. The AI-assisted approach compressed what would have been months of manual translation into about three weeks, but it required constant human oversight of architecture and domain correctness.</p><p>The biggest lesson is this: AI coding assistants are powerful translators but poor architects. They will faithfully convert Python patterns to Go patterns, function by function, file by file. But they cannot see the system as a whole. They cannot question whether a deprecated table should be rebuilt. They cannot decide which packages should be public. 
They cannot judge whether in-memory pagination is acceptable at scale.</p>]]></content:encoded></item><item><title><![CDATA[How We Forked Zed and Added Remote Control for Agent Fleet Orchestration]]></title><description><![CDATA[Zed is a fast, GPU-accelerated code editor written in Rust.]]></description><link>https://blog.helix.ml/p/how-we-forked-zed-to-run-a-fleet</link><guid isPermaLink="false">https://blog.helix.ml/p/how-we-forked-zed-to-run-a-fleet</guid><dc:creator><![CDATA[Chris Sterry]]></dc:creator><pubDate>Wed, 25 Feb 2026 18:25:08 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!NBHu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b8af75c-12d1-496b-9106-a783b6c188ee_1397x782.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Zed is a fast, GPU-accelerated code editor written in Rust. It has excellent LSP support, a growing agent panel, and a clean architecture. It also has no concept of external orchestration &#8212; and that&#8217;s where our problem started.</p><p>Helix runs fleets of coding agents. Each agent is a headless Zed instance running inside a Docker container, connected to an LLM via the Agent Control Protocol (ACP). A central API dispatches tasks, monitors progress, manages thread lifecycles, and streams results back to users in real time. 
None of that is possible with stock Zed &#8212; so we forked it and added a WebSocket control plane.</p><p>This post covers what we built, the bugs that nearly broke us, and how we got streaming performance from O(N&#178;) down to O(delta).</p><h2>What We Needed From the Fork</h2><p>Three capabilities required forking:</p><ol><li><p><strong>Remote command injection</strong> &#8212; the API must be able to send chat messages, simulate user input, and query UI state in a running Zed instance, with no human at the keyboard.</p></li><li><p><strong>Event exfiltration</strong> &#8212; Zed must report back when a thread is created, when messages stream in, when the agent finishes, and when errors occur.</p></li><li><p><strong>Multi-thread lifecycle management</strong> &#8212; when a thread exhausts its context window, Helix starts a new one on the same WebSocket connection. Zed must handle multiple concurrent ACP threads per connection.</p></li></ol><h2>The WebSocket Sync Protocol</h2><p>The control plane is a single bidirectional WebSocket between the Helix API and each Zed instance. 
The API side lives in <code>websocket_external_agent_sync.go</code>; the Zed side in <code>crates/external_websocket_sync/</code>.</p><p><strong>Server &#8594; Zed (commands):</strong></p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!NBHu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b8af75c-12d1-496b-9106-a783b6c188ee_1397x782.png" width="1397" height="782" alt=""></figure></div><p><strong>Zed &#8594; Server (events):</strong></p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Ljf4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80d79bd4-1e9d-4b86-8e11-f904bf751734_1392x434.png" width="1392" height="434" alt=""></figure></div><p>Every message that touches a thread carries <code>acp_thread_id</code> for correlation. The <code>request_id</code> field ties a command to its eventual <code>thread_created</code> and <code>message_completed</code> events, so the API can track which user request produced which response.</p><h2>Architecture</h2><pre><code><code>Helix Frontend
      |
      | HTTP POST /api/v1/sessions/chat
      v
Helix API  ----WebSocket----&gt; Zed (headless, in container) ---ACP---&gt; LLM
      |                              |
      | pubsub (session_update,      | thread events
      | interaction_update)          | (message_added, etc.)
      v                              |
Helix Frontend &lt;----WebSocket--------+</code></code></pre><p>The API maintains a map of <code>acp_thread_id</code> to Helix session IDs. When a user sends a message, the API creates an Interaction record with the user&#8217;s prompt, then dispatches a <code>chat_message</code> command over the WebSocket. Zed creates or reuses an ACP thread, the LLM streams its response, and Zed relays each chunk back as <code>message_added</code> events. The API accumulates these into the Interaction&#8217;s response and publishes real-time updates to the frontend.</p><p>When the context window is exhausted, Helix sends a new <code>chat_message</code> without an <code>acp_thread_id</code>, prompting Zed to create a fresh thread. The new <code>thread_created</code> event maps it back to the same Helix session. One WebSocket connection manages the full lifecycle.</p><h2>Bug 1: The Multi-Message Accumulation Problem</h2><p>Zed&#8217;s agent panel produces multiple distinct entries per response turn: an assistant message, one or more tool calls, and a follow-up message. Each entry has its own <code>message_id</code>. Within a single entry, Zed streams <em>cumulative</em> content updates &#8212; the full content so far for that entry, not deltas.</p><p>The original code stored the response as a single string and overwrote it on each <code>message_added</code> event:</p><pre><code><code>// The bug
interaction.ResponseMessage = content</code></code></pre><p>Fine when there&#8217;s one <code>message_id</code>. With multiple entries:</p><ol><li><p><code>message_added(id="msg-1", content="I'll help you with that.")</code> &#8594; response = <code>"I'll help you with that."</code></p></li><li><p><code>message_added(id="msg-2", content="```tool\nedit")</code> &#8594; response = <code>"```tool\nedit"</code> (msg-1 gone)</p></li><li><p><code>message_added(id="msg-2", content="```tool\nedit file.py\n```")</code> &#8594; correct overwrite of msg-2, but msg-1 is still gone</p></li></ol><p>The fix tracks the byte offset where each <code>message_id</code>&#8217;s content begins. Same ID &#8594; replace from offset. New ID &#8594; append with separator, record new offset:</p><pre><code><code>type MessageAccumulator struct {
    Content       string
    LastMessageID string
    Offset        int // byte offset where current message_id starts
}

func (a *MessageAccumulator) AddMessage(messageID, content string) {
    if a.LastMessageID == "" {
        a.Content = content
        a.Offset = 0
        a.LastMessageID = messageID
        return
    }

    if a.LastMessageID == messageID {
        // Same message streaming -- replace from offset, keep prefix
        a.Content = a.Content[:a.Offset] + content
        return
    }

    // New distinct message -- record offset, append with separator
    a.Offset = len(a.Content) + 2 // account for "\n\n"
    a.Content = a.Content + "\n\n" + content
    a.LastMessageID = messageID
}</code></code></pre><p>Zed sends cumulative content per <code>message_id</code> (overwrite semantics), but the overall response is an append-only sequence of distinct message IDs. The accumulator handles both with a single offset tracker.</p><h2>Bug 2: The Completion Hang</h2><p>Users reported that responses would stream correctly but never show as complete &#8212; the loading spinner hung indefinitely.</p><p>The handler for <code>message_completed</code> published <code>session_update</code> events to the frontend. The frontend&#8217;s <code>session_update</code> handler has rejection logic: it checks whether the incoming session has the expected number of interactions and drops events that fail validation. A safeguard against stale data from out-of-order WebSocket messages &#8212; but it meant completion events were intermittently discarded.</p><p>The fix was to publish through both channels:</p><pre><code><code>// 1. interaction_update -- same channel used during streaming
//    ensures useLiveInteraction sees state=complete
err = apiServer.publishInteractionUpdateToFrontend(
    helixSessionID, helixSession.Owner, targetInteraction, messageRequestID)

// 2. session_update -- full session for React Query cache consistency
err = apiServer.publishSessionUpdateToFrontend(
reloadedSession, targetInteraction, messageRequestID)</code></code></pre><p>The <code>interaction_update</code> path targets a specific interaction rather than the full session, bypassing the rejection logic entirely. That&#8217;s the reliable path for completion signals.</p><h2>Shared Protocol Code: Eliminating Test Drift</h2><p>The original end-to-end tests used a Python mock WebSocket server that reimplemented the sync protocol. The accumulation bug above didn&#8217;t appear in tests because the Python mock had its own (simpler) message handling. Tests passed. Production broke.</p><p>The solution: extract a shared <code>wsprotocol</code> Go package that both the production Helix server and the Go test server import. Same parsing, same accumulation logic, same event dispatch. If the accumulator has a bug, the test catches it because it runs the same code path.</p><p>The package has three core components, plus the event structs in <code>types.go</code>. <strong>MessageAccumulator</strong> &#8212; the append/overwrite logic above. <strong>Protocol</strong> &#8212; manages the WebSocket lifecycle, reads and parses messages, dispatches to handlers. <strong>EventHandler interface</strong> &#8212; the seam between shared protocol code and environment-specific behavior:</p><pre><code><code>type EventHandler interface {
    OnAgentReady(conn *Conn, sessionID string) error
    OnThreadCreated(conn *Conn, sessionID string, evt *ThreadCreatedEvent) error
    OnMessageAdded(conn *Conn, sessionID string, evt *MessageAddedEvent, accumulated string) error
    OnMessageCompleted(conn *Conn, sessionID string, evt *MessageCompletedEvent) error
    OnUIStateResponse(conn *Conn, sessionID string, evt *UIStateResponseEvent) error
    OnThreadLoadError(conn *Conn, sessionID string, evt *ThreadLoadErrorEvent) error
    OnRawEvent(conn *Conn, sessionID string, msg *SyncMessage) error
}</code></code></pre><p>Production implements this with database writes and pubsub. Tests use in-memory tracking and assertions. The <code>OnRawEvent</code> escape hatch handles Helix-specific events without bloating the shared interface.</p><p>Adding a new event type: (1) add a struct to <code>types.go</code>, (2) add a case to <code>dispatch</code>, (3) add a method to <code>EventHandler</code>. Both production and test code get the change, or neither does. No more protocol drift.</p><h2>Streaming Performance: O(N&#178;) to O(delta)</h2><p>This was the most significant engineering challenge.</p><p>Streaming from Zed isn&#8217;t like streaming raw LLM output. An LLM token stream is purely append-only. Zed&#8217;s agent panel isn&#8217;t &#8212; a single response turn contains an assistant message, tool calls with status indicators, and follow-up messages, all interleaved. Those status indicators mutate in place mid-stream: <code>**Status: Running**</code> becomes <code>**Status: Completed**</code>. Content can change anywhere, not just at the end.</p><p>The naive approach &#8212; send the full accumulated response on every update &#8212; worked, but scaled badly. On every <code>message_added</code> event (dozens per second during fast token streaming), the API would:</p><ol><li><p>Query the database for the session</p></li><li><p>Query the database for the interaction</p></li><li><p>Write the updated interaction back</p></li><li><p>Serialize the entire interaction as JSON and publish it to the frontend</p></li></ol><p>For a 100KB response, this meant pushing 100KB over the WebSocket on every token. By the end of a long response, the browser was doing megabytes of string copying per second and the UI would visibly lag.</p><p><strong>Caching and throttling (Go side):</strong> A <code>streamingContext</code> struct caches the session and interaction for the lifetime of a streaming response, eliminating two database round-trips per token. 
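</p><p>The throttling described next is a simple time-gate pattern; here is a minimal, illustrative sketch (<code>flushGate</code> is invented for this post, not Helix&#8217;s actual <code>streamingContext</code> code):</p><pre><code><code>package main

import (
	"fmt"
	"time"
)

// flushGate reports whether at least interval has elapsed since *last,
// advancing *last when it fires. Illustrative only.
func flushGate(last *time.Time, interval time.Duration, now time.Time) bool {
	if now.Sub(*last) &lt; interval {
		return false
	}
	*last = now
	return true
}

func main() {
	var last time.Time
	start := time.Now()
	flushes := 0
	// simulate 100 token events arriving 10ms apart
	for i := 0; i &lt; 100; i++ {
		now := start.Add(time.Duration(i) * 10 * time.Millisecond)
		if flushGate(&amp;last, 200*time.Millisecond, now) {
			flushes++ // only these events trigger a database write
		}
	}
	fmt.Println(flushes) // 5 writes instead of 100
}</code></code></pre><p>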
Database writes are throttled to one every 200ms &#8212; the in-memory state always has the latest content, but we only flush to Postgres periodically. <code>message_completed</code> always writes the final state, so at most 200ms of content is lost on a crash. Frontend publishes are throttled to one every 50ms, since the frontend batches to <code>requestAnimationFrame</code> (~16ms) anyway.</p><p><strong>Patch-based deltas:</strong> Instead of sending the full interaction JSON on every update, the API computes a patch &#8212; the byte offset of the first change and the new content from that point forward. In the common case (pure append), the fast path fires: check that the new content starts with the previous content, return the offset and the suffix. One string prefix comparison.</p><p>For backwards edits (tool call status changing), the slow path finds the first differing rune.</p><p>The frontend receives <code>interaction_patch</code> events and applies them directly to a ref, bypassing React state during streaming. Multiple patches between animation frames are coalesced. The React Query cache isn&#8217;t touched until completion.</p><p>Wire traffic: O(N) per update &#8594; O(delta). For a 100KB response where each token adds ~20 bytes, that&#8217;s roughly a 5000x reduction per update.</p><h2>Bug 3: The UTF-16 Offset</h2><p>The first deployment of the patch protocol produced garbled text. Users saw <code>"de Statussktop"</code> where <code>"desktop"</code> should have appeared. Content in the database was correct &#8212; corruption was purely in rendering.</p><p>The root cause: <code>computePatch</code> returned byte offsets (Go&#8217;s <code>len()</code> counts bytes), but JavaScript <code>string.slice()</code> operates on UTF-16 code units. The streaming content contained 147 instances of <code>&#8250;</code> (U+203A, RIGHT SINGLE ANGLE QUOTATION MARK &#8212; Zed uses this as a breadcrumb separator in tool call output). 
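</p><p>The divergence is easy to demonstrate; a short illustrative snippet (the breadcrumb string is invented, not real Zed output):</p><pre><code><code>package main

import (
	"fmt"
	"unicode/utf16"
)

func main() {
	s := "Files \u203a src \u203a main.go" // two U+203A breadcrumb separators
	fmt.Println(len(s))                       // 25: UTF-8 bytes, what Go's len() counts
	fmt.Println(len(utf16.Encode([]rune(s)))) // 21: UTF-16 code units, what JS string.slice counts
}</code></code></pre><p>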
Each <code>&#8250;</code> is 3 bytes in UTF-8 but 1 UTF-16 code unit, creating a cumulative offset divergence of 294 bytes. When a backwards edit occurred &#8212; a tool call status change &#8212; the patch was spliced into the wrong position.</p><p>The fix iterates by rune and tracks UTF-16 code unit position:</p><pre><code><code>func utf16RuneLen(r rune) int {
    if r &gt;= 0x10000 {
        return 2 // surrogate pair
    }
    return 1
}</code></code></pre><p>The slow path decodes runes from both strings in lockstep, accumulating <code>utf16Off</code> alongside <code>byteOff</code>. Supplementary plane characters (emoji like &#128228;) count as 2 UTF-16 code units.</p><p><strong>Zed-side throttling:</strong> Zed fires an <code>EntryUpdated</code> event on every LLM token. At high token rates, that&#8217;s hundreds of <code>message_added</code> messages per second, most of them redundant since the Go side only publishes every 50ms anyway. A 100ms throttle in Zed&#8217;s <code>thread_service.rs</code> buffers intermediate updates and flushes before every <code>message_completed</code>. Nothing is dropped; wire traffic drops by ~90%.</p><div><hr></div><p>The overall shape of the work: fork a fast editor, add a protocol layer, find three distinct bugs each caused by a different mismatch between assumptions (overwrite vs. append semantics, session-level vs. interaction-level events, byte offsets vs. UTF-16 code units), then fix the performance problem that only appears at scale.
Standard distributed systems work, with a Rust/Go language boundary making everything a bit more interesting.</p><p>Code is available at <a href="https://github.com/helixml/helix">github.com/helixml/helix</a>.</p>]]></content:encoded></item><item><title><![CDATA[How We Made Docker Builds 193x Faster: From 45 Minutes to 14 Seconds]]></title><description><![CDATA[The Problem]]></description><link>https://blog.helix.ml/p/how-we-made-docker-builds-193x-faster</link><guid isPermaLink="false">https://blog.helix.ml/p/how-we-made-docker-builds-193x-faster</guid><dc:creator><![CDATA[Chris Sterry]]></dc:creator><pubDate>Tue, 24 Feb 2026 14:40:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!kqI_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48b4a148-7926-4025-8315-fc295fd44768_961x466.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><h2><strong>The Problem</strong></h2><p>Helix runs AI coding agents inside isolated desktop containers &#8212; each agent gets its own GNOME desktop with a full IDE, Docker daemon, and development environment. When an agent needs to build a project, it runs <code>docker build</code> inside its container.</p><p>The problem: <strong>every new agent session started with a cold Docker build cache</strong>. The containers are ephemeral &#8212; when a session ends, the container is destroyed along with its Docker state. For a project like Helix itself (which compiles a Rust IDE, Go APIs, Python services, and a Node.js frontend), a cold build takes <strong>43 minutes</strong>. That&#8217;s 43 minutes of an agent sitting there waiting for builds before it can start working.</p><p>This matters because multiple agents regularly clone the exact same source code. Ten agents working on ten different tasks in the same repo all need to build the same base images. 
Without shared caching, that&#8217;s 10 * 43 minutes = 7 hours of redundant compilation.</p><h2><strong>The Architecture</strong></h2><p>The container nesting looks like this:</p><pre><code><code>Host Machine
&#9492;&#9472;&#9472; sandbox-nvidia (Docker-in-Docker host)
    &#9500;&#9472;&#9472; helix-buildkit (shared BuildKit instance)
    &#9474;   &#9492;&#9472;&#9472; buildkit_state volume (persistent cache)
    &#9500;&#9472;&#9472; helix-registry (shared Docker registry)
    &#9474;   &#9492;&#9472;&#9472; registry_data volume (layer-level transfer cache)
    &#9500;&#9472;&#9472; agent-session-A (desktop container)
    &#9474;   &#9492;&#9472;&#9472; local dockerd &#8594; builds route to shared BuildKit
    &#9500;&#9472;&#9472; agent-session-B (desktop container)
    &#9474;   &#9492;&#9472;&#9472; local dockerd &#8594; builds route to shared BuildKit
    &#9492;&#9472;&#9472; agent-session-C ...
</code></code></pre><p>Each desktop container runs its own Docker daemon (for isolation), but all builds route to a <strong>shared BuildKit instance</strong> at the sandbox level. The BuildKit cache is stored on a persistent Docker volume that survives container restarts.</p><p>The key insight: when Agent B builds the same Dockerfile that Agent A already built, BuildKit says &#8220;I already have all these layers cached&#8221; and the build completes instantly. The cache is content-addressed &#8212; identical inputs produce identical cache keys regardless of which container initiated the build.</p><h2><strong>The </strong><code>--load</code><strong> Bottleneck</strong></h2><p>Shared BuildKit got us halfway there. Builds were fast (~0.5 seconds for fully cached images), but there was a catch: <strong>the image still needed to be loaded into the local Docker daemon</strong>.</p><p>When using a remote BuildKit builder, <code>docker buildx build --load</code> exports the built image as a tarball, streams it over gRPC to the client, and imports it into the local daemon. 
This happens even when every layer is cached and the image hasn&#8217;t changed at all.</p><p>For a 7.73GB image (our desktop base image with GNOME, IDE, and dev tools):</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!5Hb7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15281ce8-dc89-4842-82b1-c2e85ad17607_622x185.png" width="622" height="185" alt=""></figure></div><p>That&#8217;s 10 seconds to transfer an image that didn&#8217;t change.
The <code>--load</code> flag serializes the entire image into a Docker-format tarball, streams it over gRPC, and the receiving daemon deserializes and imports every layer &#8212; even layers it already has. There&#8217;s no layer-level deduplication in the tarball transfer path.</p><p>This adds up: building Helix involves 6+ images. Even with a hot BuildKit cache, the <code>--load</code> overhead per image turns a sub-second build into a 10-second wait, and the full stack build takes ~23 seconds of mostly <code>--load</code> transfers.</p><h2><strong>Smart </strong><code>--load</code></h2><p>The first optimization: <strong>don&#8217;t load the image if it hasn&#8217;t changed</strong>.</p><pre><code><code>docker build -t myapp:latest .
  &#9492;&#9472;&#9472; wrapper intercepts
      1. Build with --output type=image --provenance=false --iidfile /tmp/iid
         &#8594; BuildKit resolves all layers (cached: ~0.5s)
         &#8594; Writes image config digest to iidfile
         &#8594; No tarball transfer (--output type=image stores in BuildKit only)
      2. Compare iidfile digest with local daemon's image ID
         &#8594; docker images --no-trunc -q myapp:latest
      3. Match? &#8594; Skip --load. "Image unchanged, skipping load"
         Differ? &#8594; Use registry push/pull for layer-level transfer
</code></code></pre><p>A transparent wrapper at <code>/usr/local/bin/docker</code> intercepts both <code>docker build</code> and <code>docker buildx build</code>, applying this logic automatically. No code changes needed in build scripts, Makefiles, or CI pipelines.</p><h3><strong>Three Critical Details</strong></h3><p><strong>1. </strong><code>--iidfile</code><strong> is empty without an output mode on remote builders.</strong></p><p><code>docker buildx build --iidfile /tmp/iid -t foo .</code> with a remote builder produces an <strong>empty iidfile</strong>. BuildKit doesn&#8217;t compute the image config digest unless it actually exports something. The fix: <code>--output type=image</code> tells BuildKit to create the manifest in its internal store (instant for cached builds, no data transfer) and populates the iidfile.</p><p><strong>2. </strong><code>--provenance=false</code><strong> is required.</strong></p><p>With default provenance, BuildKit wraps the image manifest in a <strong>manifest list</strong> that includes an attestation document with build timestamps. The iidfile gets the manifest list digest, which changes every build (because the timestamp changes). With <code>--provenance=false</code>, the iidfile contains the bare image config digest &#8212; deterministic and matching what <code>docker images --no-trunc -q</code> returns.</p><p><strong>3. The wrapper must handle both </strong><code>docker build</code><strong> and </strong><code>docker buildx build</code><strong>.</strong></p><p>Docker 29.x&#8217;s <code>docker build</code> ignores the default buildx builder entirely &#8212; it always uses the local daemon&#8217;s built-in BuildKit. Only <code>docker buildx build</code> honors the configured builder. 
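</p><p>Putting the three details together, the digest check at the heart of smart <code>--load</code> reduces to a small shell function. This is an illustrative simplification of the wrapper logic, not its exact code:</p><pre><code>image_unchanged() {
  # Compare the digest BuildKit wrote via --iidfile (built with
  # --output type=image --provenance=false) against the image ID
  # the local daemon already has for this tag.
  local tag="$1" iidfile="$2"
  local built_id local_id
  built_id=$(cat "$iidfile" 2>/dev/null) || return 1
  local_id=$(docker images --no-trunc -q "$tag" 2>/dev/null)
  [ -n "$built_id" ] || return 1
  [ "$built_id" = "$local_id" ]
}
</code></pre><p>If the function succeeds, the wrapper skips <code>--load</code> entirely; otherwise it falls through to the transfer path.</p><p>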
The wrapper rewrites <code>docker build</code> to <code>docker buildx build</code> (to use the shared cache) and applies smart <code>--load</code> (to avoid the tarball transfer).</p><h2><strong>Registry-Accelerated Loading</strong></h2><p>Smart <code>--load</code> eliminates the transfer when nothing changed. But when code <em>does</em> change, even a one-line change in the top layer of a 7.73GB image still triggers a full tarball <code>--load</code> (~10s). The tarball format doesn&#8217;t support layer-level deduplication &#8212; it&#8217;s all or nothing.</p><p>We solved this with a <strong>shared Docker registry</strong> running alongside BuildKit on the sandbox network. When the wrapper detects an image has changed, instead of <code>--load</code>:</p><ol><li><p><strong>Push</strong> to the registry &#8212; BuildKit pushes only the changed layers (~0.1s)</p></li><li><p><strong>Pull</strong> from the registry &#8212; the local daemon checks which layers it already has, downloads only the new ones (~0.5s)</p></li></ol><p>The Docker registry protocol does layer-level dedup natively.
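</p><p>A sketch of that changed-image path, with an illustrative registry address and simplified flags (the real wrapper&#8217;s invocation may differ):</p><pre><code>registry_load() {
  # The push resolves against the registry and sends only missing layers;
  # the pull then fetches only the layers the local daemon lacks.
  local tag="$1" registry="$2"
  docker buildx build --output "type=registry,name=${registry}/${tag}" . || return 1
  docker pull "${registry}/${tag}" || return 1
  docker tag "${registry}/${tag}" "$tag"
}
</code></pre><p>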
For a 7.73GB image with 95 base layers and 1 changed layer, the pull shows 95 &#8220;Already exists&#8221; and downloads only the single new layer.</p><h3><strong>Benchmarks: 1-line change in top layer of 7.73GB image</strong></h3><p>Measured E2E inside a real desktop container, 3 runs each:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-Hf7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F547fe0d3-2152-4024-a38d-b68f52548d43_690x233.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-Hf7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F547fe0d3-2152-4024-a38d-b68f52548d43_690x233.png 424w, https://substackcdn.com/image/fetch/$s_!-Hf7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F547fe0d3-2152-4024-a38d-b68f52548d43_690x233.png 848w, https://substackcdn.com/image/fetch/$s_!-Hf7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F547fe0d3-2152-4024-a38d-b68f52548d43_690x233.png 1272w, https://substackcdn.com/image/fetch/$s_!-Hf7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F547fe0d3-2152-4024-a38d-b68f52548d43_690x233.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-Hf7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F547fe0d3-2152-4024-a38d-b68f52548d43_690x233.png" width="690" height="233" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/547fe0d3-2152-4024-a38d-b68f52548d43_690x233.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:233,&quot;width&quot;:690,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:37717,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.helix.ml/i/189022558?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F547fe0d3-2152-4024-a38d-b68f52548d43_690x233.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-Hf7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F547fe0d3-2152-4024-a38d-b68f52548d43_690x233.png 424w, https://substackcdn.com/image/fetch/$s_!-Hf7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F547fe0d3-2152-4024-a38d-b68f52548d43_690x233.png 848w, https://substackcdn.com/image/fetch/$s_!-Hf7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F547fe0d3-2152-4024-a38d-b68f52548d43_690x233.png 1272w, https://substackcdn.com/image/fetch/$s_!-Hf7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F547fe0d3-2152-4024-a38d-b68f52548d43_690x233.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p></p><p>The three paths compose naturally:</p><ol><li><p><strong>Image unchanged</strong> &#8594; skip load entirely (314ms)</p></li><li><p><strong>Image changed, registry available</strong> &#8594; 
push/pull via registry (871ms)</p></li><li><p><strong>Image changed, no registry</strong> &#8594; fall back to tarball <code>--load</code> (10s)</p></li></ol><h2><strong>Results</strong></h2><p>There are two cases that matter: cold start (first agent to build a project) and warm start (subsequent agents building the same source).</p><h3><strong>Cold start: ~10 minutes (down from 45 minutes)</strong></h3><p>A fresh agent session starts with an empty Docker daemon &#8212; no images, no layers. Even though every build is a cache hit in shared BuildKit (the compilation is instant), the images still need to be transferred into the local daemon. For Helix-in-Helix, this is a deeply nested pipeline:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kqI_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48b4a148-7926-4025-8315-fc295fd44768_961x466.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kqI_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48b4a148-7926-4025-8315-fc295fd44768_961x466.png 424w, https://substackcdn.com/image/fetch/$s_!kqI_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48b4a148-7926-4025-8315-fc295fd44768_961x466.png 848w, https://substackcdn.com/image/fetch/$s_!kqI_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48b4a148-7926-4025-8315-fc295fd44768_961x466.png 1272w, 
https://substackcdn.com/image/fetch/$s_!kqI_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48b4a148-7926-4025-8315-fc295fd44768_961x466.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kqI_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48b4a148-7926-4025-8315-fc295fd44768_961x466.png" width="961" height="466" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/48b4a148-7926-4025-8315-fc295fd44768_961x466.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:466,&quot;width&quot;:961,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:76707,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.helix.ml/i/189022558?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48b4a148-7926-4025-8315-fc295fd44768_961x466.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kqI_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48b4a148-7926-4025-8315-fc295fd44768_961x466.png 424w, https://substackcdn.com/image/fetch/$s_!kqI_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48b4a148-7926-4025-8315-fc295fd44768_961x466.png 848w, https://substackcdn.com/image/fetch/$s_!kqI_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48b4a148-7926-4025-8315-fc295fd44768_961x466.png 1272w, 
https://substackcdn.com/image/fetch/$s_!kqI_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48b4a148-7926-4025-8315-fc295fd44768_961x466.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p></p><p>The cold start is dominated by <strong>image transfer, not compilation</strong>. BuildKit resolves all layers instantly (cached), but loading 7+ GB images into each nesting level takes time.
The bottleneck is the <code>--load</code> tarball path: it serializes the entire image regardless of what the receiving daemon already has.</p><p>The nesting makes this worse: Helix-in-Helix has the desktop container (L2) building an inner sandbox (L3), which needs the same 7.24GB desktop image transferred again to a fresh daemon one level deeper.</p><h3><strong>Warm start: 23 seconds (124x faster)</strong></h3><p>Once images exist in the local daemon, subsequent builds are near-instant:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7bRU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0784f7fb-1064-4d93-bafc-f829abd5518d_787x290.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7bRU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0784f7fb-1064-4d93-bafc-f829abd5518d_787x290.png 424w, https://substackcdn.com/image/fetch/$s_!7bRU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0784f7fb-1064-4d93-bafc-f829abd5518d_787x290.png 848w, https://substackcdn.com/image/fetch/$s_!7bRU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0784f7fb-1064-4d93-bafc-f829abd5518d_787x290.png 1272w, https://substackcdn.com/image/fetch/$s_!7bRU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0784f7fb-1064-4d93-bafc-f829abd5518d_787x290.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!7bRU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0784f7fb-1064-4d93-bafc-f829abd5518d_787x290.png" width="787" height="290" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0784f7fb-1064-4d93-bafc-f829abd5518d_787x290.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:290,&quot;width&quot;:787,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:46854,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.helix.ml/i/189022558?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0784f7fb-1064-4d93-bafc-f829abd5518d_787x290.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7bRU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0784f7fb-1064-4d93-bafc-f829abd5518d_787x290.png 424w, https://substackcdn.com/image/fetch/$s_!7bRU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0784f7fb-1064-4d93-bafc-f829abd5518d_787x290.png 848w, https://substackcdn.com/image/fetch/$s_!7bRU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0784f7fb-1064-4d93-bafc-f829abd5518d_787x290.png 1272w, https://substackcdn.com/image/fetch/$s_!7bRU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0784f7fb-1064-4d93-bafc-f829abd5518d_787x290.png 1456w" sizes="100vw" 
loading="lazy"></picture><div></div></div></a></figure></div><p></p><p>Smart --load checks the image digest against the local daemon (~0.3s) and skips the transfer when nothing changed.
This is the common case: agents working on the same codebase where the base images haven&#8217;t been modified.</p><h3><strong>Incremental changes: ~1 second per image</strong></h3><p>When code actually changes, the registry-accelerated load transfers only the changed layers:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZaHt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d41b22-d535-4fb4-aa01-cd0d97860a17_579x228.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZaHt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d41b22-d535-4fb4-aa01-cd0d97860a17_579x228.png 424w, https://substackcdn.com/image/fetch/$s_!ZaHt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d41b22-d535-4fb4-aa01-cd0d97860a17_579x228.png 848w, https://substackcdn.com/image/fetch/$s_!ZaHt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d41b22-d535-4fb4-aa01-cd0d97860a17_579x228.png 1272w, https://substackcdn.com/image/fetch/$s_!ZaHt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d41b22-d535-4fb4-aa01-cd0d97860a17_579x228.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZaHt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d41b22-d535-4fb4-aa01-cd0d97860a17_579x228.png" width="579" height="228" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/30d41b22-d535-4fb4-aa01-cd0d97860a17_579x228.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:228,&quot;width&quot;:579,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:31293,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.helix.ml/i/189022558?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d41b22-d535-4fb4-aa01-cd0d97860a17_579x228.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZaHt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d41b22-d535-4fb4-aa01-cd0d97860a17_579x228.png 424w, https://substackcdn.com/image/fetch/$s_!ZaHt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d41b22-d535-4fb4-aa01-cd0d97860a17_579x228.png 848w, https://substackcdn.com/image/fetch/$s_!ZaHt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d41b22-d535-4fb4-aa01-cd0d97860a17_579x228.png 1272w, https://substackcdn.com/image/fetch/$s_!ZaHt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d41b22-d535-4fb4-aa01-cd0d97860a17_579x228.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p></p><p>A one-line Go change rebuilds only the final compilation layer (~30s) and transfers only that layer via the registry (~1s) instead of the entire 43-minute pipeline.</p><h2><strong>Compose Build 
Interception</strong></h2><p>There was a gap in the smart <code>--load</code> optimization: <code>docker compose build</code> bypassed it entirely.</p><p>Docker Compose invokes BuildKit through its own Go API, not through the CLI. Our wrapper intercepts <code>docker build</code> and <code>docker buildx build</code>, but compose calls <code>buildx bake</code> internally &#8212; so smart <code>--load</code> never fires. Every compose build did a full tarball <code>--load</code>, even for unchanged images.</p><p>The fix: the wrapper now intercepts <code>docker compose ... build</code>, parses the compose config to extract each service&#8217;s build definition, and builds them individually through the existing smart <code>--load</code> path:</p><pre><code><code>docker compose -f docker-compose.dev.yaml build
  &#9492;&#9472;&#9472; wrapper intercepts (compose + build detected)
      1. $REAL_DOCKER compose config --format json
         &#8594; extract services, image names, build contexts, Dockerfiles, args
      2. For each service with a build section:
         &#8594; docker buildx build -t $IMAGE -f $DOCKERFILE $CONTEXT
         &#8594; smart --load: skip if unchanged, registry push/pull if changed
      3. Compose up finds the images locally.
</code></code></pre><p>Results:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!va1i!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf4f8971-3a12-4126-b191-9cc24b89032e_900x171.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!va1i!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf4f8971-3a12-4126-b191-9cc24b89032e_900x171.png 424w, https://substackcdn.com/image/fetch/$s_!va1i!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf4f8971-3a12-4126-b191-9cc24b89032e_900x171.png 848w, https://substackcdn.com/image/fetch/$s_!va1i!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf4f8971-3a12-4126-b191-9cc24b89032e_900x171.png 1272w, https://substackcdn.com/image/fetch/$s_!va1i!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf4f8971-3a12-4126-b191-9cc24b89032e_900x171.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!va1i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf4f8971-3a12-4126-b191-9cc24b89032e_900x171.png" width="900" height="171" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/df4f8971-3a12-4126-b191-9cc24b89032e_900x171.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:171,&quot;width&quot;:900,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:32243,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.helix.ml/i/189022558?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf4f8971-3a12-4126-b191-9cc24b89032e_900x171.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!va1i!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf4f8971-3a12-4126-b191-9cc24b89032e_900x171.png 424w, https://substackcdn.com/image/fetch/$s_!va1i!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf4f8971-3a12-4126-b191-9cc24b89032e_900x171.png 848w, https://substackcdn.com/image/fetch/$s_!va1i!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf4f8971-3a12-4126-b191-9cc24b89032e_900x171.png 1272w, https://substackcdn.com/image/fetch/$s_!va1i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf4f8971-3a12-4126-b191-9cc24b89032e_900x171.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p></p><p>Not as dramatic as the other optimizations, but 6 seconds saved on every warm build adds up across thousands of agent sessions.</p><h2><strong>The Golden Docker Cache: Eliminating Cold Start 
Entirely</strong></h2><p>Smart <code>--load</code>, registry-accelerated transfers, and compose interception transformed warm starts from 45 minutes to 23 seconds. But the cold start &#8212; the first agent session for a project &#8212; still took <strong>10 minutes</strong>. Every image had to be transferred into an empty Docker daemon, even though BuildKit compiled nothing.</p><p>We wanted cold start to feel like warm start. Zero penalty for being the first session.</p><h3><strong>The idea</strong></h3><p>When code merges to main, automatically spin up a desktop container, run the project&#8217;s startup script (which builds all the Docker images), then snapshot the entire <code>/var/lib/docker</code> directory. When a new session starts, copy that snapshot &#8212; the &#8220;golden cache&#8221; &#8212; into the session&#8217;s Docker data directory. The local daemon starts with all images pre-populated. No builds, no transfers, no waiting.</p><h3><strong>Why it captures everything</strong></h3><p>Docker&#8217;s data directory contains everything the daemon needs:</p><ul><li><p><strong>Image layers</strong> (<code>overlay2/</code>) &#8212; all built images, all layers</p></li><li><p><strong>Docker volumes</strong> (<code>volumes/</code>) &#8212; inner registries, BuildKit state, nested Docker data</p></li><li><p><strong>Container metadata</strong> &#8212; not useful (containers don&#8217;t survive restart), but harmless</p></li></ul><p>For a project like Helix-in-Helix, the golden cache even includes the inner sandbox&#8217;s Docker data (stored as a Docker volume within the session&#8217;s daemon). The inner sandbox starts with its images pre-populated too &#8212; no transfer through the inner registry needed.</p><h3><strong>The build is just a startup script run</strong></h3><p>Golden builds are beautifully simple: they&#8217;re regular desktop containers with one special environment variable (<code>HELIX_GOLDEN_BUILD=true</code>). 
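</p><p>In sketch form, the golden branch of the entrypoint looks roughly like this (the script names and paths are illustrative assumptions, not the real Helix code):</p><pre><code>run_workspace() {
  if [ "$HELIX_GOLDEN_BUILD" = "true" ]; then
    # Golden mode: build everything once via the startup script, no IDE.
    git clone "$REPO_URL" /tmp/golden-workspace || return 1
    git -C /tmp/golden-workspace checkout main || return 1
    sh /tmp/golden-workspace/startup.sh   # golden build result
    return $?
  fi
  echo "launching IDE"   # normal interactive session path
}
</code></pre><p>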
The container clones the repo, checks out main, runs the startup script, then exits. The workspace setup script detects golden mode and skips launching the IDE; it just runs the startup script in the foreground and exits with its return code.</p><p>No new build system. No image manifest parsing. No layer-level copying. The startup script already knows how to build the project. We just run it once and keep the result.</p><h3><strong>Per-project, automatic, incremental</strong></h3><p>Each project gets its own golden cache, scoped by project ID:</p><pre><code><code>/container-docker/
&#9500;&#9472;&#9472; golden/
&#9474;   &#9500;&#9472;&#9472; prj_abc123/docker/    &#8592; Project A's golden (8.7 GB)
&#9474;   &#9492;&#9472;&#9472; prj_def456/docker/    &#8592; Project B's golden (3.2 GB)
&#9492;&#9472;&#9472; sessions/
    &#9492;&#9472;&#9472; docker-data-ses_xyz/docker/  &#8592; copied from golden at session start
</code></code></pre><p>Golden builds trigger automatically when code merges to main (via PR merge or internal approve-implementation). They&#8217;re debounced per-project &#8212; if a build is already running, additional merges are skipped. And critically, they&#8217;re <strong>incremental</strong>: each golden build starts from the previous golden cache, so only changed images need rebuilding. A typical incremental golden build takes 30 seconds to 2 minutes, not 10 minutes.</p><h3><strong>The overlayfs false start</strong></h3><p>Our first approach was elegant on paper: use overlayfs with the golden as the read-only lower directory and a per-session upper directory for copy-on-write. O(1) mount time, true COW semantics, minimal disk usage.</p><p>It didn&#8217;t work. Docker&#8217;s overlay2 storage driver creates its own overlayfs mounts inside <code>/var/lib/docker/overlay2/</code>. Nested overlayfs requires the upper directory to be on a non-overlayfs filesystem &#8212; our merged directory was itself overlayfs, so Docker failed with <code>invalid argument</code>. This is a kernel-level restriction, not a configuration issue.</p><h3><strong>The copy approach that actually works</strong></h3><p>We switched to <code>cp -a</code>: copy the entire golden directory to the session&#8217;s Docker data directory at session start. 
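</p><p>A sketch of the session-start hydration, using the directory layout above (the function name and argument order are illustrative):</p><pre><code>hydrate_session() {
  # Copy the project's golden Docker data directory into a fresh session
  # directory so the daemon starts with all images pre-populated.
  local project="$1" session="$2" root="${3:-/container-docker}"
  local golden="${root}/golden/${project}/docker"
  local target="${root}/sessions/docker-data-${session}/docker"
  [ -d "$golden" ] || return 0   # no golden cache yet: plain cold start
  mkdir -p "${root}/sessions/docker-data-${session}"
  cp -a "$golden" "$target"
}
</code></pre><p>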
Less elegant than overlayfs, but it works reliably and performs well enough:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1R8r!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50d74180-5f8c-48af-99cb-ad1be5693a50_959x356.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1R8r!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50d74180-5f8c-48af-99cb-ad1be5693a50_959x356.png 424w, https://substackcdn.com/image/fetch/$s_!1R8r!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50d74180-5f8c-48af-99cb-ad1be5693a50_959x356.png 848w, https://substackcdn.com/image/fetch/$s_!1R8r!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50d74180-5f8c-48af-99cb-ad1be5693a50_959x356.png 1272w, https://substackcdn.com/image/fetch/$s_!1R8r!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50d74180-5f8c-48af-99cb-ad1be5693a50_959x356.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1R8r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50d74180-5f8c-48af-99cb-ad1be5693a50_959x356.png" width="959" height="356" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/50d74180-5f8c-48af-99cb-ad1be5693a50_959x356.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:356,&quot;width&quot;:959,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:63453,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.helix.ml/i/189022558?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50d74180-5f8c-48af-99cb-ad1be5693a50_959x356.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1R8r!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50d74180-5f8c-48af-99cb-ad1be5693a50_959x356.png 424w, https://substackcdn.com/image/fetch/$s_!1R8r!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50d74180-5f8c-48af-99cb-ad1be5693a50_959x356.png 848w, https://substackcdn.com/image/fetch/$s_!1R8r!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50d74180-5f8c-48af-99cb-ad1be5693a50_959x356.png 1272w, https://substackcdn.com/image/fetch/$s_!1R8r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50d74180-5f8c-48af-99cb-ad1be5693a50_959x356.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 
20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>13.8 seconds to go from empty daemon to 8.7 GB of pre-built images. Compare that to 10 minutes of building and transferring through nested daemons.</p><h3><strong>Staleness is handled gracefully</strong></h3><p>What if code changes after the golden was built? The session starts with slightly stale images, but the smart <code>--load</code> optimization handles it transparently. When the startup script runs <code>docker build</code>, the wrapper checks the image digest against BuildKit &#8212; if it&#8217;s changed, the registry push/pull transfers only the changed layers (~1 second). 
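</p><p>The decision logic is just a digest comparison. A toy model of that check (the function names are ours, not the real wrapper's, and the digests are made up; <code>pull</code> is a stub counter rather than a real registry pull):</p>

```shell
# Toy model of "only pull when the digest changed"; not the real wrapper.
pulls=0
pull() { pulls=$((pulls + 1)); }

maybe_pull() {
  # $1 = digest reported by BuildKit, $2 = digest in the session daemon
  [ "$1" = "$2" ] || pull
}

maybe_pull sha256:aaa sha256:aaa   # golden still current: no transfer
maybe_pull sha256:bbb sha256:aaa   # code changed: fetch only the delta
```

<p>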
The golden provides a warm baseline; the wrapper handles the delta.</p><p>The golden rebuilds on the next merge to main, so staleness is bounded by the development cycle.</p><h2><strong>The Full Picture</strong></h2><p>Here&#8217;s where we ended up, starting from 45 minutes:</p><h3><strong>Cold start: 14 seconds (from 10 minutes, from 45 minutes)</strong></h3><table><thead><tr><th>Phase</th><th>Original</th><th>Smart <code>--load</code></th><th>Golden cache</th></tr></thead><tbody><tr><td>API + frontend (compose)</td><td>200s</td><td>41s</td><td><strong>0s</strong> (pre-built)</td></tr><tr><td>Zed IDE + desktop image</td><td>459s</td><td>132s</td><td><strong>0s</strong> (pre-built)</td></tr><tr><td>Inner sandbox setup</td><td>2,075s</td><td>380s</td><td><strong>0s</strong> (pre-built)</td></tr><tr><td>Golden copy</td><td>&#8212;</td><td>&#8212;</td><td><strong>14s</strong></td></tr><tr><td><strong>Total</strong></td><td>45 min</td><td>10 min</td><td><strong>14s</strong></td></tr><tr><td><strong>Speedup</strong></td><td>baseline</td><td>4.5x</td><td><strong>193x</strong></td></tr></tbody></table><h3><strong>Warm start: 23 seconds (unchanged)</strong></h3><p>The warm start didn&#8217;t change &#8212; it was already fast from smart <code>--load</code>. The golden cache&#8217;s value is making cold start match warm start.</p><h3><strong>Incremental golden builds: 30s&#8211;2 min</strong></h3><p>Golden builds start from the previous golden, so they only rebuild what changed:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bRQ4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92d3bba-f3c8-4cb5-9735-87c2dbf50d5d_601x243.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bRQ4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92d3bba-f3c8-4cb5-9735-87c2dbf50d5d_601x243.png 424w, https://substackcdn.com/image/fetch/$s_!bRQ4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92d3bba-f3c8-4cb5-9735-87c2dbf50d5d_601x243.png 848w,
https://substackcdn.com/image/fetch/$s_!bRQ4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92d3bba-f3c8-4cb5-9735-87c2dbf50d5d_601x243.png 1272w, https://substackcdn.com/image/fetch/$s_!bRQ4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92d3bba-f3c8-4cb5-9735-87c2dbf50d5d_601x243.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bRQ4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92d3bba-f3c8-4cb5-9735-87c2dbf50d5d_601x243.png" width="601" height="243" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d92d3bba-f3c8-4cb5-9735-87c2dbf50d5d_601x243.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:243,&quot;width&quot;:601,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:31481,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.helix.ml/i/189022558?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92d3bba-f3c8-4cb5-9735-87c2dbf50d5d_601x243.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bRQ4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92d3bba-f3c8-4cb5-9735-87c2dbf50d5d_601x243.png 424w, https://substackcdn.com/image/fetch/$s_!bRQ4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92d3bba-f3c8-4cb5-9735-87c2dbf50d5d_601x243.png 848w, 
https://substackcdn.com/image/fetch/$s_!bRQ4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92d3bba-f3c8-4cb5-9735-87c2dbf50d5d_601x243.png 1272w, https://substackcdn.com/image/fetch/$s_!bRQ4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92d3bba-f3c8-4cb5-9735-87c2dbf50d5d_601x243.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2><strong>Implementation</strong></h2><p>The system has four components working together:</p><ol><li><p><strong>Docker wrapper</strong>
&#8212; installed at <code>/usr/local/bin/docker</code> in each desktop container. Intercepts <code>docker build</code>, <code>docker buildx build</code>, and <code>docker compose build</code>. Routes builds through shared BuildKit, applies smart <code>--load</code> with registry acceleration, decomposes compose builds into individual smart builds. Falls back to tarball <code>--load</code> if the registry is unavailable.</p></li><li><p><strong>Shared BuildKit + Registry</strong> (<code>api/pkg/hydra/manager.go</code>) &#8212; Hydra starts a <code>helix-buildkit</code> container (shared build cache) and a <code>helix-registry</code> container (layer-level transfer) at the sandbox level. Both are on the same Docker network as desktop containers. BuildKit is configured to trust the insecure registry for push operations.</p></li><li><p><strong>Init script</strong> (<code>desktop/shared/17-start-dockerd.sh</code>) &#8212; configures the desktop container&#8217;s dockerd to trust the insecure registry and exports <code>HELIX_REGISTRY</code> and <code>BUILDX_BUILDER</code> globally so the wrapper knows where to push/pull and which builder to use.</p></li><li><p><strong>Golden build service</strong> (<code>api/pkg/services/golden_build_service.go</code>, <code>api/pkg/hydra/golden.go</code>) &#8212; manages golden cache lifecycle. The API-side service triggers builds on merge-to-main, tracks build status in project metadata, and debounces concurrent builds. The Hydra-side code handles golden directory management, session-to-golden promotion, and the <code>cp -a</code> copy on session startup.</p></li></ol><p>The wrapper is generic &#8212; it works for any <code>docker build</code> workload, not just Helix. It auto-detects whether the active builder is remote, and only applies smart --load when it is. 
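</p><p>The interception pattern itself is ordinary shell dispatch on the first subcommand. A stripped-down sketch of that idea (stub functions stand in for the real BuildKit routing and the real docker binary; this is not the actual wrapper):</p>

```shell
# Stubbed sketch of a PATH-shim docker wrapper. smart_build and
# real_docker are placeholders for the real routing and binary.
smart_build() { echo "smart-build $*"; }
real_docker() { echo "passthrough $*"; }

docker_shim() {
  case "$1" in
    build)
      shift
      smart_build "$@" ;;                 # route through shared BuildKit
    buildx|compose)
      # `docker buildx build` / `docker compose build` also get routed
      if [ "$2" = "build" ]; then
        sub=$1; shift 2
        smart_build "--via=$sub" "$@"
      else
        real_docker "$@"                  # e.g. `docker compose up`
      fi ;;
    *)
      real_docker "$@" ;;                 # everything else passes through
  esac
}

out=$(docker_shim build -t app .)         # -> smart-build -t app .
pass=$(docker_shim ps -a)                 # -> passthrough ps -a
```

<p>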
On a standard local Docker setup, it&#8217;s a transparent passthrough.</p><h2><strong>What We Built</strong></h2><p>We started with a simple problem &#8212; Docker builds are slow when every agent starts cold &#8212; and ended up building something genuinely interesting: a multi-layered caching system that operates transparently across nested Docker daemons, shared build caches, and per-project golden snapshots.</p><p>The numbers tell the story:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!K-en!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0418331-ecc3-4f1e-b9f9-8ab009136a95_948x241.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K-en!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0418331-ecc3-4f1e-b9f9-8ab009136a95_948x241.png 424w, https://substackcdn.com/image/fetch/$s_!K-en!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0418331-ecc3-4f1e-b9f9-8ab009136a95_948x241.png 848w, https://substackcdn.com/image/fetch/$s_!K-en!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0418331-ecc3-4f1e-b9f9-8ab009136a95_948x241.png 1272w, https://substackcdn.com/image/fetch/$s_!K-en!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0418331-ecc3-4f1e-b9f9-8ab009136a95_948x241.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!K-en!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0418331-ecc3-4f1e-b9f9-8ab009136a95_948x241.png" width="948" height="241" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e0418331-ecc3-4f1e-b9f9-8ab009136a95_948x241.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:241,&quot;width&quot;:948,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43415,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.helix.ml/i/189022558?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0418331-ecc3-4f1e-b9f9-8ab009136a95_948x241.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!K-en!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0418331-ecc3-4f1e-b9f9-8ab009136a95_948x241.png 424w, https://substackcdn.com/image/fetch/$s_!K-en!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0418331-ecc3-4f1e-b9f9-8ab009136a95_948x241.png 848w, https://substackcdn.com/image/fetch/$s_!K-en!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0418331-ecc3-4f1e-b9f9-8ab009136a95_948x241.png 1272w, https://substackcdn.com/image/fetch/$s_!K-en!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0418331-ecc3-4f1e-b9f9-8ab009136a95_948x241.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a></figure></div><p>An agent can now start working on a project in under 30 seconds, regardless of whether it&#8217;s the first session or the hundredth. The difference between 45 minutes and 14 seconds isn&#8217;t incremental &#8212; it changes what&#8217;s practical. Agents can spin up, do focused work, and tear down without the overhead dominating the task. Short-lived sessions become viable. Parallel agents become economical.</p><p>And the best part: it&#8217;s all transparent. Build scripts, Makefiles, docker-compose files &#8212; none of them changed.
The wrapper intercepts standard Docker commands and applies the optimizations automatically. Projects opt into golden cache warming with a single toggle, and the system handles the rest.</p>]]></content:encoded></item><item><title><![CDATA[GPU Virtualization Architecture for Multi-Desktop Containers]]></title><description><![CDATA[I thought we'd have this working by now...]]></description><link>https://blog.helix.ml/p/gpu-virtualization-architecture-for</link><guid isPermaLink="false">https://blog.helix.ml/p/gpu-virtualization-architecture-for</guid><dc:creator><![CDATA[Luke Marsden]]></dc:creator><pubDate>Mon, 16 Feb 2026 10:38:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!yXa1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33a5d818-b060-45ab-840b-3cac6aa75098_4064x2160.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yXa1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33a5d818-b060-45ab-840b-3cac6aa75098_4064x2160.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yXa1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33a5d818-b060-45ab-840b-3cac6aa75098_4064x2160.png 424w, https://substackcdn.com/image/fetch/$s_!yXa1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33a5d818-b060-45ab-840b-3cac6aa75098_4064x2160.png 848w, 
https://substackcdn.com/image/fetch/$s_!yXa1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33a5d818-b060-45ab-840b-3cac6aa75098_4064x2160.png 1272w, https://substackcdn.com/image/fetch/$s_!yXa1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33a5d818-b060-45ab-840b-3cac6aa75098_4064x2160.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yXa1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33a5d818-b060-45ab-840b-3cac6aa75098_4064x2160.png" width="1456" height="774" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/33a5d818-b060-45ab-840b-3cac6aa75098_4064x2160.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:774,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2534754,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.helix.ml/i/188124846?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33a5d818-b060-45ab-840b-3cac6aa75098_4064x2160.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yXa1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33a5d818-b060-45ab-840b-3cac6aa75098_4064x2160.png 424w, 
https://substackcdn.com/image/fetch/$s_!yXa1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33a5d818-b060-45ab-840b-3cac6aa75098_4064x2160.png 848w, https://substackcdn.com/image/fetch/$s_!yXa1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33a5d818-b060-45ab-840b-3cac6aa75098_4064x2160.png 1272w, https://substackcdn.com/image/fetch/$s_!yXa1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33a5d818-b060-45ab-840b-3cac6aa75098_4064x2160.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2><strong>Overview</strong></h2><p>Helix Desktop runs multiple isolated Linux desktop environments (each with its own GNOME Shell, IDE, and browser) inside a single QEMU virtual machine on Apple Silicon Macs. Each desktop gets its own virtual GPU output, H.264 video stream, and DRM lease &#8212; all sharing one physical GPU through virtio-gpu with Vulkan passthrough via Venus/virglrenderer.</p><p>This document describes the full architecture from silicon to pixel, and the deadlock bugs we found and fixed when scaling from 1-2 desktops to 4+.</p><h2><strong>Why This Matters</strong></h2><p>AI agents are getting good enough to write real code, but they still need somewhere to run it. Not just a terminal &#8212; a full desktop environment with a browser for testing, an IDE for human pair programmers, and GPU acceleration for anything graphical (and hardware video encoding for low latency when a user wants to pair with them over the network). And when you have a team of agents working on different tasks, each one needs its own isolated sandbox so they don&#8217;t step on each other&#8217;s files, processes, or state.</p><p>Helix Desktop gives every agent its own full Linux desktop &#8212; running in an isolated container with GPU acceleration. Humans can watch what their agents are doing in real time via H.264 video streams, jump in to collaborate through the same desktop interface, and manage their flock of agents from a mobile phone while on the go.
Think of it as giving each agent their own workstation in a virtual office, where you can glance at any screen and tap on it to intervene.</p><p>This architecture also enables new human-computer interaction patterns: commentable spec-driven development where a human writes requirements in a Google Docs-style document, agents immediately update their design docs in response to comments, and the human reviews and redirects &#8212; all happening concurrently across multiple agent desktops. The agents work in parallel, each in their own sandbox, while the human herds the flock.</p><p>The hard technical problem: running 4+ GPU-accelerated desktops simultaneously inside a single QEMU virtual machine on Apple Silicon, sharing one physical GPU, without them deadlocking each other. That&#8217;s what this document is about.</p><h2><strong>The Stack</strong></h2><pre><code><code>Browser (WebSocket H.264 client)
    |
Helix Frame Export (VideoToolbox H.264, per-scanout)
    |
QEMU virtio-gpu device model (fence_poll, process_cmdq, scanout management)
    |
virglrenderer (Venus proxy &#8212; Vulkan API translation, runs as separate process)
    |
Apple Metal / ParavirtualizedGraphics (actual GPU execution)
    |
Apple M-series GPU silicon
</code></code></pre><p>On the guest side:</p><pre><code><code>Container (gnome-shell + Zed IDE + browser)
    |
DRM lease FD (connector + CRTC + planes)
    |
virtio-gpu kernel driver (DMA fences, GEM objects, atomic modesetting)
    |
virtio control queue (1024-entry ring buffer to QEMU)
</code></code></pre><h2><strong>Layer 1: The Virtio Control Queue</strong></h2><p>The guest Linux kernel&#8217;s <code>virtio_gpu</code> driver communicates with QEMU through a virtio virtqueue &#8212; a shared-memory ring buffer. The guest writes command descriptors (create resource, submit 3D command batch, map blob, set scanout, etc.) and kicks the queue. QEMU receives the kick as a vmexit on Apple&#8217;s Hypervisor.framework, pops commands from the ring, and processes them.</p><p>There are two queues: <strong>control</strong> (all GPU commands) and <strong>cursor</strong> (cursor image updates). The control queue is the bottleneck.</p><p><strong>Sizing matters.</strong> The default queue size is 256 entries for 2D mode, which we increased to 1024 (the virtio maximum) for 3D/GL mode. With 4 gnome-shells each submitting GPU commands continuously, 256 entries fills up. When the ring is full, guest threads block in <code>virtio_gpu_queue_ctrl_sgs</code> &#8212; a kernel spinwait that shows up as permanent D-state processes. 1024 entries gives enough headroom.</p><h3><strong>Command Response Flow</strong></h3><pre><code><code>Guest kernel                    QEMU (main thread)
============                    ==================
write cmd to ring
virtqueue_kick() &#9472;&#9472;vmexit&#9472;&#9472;&gt;    virtio_gpu_handle_ctrl_cb()
                                  qemu_bh_schedule(ctrl_bh)
                                    ...main loop iteration...
                                  virtio_gpu_gl_handle_ctrl()
                                    virtqueue_pop() -- dequeue all pending
                                    QTAILQ_INSERT_TAIL(&amp;cmdq)
                                    virtio_gpu_process_cmdq()
                                      for each cmd in cmdq:
                                        process_cmd(cmd)  -- dispatch
                                        if fenced: move to fenceq
                                        if finished: send response
                                    virtio_gpu_virgl_fence_poll()
                                      virgl_renderer_poll()  -- check GPU
                                      process_cmdq() again
                                      re-arm timer

&lt;&#9472;&#9472;interrupt&#9472;&#9472;                  virtio_notify()
dma_fence_signal()                (response written to reply ring)
</code></code></pre><p>The critical thing: <strong>every command gets exactly one response</strong>. The guest thread that submitted it blocks in the kernel until that response arrives as a virtio interrupt. If QEMU never processes the command, the guest thread blocks forever.</p><h2><strong>Layer 2: QEMU&#8217;s Command Processing Pipeline</strong></h2><p>QEMU maintains two queues:</p><ul><li><p><code>cmdq</code>: Commands popped from the virtio ring, waiting to be dispatched to virglrenderer</p></li><li><p><code>fenceq</code>: Commands that have been dispatched but are waiting for GPU completion (async)</p></li></ul><p>And one critical counter:</p><ul><li><p><code>renderer_blocked</code>: A global semaphore. When &gt;0, <code>process_cmdq()</code> refuses to process ANY command from ANY context.</p></li></ul><h3><strong>The </strong><code>renderer_blocked</code><strong> Problem</strong></h3><p><code>renderer_blocked</code> was designed for SPICE&#8217;s GL display path. When SPICE blits a frame to the client, it calls <code>graphic_hw_gl_block(true)</code> to pause GPU command processing until the client acknowledges the frame (<code>gl_draw_done</code>). This makes sense for a single display &#8212; you don&#8217;t want the GPU racing ahead while the display catches up.</p><p>But <code>renderer_blocked</code> is <strong>global across all scanouts</strong>. With 4 gnome-shells, if scanout 1&#8217;s SPICE client is slow to acknowledge, ALL four desktops freeze. Worse, blob resource unmaps (Venus uses these heavily for Vulkan memory management) were also incrementing <code>renderer_blocked</code> during their async RCU cleanup phase. With 4 contexts doing overlapping blob unmaps, the counter stayed &gt;0 perpetually.</p><p><strong>Fix</strong>: We removed <code>renderer_blocked</code> from the blob unmap path entirely. 
The suspended-command mechanism (<code>cmd_suspended</code> flag + <code>continue</code> in the FOREACH loop) already prevents the specific unmap command from re-executing before RCU completes, without blocking commands from other contexts. We also skip <code>dpy_gl_update</code> entirely on Apple builds (Helix frame export handles frame capture directly, bypassing SPICE).</p><h3><strong>The </strong><code>process_cmdq</code><strong> FIFO Blocking Problem</strong></h3><p>The original <code>process_cmdq</code> used <code>QTAILQ_FIRST</code> + <code>break</code> when it encountered a suspended command:</p><pre><code>// OLD (broken with 4+ contexts):
while (!QTAILQ_EMPTY(&amp;cmdq)) {
    cmd = QTAILQ_FIRST(&amp;cmdq);
    process_cmd(cmd);
    if (cmd_suspended) break;  // STOPS ALL PROCESSING
    ...
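    /* FIX (sketch; identifiers assumed from the snippet above): iterate with
     * QTAILQ_FOREACH_SAFE and continue past a suspended command, so it stays
     * queued without stalling the commands behind it:
     *
     *     QTAILQ_FOREACH_SAFE(cmd, &amp;cmdq, next, tmp) {
     *         process_cmd(cmd);
     *         if (cmd_suspended) continue;  // skip it, keep draining
     *         ...
     *     }
     */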
}</code></pre><p>A single suspended blob unmap from context 1 would block commands from contexts 2, 3, and 4 that are sitting later in the queue.</p><p><strong>Fix</strong>: Changed to <code>QTAILQ_FOREACH_SAFE</code> with <code>continue</code> &#8212; suspended commands stay in the queue but later commands are processed normally.</p><h2><strong>Layer 3: Fences and the Poll Timer</strong></h2><p>When a guest submits a GPU command with <code>VIRTIO_GPU_FLAG_FENCE</code>, QEMU dispatches it to virglrenderer and moves it to <code>fenceq</code>. The command stays there until virglrenderer reports that the GPU finished the work.</p><p>virglrenderer reports fence completion via a callback (<code>virgl_write_fence</code>), but this callback only fires when QEMU calls <code>virgl_renderer_poll()</code>. And <code>virgl_renderer_poll()</code> only gets called from two places:</p><ol><li><p><code>handle_ctrl</code> &#8212; when the guest kicks the virtqueue (submits new commands)</p></li><li><p><code>fence_poll</code> &#8212; a periodic timer callback</p></li></ol><p>The <code>fence_poll</code> timer is supposed to fire every 10ms (100 Hz). Each invocation:</p><ol><li><p>Calls <code>virgl_renderer_poll()</code> &#8212; asks virglrenderer &#8220;any fences done?&#8221;</p></li><li><p>Calls <code>process_cmdq()</code> &#8212; processes any queued commands</p></li><li><p>Re-arms itself for 10ms later</p></li></ol><h3><strong>Why the Timer Matters</strong></h3><p>Without <code>fence_poll</code>, fence completions only get checked when the guest submits new commands (via <code>handle_ctrl</code>). But if the guest is <em>waiting</em> for a fence to complete before submitting the next command, there&#8217;s a circular dependency:</p><pre><code><code>Guest: "I'll submit my next command after fence 42 completes"
QEMU:  "I'll check if fence 42 completed when I get the next command"
</code></code></pre><p>The timer breaks this cycle by polling independently.</p><h3><strong>The Virtual Clock Problem</strong></h3><p>The original code used <code>QEMU_CLOCK_VIRTUAL</code> for the timer. This clock tracks virtual CPU time &#8212; it <strong>stops advancing when all vCPUs are halted</strong> (executing WFI/wait-for-interrupt). When all guest threads are blocked on GPU fences, all vCPUs eventually enter WFI, the virtual clock stops, and <code>fence_poll</code> never fires. The fences never complete, the vCPUs never wake up &#8212; permanent deadlock.</p><p><strong>Fix</strong>: Switch to <code>QEMU_CLOCK_REALTIME</code> which always advances regardless of vCPU state. Also make the timer unconditionally re-arm (the original code only re-armed when there was work to do, but there was a race window between &#8220;work arrives&#8221; and &#8220;timer checks&#8221;).</p><h3><strong>The Mystery: REALTIME Timer Still Doesn&#8217;t Fire</strong></h3><p>After switching to <code>QEMU_CLOCK_REALTIME</code>, <code>fence_poll</code> still shows zero hits in 1-second process samples (782 samples at 1ms intervals). Meanwhile, <code>gui_update</code> &#8212; also a REALTIME timer &#8212; fires 3-4 times per second from the exact same <code>timerlist_run_timers</code> call path. Both timers are created with <code>timer_new_ms(QEMU_CLOCK_REALTIME, ...)</code> so they should be on the same timerlist. We confirmed via QEMU logs that <code>virtio_gpu_virgl_init</code> runs (twice, due to a guest driver reset/re-init cycle) and reaches the <code>timer_new_ms</code> + <code>timer_mod</code> calls.</p><p>The QEMU main loop thread spends 768/782 samples idle in <code>g_poll</code> &#8594; <code>__select</code>. During the 14 active samples, 3 go through <code>qemu_clock_run_all_timers</code> &#8594; <code>timerlist_run_timers</code> &#8594; <code>gui_update</code>. Zero go through <code>fence_poll</code>. 
All 14 vCPU threads show heavy BQL contention (25-60% of samples in <code>bql_lock_impl</code>).</p><p>The QEMU logs show <code>Blocked re-entrant IO on MemoryRegion: virtio-pci-notify-virtio-gpu</code> which means <code>virtio_notify()</code> &#8212; called from <code>process_cmdq()</code> &#8594; <code>virtio_gpu_ctrl_response()</code> when completing a command &#8212; is hitting QEMU&#8217;s memory region re-entrancy guard. The guard silently returns <code>MEMTX_ACCESS_ERROR</code>, dropping the guest notification. This could cascade: if the dropped notification means a guest interrupt never fires, the guest thread stays blocked, the vCPU stays in WFI, and the circular dependency persists.</p><p>However, this doesn&#8217;t explain why the timer itself doesn&#8217;t fire. The re-entrancy affects notifications inside <code>process_cmdq</code>, not the timer scheduling. The timer should fire regardless of what happens inside its callback &#8212; the callback runs, re-arms via <code>timer_mod</code>, and the main loop picks it up next iteration.</p><p>Root cause remains unknown. The difference between <code>gui_update</code> (fires) and <code>fence_poll</code> (doesn&#8217;t fire) may be related to when the timer is created: <code>gui_update</code> is created during display initialization before the main loop starts, while <code>fence_poll</code> is created lazily during <code>handle_ctrl</code> (first virtqueue kick) after the main loop is already running. There may be a timer registration race in QEMU&#8217;s GLib integration.</p><h3><strong>The Workaround: Thread-Based Fence Polling</strong></h3><p>Rather than continuing to debug QEMU&#8217;s timer internals, we bypass the timer system entirely with a dedicated thread:</p><pre><code>/* Thread function &#8212; runs independently of QEMU&#8217;s main loop */
static void *fence_poll_thread_fn(void *opaque)
{
    VirtIOGPU *g = opaque;
    VirtIOGPUGL *gl = VIRTIO_GPU_GL(g);

    while (gl-&gt;fence_poll_thread_running) {
        g_usleep(10000); /* 10ms = 100 Hz */
        qemu_bh_schedule(gl-&gt;fence_poll_bh);
    }
    return NULL;
}
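
/* Setup (sketch; the caller and the QemuThread field are assumptions):
 * create the BH on the main loop, then spawn the polling thread. */
static void fence_poll_thread_start(VirtIOGPU *g)
{
    VirtIOGPUGL *gl = VIRTIO_GPU_GL(g);

    gl-&gt;fence_poll_bh = qemu_bh_new(fence_poll_bh_cb, g);
    gl-&gt;fence_poll_thread_running = true;
    qemu_thread_create(&amp;gl-&gt;fence_poll_thread, "fence-poll",
                       fence_poll_thread_fn, g, QEMU_THREAD_JOINABLE);
}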

/* BH callback &#8212; runs on main loop thread with BQL held */
static void fence_poll_bh_cb(void *opaque)
{
    VirtIOGPU *g = opaque;
    virgl_renderer_poll();
    virtio_gpu_process_cmdq(g);
}</code></pre><p>The thread does nothing except sleep 10ms and schedule a bottom-half (BH) on QEMU&#8217;s main loop. <code>qemu_bh_schedule()</code> is documented as thread-safe &#8212; it writes to an eventfd that wakes the main loop from its <code>g_poll</code>. The BH dispatches on the main thread via <code>aio_ctx_dispatch</code> with BQL held, which is the correct context for <code>virgl_renderer_poll()</code> and <code>process_cmdq()</code>.</p><p>This is robust because:</p><ul><li><p><code>g_usleep</code> always works (no dependency on QEMU&#8217;s timer system)</p></li><li><p><code>qemu_bh_schedule</code> always works (we see BH dispatch in the process samples)</p></li><li><p>BH dispatch is the same mechanism used for virtio command processing</p></li><li><p>The original QEMU timer is kept as a secondary fallback &#8212; if it ever fires, extra <code>virgl_renderer_poll</code> calls are harmless</p></li></ul><h2><strong>Layer 4: virglrenderer and Venus</strong></h2><p>virglrenderer translates Vulkan API calls from the guest into native Metal API calls on the host. It runs as a <strong>separate process</strong> (proxy mode) communicating with QEMU over a Unix socket. Each guest GPU context (one per gnome-shell) gets its own virglrenderer thread.</p><p>The flow:</p><ol><li><p>Guest Mesa driver makes Vulkan calls</p></li><li><p>Venus (Vulkan-on-virtio-gpu protocol) serializes them into virtio-gpu <code>SUBMIT_CMD</code> batches</p></li><li><p>QEMU dispatches batches to virglrenderer via <code>virgl_renderer_submit_cmd()</code></p></li><li><p>virglrenderer deserializes and calls Metal/MoltenVK equivalents</p></li><li><p>When GPU work completes, virglrenderer reports via <code>virgl_write_fence()</code> callback</p></li></ol><p>Venus heavily uses <strong>blob resources</strong> &#8212; guest-visible GPU memory objects. Creating and destroying these involves <code>RESOURCE_CREATE_BLOB</code> and <code>RESOURCE_UNMAP_BLOB</code> commands. 
The unmap path is particularly tricky because it requires RCU (read-copy-update) synchronization to safely remove memory regions, which is what led to the suspended-command mechanism.</p><h2><strong>Layer 5: DRM Leases</strong></h2><p>Each agent&#8217;s container needs exclusive access to a virtual GPU output &#8212; its own screen, essentially. When a human starts a new agent session, the system needs to dynamically provision a virtual display, hand it to the agent&#8217;s container, and start streaming video from it. When the agent&#8217;s session ends (or crashes), the display is reclaimed and recycled. This has to work for 15+ concurrent agents on a single machine.</p><p>Linux DRM leases provide the isolation primitive: the DRM master can carve off subsets of its resources and hand them to clients as independent DRM file descriptors.</p><p>The <strong>helix-drm-manager</strong> runs as a systemd service on the guest VM:</p><ol><li><p>Opens <code>/dev/dri/card0</code> as DRM master</p></li><li><p>Enumerates connectors and CRTCs (virtio-gpu creates 16 virtual outputs)</p></li><li><p>Listens on a Unix socket for lease requests from containers</p></li><li><p>For each request:</p><ul><li><p>Allocates a scanout index (1-15; 0 is the VM console)</p></li><li><p>Tells QEMU to enable that scanout (TCP message to frame export server)</p></li><li><p>Creates a DRM lease (connector + CRTC + primary plane + cursor plane)</p></li><li><p>Sends the lease FD to the container via <code>SCM_RIGHTS</code></p></li></ul></li><li><p>Monitors the connection &#8212; when the container dies, automatically revokes the lease and disables the scanout</p></li></ol><h3><strong>The mode_config.mutex Deadlock</strong></h3><p>Two operations in the DRM manager acquired the kernel&#8217;s <code>mode_config.mutex</code>:</p><ol><li><p><code>activateCrtc</code> &#8212; <code>DRM_IOCTL_MODE_SETCRTC</code> on the master FD to pre-initialize the CRTC before handing the lease to 
mutter</p></li><li><p><code>reprobeConnector</code> &#8212; writing to <code>/sys/class/drm/card0-Virtual-N/status</code> to trigger connector detection</p></li></ol><p>Running gnome-shells also hold <code>mode_config.mutex</code> during atomic page flips (<code>drm_atomic_commit</code>). If a gnome-shell is mid-commit waiting for a GPU fence (which may be stalled due to the fence_poll issue), it holds the mutex indefinitely. The DRM manager trying to set up a new lease blocks on the same mutex, and all other gnome-shells&#8217; page flips cascade-block behind it.</p><p><strong>Fix</strong>: Removed both <code>activateCrtc</code> and <code>reprobeConnector</code>. QEMU&#8217;s <code>enableScanout</code> already triggers the guest hotplug event via <code>dpy_set_ui_info</code>, so the connector appears without explicit reprobe. Mutter can do its own initial modeset through the lease FD now that <code>DRM_CLIENT_CAP_UNIVERSAL_PLANES</code> is set on the master.</p><h2><strong>Layer 6: Frame Export and Video Streaming</strong></h2><p>The Helix frame export system (<code>helix-frame-export.m</code>) captures GPU frames directly from QEMU and encodes them as H.264 video:</p><ol><li><p><strong>Capture</strong>: When virglrenderer flushes a scanout, QEMU&#8217;s <code>virgl_cmd_resource_flush</code> calls into helix frame export. The frame&#8217;s Metal texture handle is extracted directly from virglrenderer&#8217;s native handle &#8212; zero CPU copies.</p></li><li><p><strong>Blit</strong>: The Metal texture is blitted to an IOSurface via EGL/GL. Triple buffering (3 IOSurface slots per scanout) allows VideoToolbox to encode asynchronously without blocking the GPU.</p></li><li><p><strong>Encode</strong>: Apple&#8217;s VideoToolbox hardware H.264 encoder compresses each IOSurface. 
The encode callback fires on a VT thread, which schedules a BH (bottom-half) on QEMU&#8217;s main thread to send the encoded frame.</p></li><li><p><strong>Send</strong>: Encoded NAL units are sent to subscribed clients over TCP sockets. Each client subscribes to a specific scanout. Frames are dropped (not queued) if the client&#8217;s send buffer is full &#8212; this prevents one slow client from affecting others.</p></li></ol><p>The frame export explicitly avoids <code>renderer_blocked</code> / <code>gl_block</code>. The old SPICE GL path used <code>renderer_blocked</code> for backpressure (pause GPU until client acknowledges frame), which is global and causes cross-scanout stalls. Instead, the frame export uses per-slot busy flags &#8212; if all 3 IOSurface slots for a scanout are busy with VT encoding, that scanout&#8217;s frames are dropped, but other scanouts continue normally.</p><h2><strong>Why Scaling Matters</strong></h2><p>A single desktop works fine. Two work fine. The deadlocks only appear at 4+ concurrent desktops &#8212; which is exactly the regime we need for production use. A developer working with a team of agents will routinely have 4-8 agents running simultaneously: one refactoring the backend, one writing frontend tests, one investigating a bug, one updating documentation. Each needs a responsive GPU-accelerated desktop. If starting the fourth agent freezes the other three, the product doesn&#8217;t work.</p><p>Every fix described above was discovered by starting 4 desktops in quick succession and tracing kernel stacks, QEMU process samples, and <code>/proc/interrupts</code> to find exactly where the system seized up. 
The bugs are all variations of the same theme: mechanisms designed for a single GPU context becoming global bottlenecks when shared across many.</p><h2><strong>Summary of Fixes</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LFP2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62664a35-e1cd-4dd1-a371-cdffda47204a_2062x868.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LFP2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62664a35-e1cd-4dd1-a371-cdffda47204a_2062x868.png 424w, https://substackcdn.com/image/fetch/$s_!LFP2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62664a35-e1cd-4dd1-a371-cdffda47204a_2062x868.png 848w, https://substackcdn.com/image/fetch/$s_!LFP2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62664a35-e1cd-4dd1-a371-cdffda47204a_2062x868.png 1272w, https://substackcdn.com/image/fetch/$s_!LFP2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62664a35-e1cd-4dd1-a371-cdffda47204a_2062x868.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LFP2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62664a35-e1cd-4dd1-a371-cdffda47204a_2062x868.png" width="1456" height="613" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62664a35-e1cd-4dd1-a371-cdffda47204a_2062x868.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:613,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:303900,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.helix.ml/i/188124846?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62664a35-e1cd-4dd1-a371-cdffda47204a_2062x868.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LFP2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62664a35-e1cd-4dd1-a371-cdffda47204a_2062x868.png 424w, https://substackcdn.com/image/fetch/$s_!LFP2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62664a35-e1cd-4dd1-a371-cdffda47204a_2062x868.png 848w, https://substackcdn.com/image/fetch/$s_!LFP2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62664a35-e1cd-4dd1-a371-cdffda47204a_2062x868.png 1272w, https://substackcdn.com/image/fetch/$s_!LFP2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62664a35-e1cd-4dd1-a371-cdffda47204a_2062x868.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Will we fix it? Stay tuned to find out :-D</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://discord.gg/VJftd844GE&quot;,&quot;text&quot;:&quot;Join the Beta when we get it working!&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://discord.gg/VJftd844GE"><span>Join the Beta when we get it working!</span></a></p><h2><strong>The Debugging Method</strong></h2><p>Every fix was discovered the same way: start 4 desktops in quick succession, then trace the freeze:</p><ol><li><p><code>/proc/interrupts</code> &#8212; check if GPU interrupt count (<code>virtio1-control</code>) is advancing. 
If frozen (same count 5 seconds apart), QEMU isn&#8217;t sending fence completions to the guest.</p></li><li><p><code>cat /proc/*/stack</code> &#8212; find D-state processes. gnome-shells stuck in <code>drm_modeset_lock</code> &#8594; <code>dma_fence_default_wait</code> means they&#8217;re waiting for GPU fences while holding <code>mode_config.mutex</code>. Anything stuck in <code>virtio_gpu_vram_mmap</code> means a synchronous MAP_BLOB is waiting for QEMU to process it.</p></li><li><p><code>sample &lt;pid&gt; 1</code> (macOS) &#8212; 1-second process sample of QEMU at 1ms intervals. Shows where every thread spends its time. The main loop thread should show <code>fence_poll</code> or <code>process_cmdq</code> hits; if it&#8217;s 100% in <code>g_poll</code>, nothing is processing GPU commands.</p></li><li><p><strong>Kernel hung task messages</strong> (serial console) &#8212; <code>task X:PID is blocked on a mutex likely owned by task Y:PID</code> directly identifies which process holds the contended lock.</p></li><li><p><strong>QEMU warnings</strong> &#8212; <code>Blocked re-entrant IO on MemoryRegion</code> means a <code>virtio_notify</code> was silently dropped, which means a guest never received a response for a command it&#8217;s waiting on.</p></li></ol>]]></content:encoded></item><item><title><![CDATA[Bringing AI to Where Your Enterprise Lives: Helix + Microsoft 365]]></title><description><![CDATA[How Helix integrates with SharePoint and Microsoft Teams to bring private GenAI directly into your existing workflows]]></description><link>https://blog.helix.ml/p/bringing-ai-to-where-your-enterprise</link><guid isPermaLink="false">https://blog.helix.ml/p/bringing-ai-to-where-your-enterprise</guid><dc:creator><![CDATA[Priya Samuel]]></dc:creator><pubDate>Tue, 30 Dec 2025 17:23:06 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!LdW8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1ed8ff-ac97-467a-becc-ccabfa958000_1024x1536.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Most enterprise AI tools miss the obvious: your knowledge is already in SharePoint, and your conversations happen in Teams. Why build another interface?</p><p>Helix now integrates natively with both SharePoint and Microsoft Teams&#8212;letting you build AI agents that understand your company&#8217;s documents and respond directly in the chat tools your teams already use.</p><h2>The Problem with Enterprise AI Silos</h2><p>The friction of opening another tab, another tool, another interface kills adoption. 
Meanwhile, your SharePoint libraries hold thousands of documents that could answer those questions instantly&#8212;if only the AI could access them.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LdW8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1ed8ff-ac97-467a-becc-ccabfa958000_1024x1536.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LdW8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1ed8ff-ac97-467a-becc-ccabfa958000_1024x1536.jpeg 424w, https://substackcdn.com/image/fetch/$s_!LdW8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1ed8ff-ac97-467a-becc-ccabfa958000_1024x1536.jpeg 848w, https://substackcdn.com/image/fetch/$s_!LdW8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1ed8ff-ac97-467a-becc-ccabfa958000_1024x1536.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!LdW8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1ed8ff-ac97-467a-becc-ccabfa958000_1024x1536.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LdW8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1ed8ff-ac97-467a-becc-ccabfa958000_1024x1536.jpeg" width="270" height="405" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4d1ed8ff-ac97-467a-becc-ccabfa958000_1024x1536.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:1536,&quot;width&quot;:1024,&quot;resizeWidth&quot;:270,&quot;bytes&quot;:358354,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.helix.ml/i/181668742?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1ed8ff-ac97-467a-becc-ccabfa958000_1024x1536.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LdW8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1ed8ff-ac97-467a-becc-ccabfa958000_1024x1536.jpeg 424w, https://substackcdn.com/image/fetch/$s_!LdW8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1ed8ff-ac97-467a-becc-ccabfa958000_1024x1536.jpeg 848w, https://substackcdn.com/image/fetch/$s_!LdW8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1ed8ff-ac97-467a-becc-ccabfa958000_1024x1536.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!LdW8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1ed8ff-ac97-467a-becc-ccabfa958000_1024x1536.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Helix supports SharePoint as a RAG Knowledge Source. It connects your document libraries directly to your AI agents&#8217; knowledge bases using Microsoft Graph API. No manual file uploads, no sync scripts, no stale knowledge. 
The Teams integration closes the loop with human users: webhooks deliver their messages to the agents&#8217; streaming API endpoints, which run RAG over the SharePoint data.</p><h2>How It Works</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!q7K-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F657970f0-9012-42d4-86f4-24621ec12a25_1800x560.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!q7K-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F657970f0-9012-42d4-86f4-24621ec12a25_1800x560.png 424w, https://substackcdn.com/image/fetch/$s_!q7K-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F657970f0-9012-42d4-86f4-24621ec12a25_1800x560.png 848w, https://substackcdn.com/image/fetch/$s_!q7K-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F657970f0-9012-42d4-86f4-24621ec12a25_1800x560.png 1272w, https://substackcdn.com/image/fetch/$s_!q7K-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F657970f0-9012-42d4-86f4-24621ec12a25_1800x560.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!q7K-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F657970f0-9012-42d4-86f4-24621ec12a25_1800x560.png" width="1456" height="453" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/657970f0-9012-42d4-86f4-24621ec12a25_1800x560.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:453,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:55758,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.helix.ml/i/181668742?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F657970f0-9012-42d4-86f4-24621ec12a25_1800x560.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!q7K-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F657970f0-9012-42d4-86f4-24621ec12a25_1800x560.png 424w, https://substackcdn.com/image/fetch/$s_!q7K-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F657970f0-9012-42d4-86f4-24621ec12a25_1800x560.png 848w, https://substackcdn.com/image/fetch/$s_!q7K-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F657970f0-9012-42d4-86f4-24621ec12a25_1800x560.png 1272w, https://substackcdn.com/image/fetch/$s_!q7K-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F657970f0-9012-42d4-86f4-24621ec12a25_1800x560.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The SharePoint client handles the realities of enterprise deployments:</p><p>- Pagination: Automatically handles large document libraries with Microsoft Graph&#8217;s <code>@odata.nextLink</code></p><p>- Recursive traversal: Walks subfolder trees when you need deep document scanning</p><p>- Extension filtering: Only index what matters&#8212;skip those 50MB PowerPoint template decks</p><p>- TLS flexibility: Optional certificate verification bypass for environments with internal Certificate Authorities (yes, we know about your proxy!)</p><h2>Microsoft Teams: AI Where the Conversations Happen</h2><p>The Teams integration takes a different approach&#8212;instead of pulling data <em>from</em> Microsoft, it pushes AI responses <em>into</em> Teams conversations.</p><p>When a user @mentions your bot:</p><p>1. 
Teams sends the message to Microsoft&#8217;s Bot Framework Service</p><p>2. Bot Framework POSTs to your Helix deployment&#8217;s webhook endpoint</p><p>3. Helix processes the message through your configured agent</p><p>4. Response routes back through Bot Framework to the Teams client</p><h3>Conversation Threading</h3><p>Helix maintains conversation context across message threads. When a user asks a follow-up question, Helix retrieves the existing session and continues the conversation&#8212;no &#8220;I don&#8217;t have context from before&#8221; nonsense.</p><h2>Multi-Tenant Deployments</h2><p>Here&#8217;s where it gets interesting for enterprises: the Azure Bot and Teams app can live in different tenants.</p><p>Your Azure Bot registration lives in your IT tenant with all its security controls. But you can deploy the Teams app manifest to a customer or partner tenant&#8212;Microsoft&#8217;s Bot Framework routes messages regardless of tenant boundaries.</p><p>This means:</p><ul><li><p>Managed Service Providers (MSPs) can build AI agents that serve multiple customer tenants</p></li><li><p>Enterprises with multiple O365 tenants can centralise their AI infrastructure</p></li></ul><h2>Putting It Together: The Full Pattern</h2><p>The real power comes from combining both integrations. Consider this scenario:</p><h3>HR Policy Bot</h3><ol><li><p>SharePoint knowledge source indexes your HR policy documents from <code>https://corporate.sharepoint.com/sites/HR/Policies</code></p></li><li><p>Teams bot installed in company-wide Teams</p></li><li><p>Employee asks in #general: &#8220;@PolicyBot what&#8217;s the parental leave policy for adoptions?&#8221;</p></li><li><p>Helix RAG retrieves relevant chunks from the indexed SharePoint documents</p></li><li><p>AI generates response with citations back to the source documents</p></li><li><p>Response appears in Teams thread&#8212;complete with links to the original SharePoint files</p></li></ol><p>No portal. No context switching. 
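As a taste of the indexing side, the pagination behaviour described above (keep following Microsoft Graph's <code>@odata.nextLink</code> until the server stops returning one) boils down to a small loop. A sketch in TypeScript &#8212; the function names and the injected <code>fetchPage</code> are illustrative, not Helix's actual API:

```typescript
// Sketch of Graph-style pagination: follow @odata.nextLink until it
// disappears. fetchPage is injected so the loop can be exercised
// against a mock instead of a live SharePoint tenant.
type Page = { value: string[]; "@odata.nextLink"?: string };

async function listAllItems(
  firstUrl: string,
  fetchPage: (url: string) => Promise<Page>,
): Promise<string[]> {
  const items: string[] = [];
  let url: string | undefined = firstUrl;
  while (url) {
    const page = await fetchPage(url);
    items.push(...page.value);       // accumulate this page's items
    url = page["@odata.nextLink"];   // undefined on the last page
  }
  return items;
}
```

The same loop works for any Graph collection endpoint; only the page-fetching function changes.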
Just answers where people already ask questions.</p><h3>Security Considerations</h3><p>Both integrations use OAuth-based authentication:</p><ul><li><p>SharePoint: Uses Microsoft Graph API with <code>Sites.Read.All</code> and <code>Files.Read.All</code> scopes</p></li><li><p>Teams: Bot Framework handles JWT validation on incoming webhooks</p></li><li><p>Credentials: App secrets stored encrypted in Helix&#8217;s database</p></li><li><p>Tenant restriction: Optional tenant ID filtering to lock down bot access</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;56c8d1c3-1ee9-4191-ae54-4572969bfe35&quot;,&quot;duration&quot;:null}"></div></li></ul><h1>What&#8217;s Next</h1><p>Enterprise AI shouldn&#8217;t require employees to learn new tools. It should meet them where they already work. With Helix&#8217;s Microsoft integrations, your AI agents live inside SharePoint and Teams. The integration runs in the background; employees just get faster answers.</p><p>Full setup guides are available in our documentation.</p><h3>Psst - try Helix Code</h3><p>If you're interested in AI-powered development tools, check out <a href="http://helix.ml/code">Helix Code</a> - our upcoming platform for AI-assisted coding.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.helix.ml/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading HelixML! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[We Mass-Deployed 15-Year-Old Screen Sharing Technology and It's Actually Better]]></title><description><![CDATA[Or: How JPEG Screenshots Defeated Our Beautiful H.264 WebCodecs Pipeline]]></description><link>https://blog.helix.ml/p/we-mass-deployed-15-year-old-screen</link><guid isPermaLink="false">https://blog.helix.ml/p/we-mass-deployed-15-year-old-screen</guid><dc:creator><![CDATA[Luke Marsden]]></dc:creator><pubDate>Thu, 18 Dec 2025 17:13:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!WSwQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93791d6f-fef0-4838-b397-a708f2edd0b1_1956x614.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Part 2 of our video streaming saga. <a href="https://blog.helix.ml/p/we-killed-webrtc-and-nobody-noticed">Read Part 1: How we replaced WebRTC with WebSockets &#8594;</a></em></p><h2>The Year is 2025 and We&#8217;re Sending JPEGs</h2><p>Let me tell you about the time we spent three months building a gorgeous, hardware-accelerated, WebCodecs-powered, 60fps H.264 streaming pipeline over WebSockets...</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.helix.ml/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading HelixML! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>...and then replaced it with <code>grim | curl</code> when the WiFi got a bit sketchy.</p><p>I wish I was joking.</p><div><hr></div><h2>Act I: Hubris (Also Known As &#8220;Enterprise Networking Exists&#8221;)</h2><p>We&#8217;re building <a href="https://github.com/helixml/helix">Helix</a>, an AI platform where autonomous coding agents work in cloud sandboxes. Users need to watch their AI assistants work. Think &#8220;screen share, but the thing being shared is a robot writing code.&#8221;</p><p>Last week, we explained how we replaced WebRTC with a custom WebSocket streaming pipeline. This week: why that wasn&#8217;t enough.</p><p><strong>The constraint that ruined everything:</strong> It has to work on enterprise networks.</p><p>You know what enterprise networks love? HTTP. HTTPS. Port 443. That&#8217;s it. That&#8217;s the list.</p><p>You know what enterprise networks hate?</p><ul><li><p><strong>UDP</strong> &#8212; Blocked. Deprioritized. Dropped. &#8220;Security risk.&#8221;</p></li><li><p><strong>WebRTC</strong> &#8212; Requires TURN servers, which requires UDP, which is blocked</p></li><li><p><strong>Custom ports</strong> &#8212; Firewall says no</p></li><li><p><strong>STUN/ICE</strong> &#8212; NAT traversal? In <em>my</em> corporate network? Absolutely not</p></li><li><p><strong>Literally anything fun</strong> &#8212; Denied by policy</p></li></ul><p>We tried WebRTC first. Worked great in dev. Worked great in our cloud. Deployed to an enterprise customer.</p><p>&#8220;The video doesn&#8217;t connect.&#8221;</p><p><em>checks network</em> &#8212; Outbound UDP blocked. 
TURN server unreachable. ICE negotiation failing.</p><p>We could fight this. Set up TURN servers. Configure enterprise proxies. Work with IT departments.</p><p>Or we could accept reality: <strong>Everything must go through HTTPS on port 443.</strong></p><p>So we built a <strong>pure WebSocket video pipeline</strong>:</p><ul><li><p>H.264 encoding via GStreamer + VA-API (hardware acceleration, baby)</p></li><li><p>Binary frames over WebSocket (L7 only, works through any proxy)</p></li><li><p>WebCodecs API for hardware decoding in the browser</p></li><li><p>60fps at 40Mbps with sub-100ms latency</p></li></ul><p>We were so proud. We wrote Rust. We wrote TypeScript. We implemented our own binary protocol. We measured things in microseconds.</p><p><strong>Then someone tried to use it from a coffee shop.</strong></p><div><hr></div><h2>Act II: Denial</h2><p>&#8220;The video is frozen.&#8221;</p><p>&#8220;Your WiFi is bad.&#8221;</p><p>&#8220;No, the video is definitely frozen. And now my keyboard isn&#8217;t working.&#8221;</p><p><em>checks the video</em></p><p>It&#8217;s showing what the AI was doing 30 seconds ago. And the delay is growing.</p><p>Turns out, 40Mbps video streams don&#8217;t appreciate 200ms+ network latency. Who knew.</p><p>When the network gets congested:</p><ol><li><p>Frames buffer up in the TCP/WebSocket layer</p></li><li><p>They arrive in-order (thanks TCP!) but increasingly delayed</p></li><li><p>Video falls further and further behind real-time</p></li><li><p>You&#8217;re watching the AI type code from 45 seconds ago</p></li><li><p>By the time you see a bug, the AI has already committed it to main</p></li><li><p>Everything is terrible forever</p></li></ol><p>&#8220;Just lower the bitrate,&#8221; you say. Great idea. 
Now it&#8217;s 10Mbps of blocky garbage that&#8217;s <em>still</em> 30 seconds behind.</p><div><hr></div><h2>Act III: Bargaining</h2><p>We tried everything:</p><p><strong>&#8220;What if we only send keyframes?&#8221;</strong></p><p>This was our big brain moment. H.264 keyframes (IDR frames) are self-contained. No dependencies on previous frames. Just drop all the P-frames on the server side, send only keyframes, get ~1fps of corruption-free video. Perfect for low-bandwidth fallback!</p><p>We added a <code>keyframes_only</code> flag. We modified the video decoder to check <code>FrameType::Idr</code>. We set GOP to 60 (one keyframe per second at 60fps). We tested.</p><p>We got exactly ONE frame.</p><p>One single, beautiful, 1080p IDR frame. Then silence. Forever.</p><pre><code><code>[WebSocket] Keyframe received (frame 121), sending
[WebSocket] ...
[WebSocket] ...
[WebSocket] It's been 14 seconds why is nothing else coming
[WebSocket] Failed to send audio frame: Closed</code></code></pre><p><em>checks Wolf logs</em> &#8212; encoder still running</p><p><em>checks GStreamer pipeline</em> &#8212; frames being produced</p><p><em>checks Moonlight protocol layer</em> &#8212; <strong>nothing coming through</strong></p><p>We&#8217;re using <a href="https://games-on-whales.github.io/wolf/stable/">Wolf</a>, an excellent open-source game streaming server (seriously, the documentation is great). But our WebSocket streaming layer sits on top of the Moonlight protocol, which is reverse-engineered from NVIDIA GameStream. Somewhere in that protocol stack, <em>something</em> decides that if you&#8217;re not consuming P-frames, you&#8217;re not ready for more frames. Period.</p><p>We poked around for an hour or two, but without diving deep into the Moonlight protocol internals, we weren&#8217;t going to fix this. The protocol wanted all its frames, or no frames at all.</p><p><strong>&#8220;What if we implement proper congestion control?&#8221;</strong></p><p><em>looks at TCP congestion control literature</em></p><p><em>closes tab</em></p><p><strong>&#8220;What if we just... don&#8217;t have bad WiFi?&#8221;</strong></p><p><em>stares at enterprise firewall that&#8217;s throttling everything</em></p><div><hr></div><h2>Act IV: Depression</h2><p>One late night, while debugging why the stream was frozen again, I opened our screenshot debugging endpoint in a browser tab:</p><pre><code><code>GET /api/v1/external-agents/abc123/screenshot?format=jpeg&amp;quality=70</code></code></pre><p>The image loaded instantly.</p><p>A pristine, 150KB JPEG of the remote desktop. Crystal clear. No artifacts. No waiting for keyframes. No decoder state. Just... pixels.</p><p>I refreshed. Another instant image.</p><p>I mashed F5 like a degenerate. 5 FPS of perfect screenshots.</p><p>I looked at my beautiful WebCodecs pipeline. I looked at the JPEGs. 
I looked at the WebCodecs pipeline again.</p><p>No.</p><p>No, we are not doing this.</p><p>We are professionals. We implement proper video codecs. We don&#8217;t spam HTTP requests for individual frames like it&#8217;s 2009.</p><div><hr></div><h2>Act V: Acceptance</h2><pre><code><code>// Poll screenshots as fast as possible (capped at 10 FPS max)
const fetchScreenshot = async () =&gt; {
  const response = await fetch(`/api/v1/external-agents/${sessionId}/screenshot`)
  const blob = await response.blob()
  const previousUrl = screenshotImg.src
  screenshotImg.src = URL.createObjectURL(blob)
  if (previousUrl.startsWith('blob:')) URL.revokeObjectURL(previousUrl) // don't leak one blob per frame
  setTimeout(fetchScreenshot, 100) // yolo
}</code></code></pre><p>We did it. We&#8217;re sending JPEGs.</p><p>And you know what? <strong>It works perfectly.</strong></p><div><hr></div><h2>Why JPEGs Actually Slap</h2><p>Here&#8217;s the thing about our fancy H.264 pipeline:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WSwQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93791d6f-fef0-4838-b397-a708f2edd0b1_1956x614.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WSwQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93791d6f-fef0-4838-b397-a708f2edd0b1_1956x614.png 424w, https://substackcdn.com/image/fetch/$s_!WSwQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93791d6f-fef0-4838-b397-a708f2edd0b1_1956x614.png 848w, https://substackcdn.com/image/fetch/$s_!WSwQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93791d6f-fef0-4838-b397-a708f2edd0b1_1956x614.png 1272w, https://substackcdn.com/image/fetch/$s_!WSwQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93791d6f-fef0-4838-b397-a708f2edd0b1_1956x614.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WSwQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93791d6f-fef0-4838-b397-a708f2edd0b1_1956x614.png" width="1456" height="457" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/93791d6f-fef0-4838-b397-a708f2edd0b1_1956x614.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:457,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:122027,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.helix.ml/i/182005849?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93791d6f-fef0-4838-b397-a708f2edd0b1_1956x614.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WSwQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93791d6f-fef0-4838-b397-a708f2edd0b1_1956x614.png 424w, https://substackcdn.com/image/fetch/$s_!WSwQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93791d6f-fef0-4838-b397-a708f2edd0b1_1956x614.png 848w, https://substackcdn.com/image/fetch/$s_!WSwQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93791d6f-fef0-4838-b397-a708f2edd0b1_1956x614.png 1272w, https://substackcdn.com/image/fetch/$s_!WSwQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93791d6f-fef0-4838-b397-a708f2edd0b1_1956x614.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A JPEG screenshot is <strong>self-contained</strong>. It either arrives complete, or it doesn&#8217;t. There&#8217;s no &#8220;partial decode.&#8221; There&#8217;s no &#8220;waiting for the next keyframe.&#8221; There&#8217;s no &#8220;decoder state corruption.&#8221;</p><p>When the network is bad, you get... fewer JPEGs. That&#8217;s it. The ones that arrive are perfect.</p><p>And the size! A 70% quality JPEG of a 1080p desktop is like <strong>100-150KB</strong>. A single H.264 keyframe is 200-500KB. We&#8217;re sending LESS data per frame AND getting better reliability.</p><div><hr></div><h2>The Hybrid: Have Your Cake and Eat It Too</h2><p>We didn&#8217;t throw away the H.264 pipeline. 
We&#8217;re not <em>complete</em> animals.</p><p>Instead, we built adaptive switching:</p><ol><li><p><strong>Good connection</strong> (RTT &lt; 150ms): Full 60fps H.264, hardware decoded, buttery smooth</p></li><li><p><strong>Bad connection detected</strong>: Pause video, switch to screenshot polling</p></li><li><p><strong>Connection recovers</strong>: User clicks to retry video</p></li></ol><p>The key insight: <strong>we still need the WebSocket for input</strong>.</p><p>Keyboard and mouse events are tiny. Like, 10 bytes each. The WebSocket handles those perfectly even on a garbage connection. We just needed to stop sending the massive video frames.</p><p>So we added one control message:</p><pre><code><code>{"set_video_enabled": false}</code></code></pre><p>Server receives this, stops sending video frames. Client polls screenshots instead. Input keeps flowing. Everyone&#8217;s happy.</p><p>15 lines of Rust. I am not joking.</p><pre><code><code>if !video_enabled.load(Ordering::Relaxed) {
    continue; // skip frame, it's screenshot time baby
}</code></code></pre><div><hr></div><h2>The Oscillation Problem (Lol)</h2><p>We almost shipped a hilarious bug.</p><p>When you stop sending video frames, the WebSocket becomes basically empty. Just tiny input events and occasional pings.</p><p><strong>The latency drops dramatically.</strong></p><p>Our adaptive mode sees low latency and thinks: &#8220;Oh nice! Connection recovered! Let&#8217;s switch back to video!&#8221;</p><p>Video resumes. 40Mbps floods the connection. Latency spikes. Mode switches to screenshots.</p><p>Latency drops. Mode switches to video.</p><p>Latency spikes. Mode switches to screenshots.</p><p><strong>Forever. Every 2 seconds.</strong></p><p>The fix was embarrassingly simple: once you fall back to screenshots, <strong>stay there until the user explicitly clicks to retry</strong>.</p><pre><code><code>setAdaptiveLockedToScreenshots(true) // no oscillation for you</code></code></pre><p>We show an amber icon and a message: &#8220;Video paused to save bandwidth. Click to retry.&#8221;</p><p>Problem solved. User is in control. No infinite loops.</p><h2>Ubuntu Doesn&#8217;t Ship JPEG Support in grim Because Of Course It Doesn&#8217;t</h2><pre><code><code>$ grim -t jpeg screenshot.jpg
error: jpeg support disabled</code></code></pre><p>Oh, you thought we were done? Cute.</p><p><code>grim</code> is a Wayland screenshot tool. Perfect for our needs. Supports JPEG output for smaller files.</p><p>Except Ubuntu compiles it without libjpeg.</p><p><em>incredible</em></p><p>So now our Dockerfile has a build stage that compiles grim from source:</p><pre><code><code>FROM ubuntu:25.04 AS grim-build
RUN apt-get update &amp;&amp; apt-get install -y meson ninja-build libjpeg-turbo8-dev ...
RUN git clone https://git.sr.ht/~emersion/grim &amp;&amp; \
    cd grim &amp;&amp; \
    meson setup build -Djpeg=enabled &amp;&amp; \
    ninja -C build</code></code></pre><p>We&#8217;re building a screenshot tool from source so we can send JPEGs in 2025. This is fine.</p><h2>The Final Architecture</h2><pre><code><code>&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
&#9474;                     User's Browser                          &#9474;
&#9500;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9508;
&#9474;  WebSocket (always connected)                               &#9474;
&#9474;  &#9500;&#9472;&#9472; Video frames (H.264) &#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472; when RTT &lt; 150ms     &#9474;
&#9474;  &#9500;&#9472;&#9472; Input events (keyboard/mouse) &#9472;&#9472; always                &#9474;
&#9474;  &#9492;&#9472;&#9472; Control messages &#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472; {"set_video_enabled"} &#9474;
&#9474;                                                             &#9474;
&#9474;  HTTP (screenshot polling) &#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472; when RTT &gt; 150ms    &#9474;
&#9474;  &#9492;&#9472;&#9472; GET /screenshot?quality=70                             &#9474;
&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;</code></code></pre><p><strong>Good connection:</strong> 60fps H.264, hardware accelerated, beautiful</p><p><strong>Bad connection:</strong> 2-10fps JPEGs, perfectly reliable, works everywhere</p><p>The screenshot quality adapts too:</p><ul><li><p>Frame took &gt;500ms? Drop quality by 10%</p></li><li><p>Frame took &lt;300ms? Increase quality by 5%</p></li><li><p>Target: minimum 2 FPS, always</p></li></ul><div><hr></div><h2>Lessons Learned</h2><ol><li><p><strong>Simple solutions often beat complex ones.</strong> Three months of H.264 pipeline work. One 2am hacking session the night before production deployment: &#8220;what if we just... screenshots?&#8221;</p></li><li><p><strong>Graceful degradation is a feature.</strong> Users don&#8217;t care about your codec. They care about seeing their screen and typing.</p></li><li><p><strong>WebSockets are for input, not necessarily video.</strong> The input path staying responsive is more important than video frames.</p></li><li><p><strong>Ubuntu packages are missing random features.</strong> Always check. Or just build from source like it&#8217;s 2005.</p></li><li><p><strong>Measure before optimizing.</strong> We assumed video streaming was the only option. 
It wasn&#8217;t.</p></li></ol><div><hr></div><h2>Try It Yourself</h2><p>Helix is source-available: <a href="https://github.com/helixml/helix">github.com/helixml/helix</a></p><p>The shameful-but-effective screenshot code:</p><ul><li><p><code>api/cmd/screenshot-server/main.go</code> &#8212; 200 lines of Go that changed everything</p></li><li><p><code>MoonlightStreamViewer.tsx</code> &#8212; React component with adaptive logic</p></li><li><p><code>websocket-stream.ts</code> &#8212; WebSocket client with <code>setVideoEnabled()</code></p></li></ul><p>The beautiful H.264 pipeline we&#8217;re still proud of:</p><ul><li><p><code>moonlight-web-stream/</code> &#8212; Rust WebSocket server</p></li><li><p>Still used when your WiFi doesn&#8217;t suck</p></li></ul><div><hr></div><p><em>We&#8217;re building Helix, open-source AI infrastructure that works in the real world &#8212; even on terrible WiFi. We started by <a href="https://blog.helix.ml/p/we-killed-webrtc-and-nobody-noticed">killing WebRTC</a>, then we killed our replacement. Sometimes the 15-year-old solution is the right one.</em></p><p>Want to experience the joy of interacting with an agent desktop at 6 JPEGs a second yourself? 
Join us for the private beta on Discord:</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://discord.gg/VJftd844GE&quot;,&quot;text&quot;:&quot;Join the Private Beta&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://discord.gg/VJftd844GE"><span>Join the Private Beta</span></a></p><p><em>Star us on GitHub: <a href="https://github.com/helixml/helix">github.com/helixml/helix</a></em></p>]]></content:encoded></item><item><title><![CDATA[We Killed WebRTC (And Nobody Noticed)]]></title><description><![CDATA[We replaced WebRTC with plain WebSockets for real-time GPU streaming and got lower latency, simpler infrastructure, and better reliability everywhere.]]></description><link>https://blog.helix.ml/p/we-killed-webrtc-and-nobody-noticed</link><guid isPermaLink="false">https://blog.helix.ml/p/we-killed-webrtc-and-nobody-noticed</guid><dc:creator><![CDATA[Luke Marsden]]></dc:creator><pubDate>Thu, 11 Dec 2025 23:09:26 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!uVK-!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6ac6823-53fa-4485-b35d-65c2770f5cb8_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At <a href="https://helix.ml">Helix</a>, we run AI coding agents in GPU-accelerated containers. Users watch these agents work through a live video stream&#8212;think remote desktop, but for AI. The standard solution for browser-based real-time video is WebRTC.</p><p>After months of TURN server hell, we threw it out and replaced it with plain WebSockets.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.helix.ml/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading HelixML! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>The result? Lower latency, simpler infrastructure, and it works everywhere.</p><p><em>(Spoiler: This solution worked so well that we eventually threw it away too. But that&#8217;s a story for next week.)</em></p><h2>The Problem With WebRTC</h2><p>WebRTC is designed for peer-to-peer video calls. It handles NAT traversal, codec negotiation, adaptive bitrate, and packet loss recovery. It&#8217;s an impressive piece of engineering.</p><p>But we don&#8217;t need peer-to-peer. 
Our architecture is strictly client-server:</p><pre><code><code>Browser &#8594; Proxy &#8594; moonlight-web &#8594; Wolf (GPU encoder)</code></code></pre><p>For this use case, WebRTC&#8217;s complexity becomes pure liability.</p><h3>TURN Server Hell</h3><p>Enterprise customers don&#8217;t allow random UDP ports. They have L7 load balancers that only speak HTTP/HTTPS on port 443. WebRTC requires:</p><ul><li><p>UDP 3478 (STUN)</p></li><li><p>TCP 3478 (TURN)</p></li><li><p>UDP 49152-65535 (media relay)</p></li></ul><p>Getting these through a corporate firewall? Good luck. We spent weeks debugging TURN configurations. coturn, Twilio, custom deployments&#8212;each had its own failure modes. The &#8220;TCP fallback&#8221; that TURN promises? In practice, it&#8217;s unreliable and adds 50-100ms of latency.</p><p>We had a customer whose WebRTC connections worked 80% of the time. The other 20%? Black screen. No error message. WebRTC&#8217;s ICE negotiation would silently fail after 30 seconds of &#8220;connecting...&#8221;</p><h3>The Insight</h3><p>Here&#8217;s the thing: <strong>WebSockets work everywhere</strong>. They&#8217;re just HTTP upgrade. Every L7 proxy handles them. CloudFlare, Akamai, nginx, Kubernetes ingress&#8212;all work out of the box.</p><p>And for real-time video, WebSockets might actually be <em>faster</em> than WebRTC in our architecture:</p><ol><li><p><strong>No jitter buffer</strong> - We can render frames immediately</p></li><li><p><strong>No TURN relay</strong> - Direct connection through existing proxy</p></li><li><p><strong>No ICE negotiation</strong> - Connection established in one round-trip</p></li></ol><p>The trade-off is TCP&#8217;s head-of-line blocking. But on modern networks with low packet loss? Barely matters.</p><h2>The Implementation</h2><p>We stream using the <a href="https://github.com/moonlight-stream">Moonlight protocol</a>&#8212;the same tech that powers NVIDIA GameStream. 
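The framing we landed on (detailed in the Binary Protocol section below) is about as simple as it gets: one type byte, then the raw payload. A sketch of the encode/decode pair &#8212; the helper names are ours, not the actual implementation:

```typescript
// Sketch of 1-byte-type message framing: [type][payload...].
// Type values match the table later in this post; helpers are illustrative.
const VIDEO_FRAME = 0x01;

function encodeMessage(type: number, payload: Uint8Array): Uint8Array {
  const buf = new Uint8Array(1 + payload.length);
  buf[0] = type;       // 1-byte message type
  buf.set(payload, 1); // raw payload (e.g. H.264 NAL units)
  return buf;
}

function decodeMessage(buf: Uint8Array): { type: number; payload: Uint8Array } {
  return { type: buf[0], payload: buf.subarray(1) };
}
```

Each WebSocket binary message is one such frame, so there is no packetization layer to reassemble on either side.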
Wolf (our server) encodes video with NVIDIA&#8217;s hardware encoder. The browser decodes and displays it.</p><p>Previously, our architecture looked like this:</p><pre><code><code>Wolf &#8594; Moonlight &#8594; [RTP packets] &#8594; WebRTC &#8594; Browser
                                    &#8593;
                                 TURN server</code></code></pre><p>Now it&#8217;s:</p><pre><code><code>Wolf &#8594; Moonlight &#8594; [NAL units] &#8594; WebSocket &#8594; Browser
                                    &#8593;
                              Your existing HTTPS</code></code></pre><h3>Binary Protocol</h3><p>We defined a minimal binary protocol:</p><pre><code><code>&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
&#9474; Type (1B)  &#9474; Payload (variable)                 &#9474;
&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9524;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;

Message Types:
  0x01 - Video Frame
  0x02 - Audio Frame
  0x10 - Keyboard Input
  0x11 - Mouse Click
  0x12 - Mouse Position
  0x13 - Mouse Movement</code></code></pre><p>Video frames are raw H264 NAL units&#8212;no RTP packetization. Audio is Opus frames. Input goes the other direction.</p><h3>WebCodecs for Decoding</h3><p>The browser-side uses the <a href="https://developer.mozilla.org/en-US/docs/Web/API/WebCodecs_API">WebCodecs API</a>, which landed in Chrome 94 and recently in Firefox 130:</p><pre><code><code>const decoder = new VideoDecoder({
  output: (frame) =&gt; {
    ctx.drawImage(frame, 0, 0)
    frame.close()
  },
  error: console.error,
})

decoder.configure({
  codec: 'avc1.4d0032',  // H264 Main Profile
  hardwareAcceleration: 'prefer-hardware',
  avc: { format: 'annexb' },  // NAL unit format
})
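```typescript
// Sketch only -- these helpers are assumptions, not from the original post:
// the real header layout is defined by our binary protocol. Assumed here:
// 1 byte message type followed by an 8-byte big-endian PTS in microseconds.
const HEADER_SIZE = 9

function parsePTS(data: Uint8Array): number {
  const view = new DataView(data.buffer, data.byteOffset + 1, 8)
  return Number(view.getBigUint64(0))  // DataView reads big-endian by default
}
```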

ws.onmessage = (event) =&gt; {
  const data = new Uint8Array(event.data)
  if (data[0] === 0x01) {  // Video frame
    decoder.decode(new EncodedVideoChunk({
      type: isKeyframe ? 'key' : 'delta',  // isKeyframe is derived from the frame header (elided)
      timestamp: parsePTS(data),
      data: data.slice(HEADER_SIZE),
    }))
  }
}</code></code></pre><p>Hardware-accelerated H264 decoding, straight to canvas. No MediaSource buffering. No jitter buffer. Frame arrives, frame renders.</p><h3>Audio Sync</h3><p>Audio uses the same approach with <code>AudioDecoder</code> and <code>AudioContext</code>. We schedule playback based on presentation timestamps:</p><pre><code><code>const scheduledTime = audioStartTime + (framePTS - basePTS) / 1_000_000
source.start(Math.max(scheduledTime, audioContext.currentTime))</code></code></pre><p>First audio frame establishes the baseline. Subsequent frames are scheduled relative to it. If a frame arrives too late (&gt;100ms behind), we drop it rather than accumulating latency.</p><h3>Input Forwarding</h3><p>Input goes the other direction&#8212;same WebSocket, same binary format. We reuse the existing Moonlight input protocol:</p><pre><code><code>sendMouseButton(isDown: boolean, button: number) {
  const buf = new Uint8Array([0x02, isDown ? 1 : 0, button])
  ws.send(new Uint8Array([0x11, ...buf]))  // 0x11 = MouseClick
}</code></code></pre><p>Server parses and forwards to the Moonlight stream, which injects into the Linux input subsystem. Click in browser &#8594; click in remote desktop.</p><h2>What We Lost</h2><p>Nothing is free. Here&#8217;s what WebRTC gave us that we had to handle ourselves:</p><h3>1. Adaptive Bitrate</h3><p>WebRTC monitors network conditions and adjusts bitrate automatically. We don&#8217;t. Our bitrate is fixed at connection time. For enterprise deployments on stable networks, this is fine. For variable mobile connections, it might be a problem.</p><h3>2. Packet Loss Recovery</h3><p>WebRTC uses NACK and PLI to request retransmission of lost packets. With TCP, we get reliable delivery but head-of-line blocking. A lost packet stalls the stream until retransmitted.</p><p>In practice? On datacenter-quality networks, packet loss is rare. When it happens, TCP recovers fast enough that users don&#8217;t notice.</p><h3>3. Browser Fallbacks</h3><p>WebCodecs requires Chrome 94+, Safari 16.4+, or Firefox 130+. Older browsers get nothing. We could add MSE-based fallback, but haven&#8217;t needed it&#8212;our users are on modern browsers.</p><h2>What We Gained</h2><h3>Works Everywhere</h3><p>Literally everywhere. No firewall configuration. No TURN servers. No debugging ICE negotiation. The WebSocket connection just... works.</p><h3>Simpler Infrastructure</h3><p>Before:</p><ul><li><p>coturn TURN server (or Twilio, $$$)</p></li><li><p>STUN server</p></li><li><p>ICE configuration management</p></li><li><p>Certificate management for TURN-over-TLS</p></li><li><p>UDP port ranges</p></li></ul><p>After:</p><ul><li><p>Your existing HTTPS proxy</p></li></ul><h3>Lower Latency</h3><p>Without the jitter buffer and TURN relay, we measured 20-30ms lower end-to-end latency. WebRTC&#8217;s adaptive bitrate sometimes caused quality drops that took seconds to recover. Our fixed bitrate is... fixed.</p><h3>Debuggability</h3><p>WebRTC failures are famously opaque. 
&#8220;ICE connection failed&#8221; tells you nothing. WebSocket failures? You get HTTP status codes, error messages, stack traces. When something breaks, you know why.</p><h2>Should You Do This?</h2><p>Probably not, unless:</p><ol><li><p><strong>Your architecture is client-server</strong> - Peer-to-peer genuinely needs WebRTC</p></li><li><p><strong>Your users are behind restrictive firewalls</strong> - If TURN works for you, keep using it</p></li><li><p><strong>You control the encoder</strong> - We use Moonlight/Wolf which gives us raw NAL units</p></li><li><p><strong>Your target browsers support WebCodecs</strong> - No IE11 here</p></li></ol><p>But if you&#8217;re building real-time video streaming to browsers, and WebRTC&#8217;s complexity is killing you, know that there&#8217;s another way.</p><h2>The Code</h2><p>Both repos are open source:</p><ul><li><p><strong><a href="https://github.com/helixml/helix">helix</a></strong> - The frontend + API (TypeScript/React/Go)</p></li><li><p><strong><a href="https://github.com/helixml/moonlight-web-stream">moonlight-web-stream</a></strong> - The streaming server (Rust)</p></li></ul><p>The WebSocket streaming code is on the <code>feature/websocket-only-streaming</code> branch. Look for <code>WebSocketStream</code> in the TypeScript and <code>run_websocket_only_mode</code> in the Rust.</p><div><hr></div><p><em>We&#8217;re building AI coding agents that work in GPU-accelerated containers. If you&#8217;re interested in remote development environments, AI pair programming, or just want to see this streaming tech in action, check out <a href="https://helix.ml">helix.ml</a>.</em></p><p><em>Next week: Why we threw all of this away.</em></p><p><em>&#8212;Luke Marsden, CEO @ Helix </em></p><h2>Discussion Questions </h2><ol><li><p>Has anyone else replaced WebRTC with WebSockets for real-time video? What was your experience?</p></li><li><p>We&#8217;re considering adding WebTransport as an alternative to WebSockets. 
Anyone have experience with it in production?</p></li><li><p>The WebCodecs API is relatively new. Are there edge cases we should watch out for?</p></li></ol><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.helix.ml/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading HelixML! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Why I've Joined HelixML]]></title><description><![CDATA[Introducing our new Head of Engineering, Priya Samuel]]></description><link>https://blog.helix.ml/p/why-ive-joined-helixml</link><guid isPermaLink="false">https://blog.helix.ml/p/why-ive-joined-helixml</guid><dc:creator><![CDATA[Priya Samuel]]></dc:creator><pubDate>Thu, 13 Nov 2025 18:23:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!CEy1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7651c7f-e45b-4390-96d6-6c559a9769b2_3000x2250.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;ve joined HelixML as <strong>Head of Engineering</strong>. 
I&#8217;m genuinely pleased to be working alongside Luke, Chris, and Phil - and a team that values software craftsmanship and building great products just as much as cultivating a thoughtful, positive company culture.</p><p>Sometimes the start of a new chapter comes from noticing a pattern you can&#8217;t ignore. Over the past year, I kept meeting teams who were excited about what AI could do but were quietly overwhelmed by what it actually took to run it well &#8212; the infrastructure, the privacy concerns, and the messy real-world constraints. I found myself increasingly drawn to those conversations, to the space between possibility and practicality, and to the people trying to bridge it with clarity instead of hype.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.helix.ml/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading HelixML! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CEy1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7651c7f-e45b-4390-96d6-6c559a9769b2_3000x2250.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CEy1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7651c7f-e45b-4390-96d6-6c559a9769b2_3000x2250.png 424w, https://substackcdn.com/image/fetch/$s_!CEy1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7651c7f-e45b-4390-96d6-6c559a9769b2_3000x2250.png 848w, https://substackcdn.com/image/fetch/$s_!CEy1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7651c7f-e45b-4390-96d6-6c559a9769b2_3000x2250.png 1272w, https://substackcdn.com/image/fetch/$s_!CEy1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7651c7f-e45b-4390-96d6-6c559a9769b2_3000x2250.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CEy1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7651c7f-e45b-4390-96d6-6c559a9769b2_3000x2250.png" 
width="1456" height="1092" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a7651c7f-e45b-4390-96d6-6c559a9769b2_3000x2250.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1092,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:7301832,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.helix.ml/i/177917486?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7651c7f-e45b-4390-96d6-6c559a9769b2_3000x2250.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CEy1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7651c7f-e45b-4390-96d6-6c559a9769b2_3000x2250.png 424w, https://substackcdn.com/image/fetch/$s_!CEy1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7651c7f-e45b-4390-96d6-6c559a9769b2_3000x2250.png 848w, https://substackcdn.com/image/fetch/$s_!CEy1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7651c7f-e45b-4390-96d6-6c559a9769b2_3000x2250.png 1272w, https://substackcdn.com/image/fetch/$s_!CEy1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7651c7f-e45b-4390-96d6-6c559a9769b2_3000x2250.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Kubecon 2025 - A room packed with enthusiasm!</figcaption></figure></div><p></p><p><strong>My Background</strong></p><p> My background is in MLOps and Identity &amp; Access Management &#8212; building trustworthy, scalable AI systems at the intersection of infrastructure, identity, and machine learning. Over the years, I&#8217;ve led engineering teams at companies like <strong>Elsevier, Dotscience, </strong>and<strong> ThoughtWorks</strong>, helping these organisations bring structure and discipline to complex applications and data platforms. 
What&#8217;s always driven me is creating systems that are both technically sound and human-friendly: clear, open, and built to last.</p><p>Helix is <a href="https://blog.helix.ml/p/building-a-generative-ai-platform">building in the open</a> &#8212; iterating in public, listening to customers, and treating transparency as a strength. </p><p><strong>Why Private GenAI Matters</strong></p><p>One thing my career has taught me is that AI itself is rarely the <em>only</em> hard part. The hard part is everything wrapped around it: secure identities, reliable infrastructure, clear deployment patterns, safety checks, evaluation loops, and the long tail of operational detail that turns a clever model into a dependable system. Helix gets that. We&#8217;re not just building AI; we&#8217;re building the layers that make AI usable, safe, and sustainable inside the boundaries where real organisations operate.</p><p>Helix is tackling a problem every enterprise now faces: how to adopt generative AI without giving up control of data, compliance, or infrastructure. With the release of Helix 2.0, teams can deploy production-ready AI agents on their own infrastructure &#8212; with real CI/CD, testing, versioning, and observability.</p><p>As organisations mature in their AI adoption, private GenAI platforms offer something essential: a reliable, accountable, and transparent path forward.</p><p><strong>Looking Ahead</strong></p><p>As Head of Engineering, my priorities are simple: build sustainably, deliver tangible outcomes for customers, and share our learnings openly as we go.</p><p>Helix represents the kind of company I believe in &#8212; transparent, values-driven, and focused on solving real problems with care and craft. I&#8217;m excited to help shape its next chapter.</p><p>Here&#8217;s to building something meaningful &#8212; and doing it the right way. 
</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.helix.ml/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading HelixML! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Dynamically presenting MCP clients as tools to LLMs in Go]]></title><description><![CDATA[Learn how to transform any MCP server into OpenAI tools that your agent can call without hardcoding any details]]></description><link>https://blog.helix.ml/p/dynamically-presenting-mcp-clients</link><guid isPermaLink="false">https://blog.helix.ml/p/dynamically-presenting-mcp-clients</guid><pubDate>Thu, 06 Nov 2025 13:16:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!QChQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68f41701-b300-4d96-8dcd-38c524381367_1280x720.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The most powerful thing (and the weakest part too :)) is how dynamic the agents are. Depending on the scenario, the applications that encapsulate the LLMs must either relax or tighten the grip to get good results.</p><p>Today we will take a look at a &#8220;relaxed&#8221; plumbing that will enable maximum dynamic behavior. 
We will take in an arbitrary number of remote MCP servers, extract available actions and on the fly convert them to OpenAI tools for our agent to use.</p><p>In this article we will look into main components but for the full code you can check <a href="https://github.com/helixml/helix">our repo</a>:</p><ul><li><p>Store MCP servers</p></li><li><p>Discover MCP capabilities (fetch tools)</p></li><li><p>Convert individual MCP actions into tools (MCP schema &gt; OpenAI tool schema)</p></li></ul><h2>MCP and OpenAI Tools</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3OnA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3416e44-a176-4cb3-a837-838682f63739_889x500.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3OnA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3416e44-a176-4cb3-a837-838682f63739_889x500.jpeg 424w, https://substackcdn.com/image/fetch/$s_!3OnA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3416e44-a176-4cb3-a837-838682f63739_889x500.jpeg 848w, https://substackcdn.com/image/fetch/$s_!3OnA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3416e44-a176-4cb3-a837-838682f63739_889x500.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!3OnA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3416e44-a176-4cb3-a837-838682f63739_889x500.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!3OnA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3416e44-a176-4cb3-a837-838682f63739_889x500.jpeg" width="646" height="363.32958380202473" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a3416e44-a176-4cb3-a837-838682f63739_889x500.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:500,&quot;width&quot;:889,&quot;resizeWidth&quot;:646,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3OnA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3416e44-a176-4cb3-a837-838682f63739_889x500.jpeg 424w, https://substackcdn.com/image/fetch/$s_!3OnA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3416e44-a176-4cb3-a837-838682f63739_889x500.jpeg 848w, https://substackcdn.com/image/fetch/$s_!3OnA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3416e44-a176-4cb3-a837-838682f63739_889x500.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!3OnA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3416e44-a176-4cb3-a837-838682f63739_889x500.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I assume you already know about <a href="https://modelcontextprotocol.io/docs/getting-started/intro">MCPs</a> and <a href="https://platform.openai.com/docs/guides/tools">OpenAI Tools</a> already but I will just quickly brief you about them regardless. </p><p>Tools are presented as name + description + parameters to the LLMs and they are trained/whipped into calling them with non-malformed payloads. They more or less work well, the trick is to keep the schemas relatively simple and provide a few examples. </p><div class="pullquote"><p>&#9888;&#65039; Large schemas (with lots of parameters) can quickly <br>deteriorate even smartest models. 
&#9888;&#65039;</p></div><p>An example tool presentation to the LLM:</p><pre><code>import OpenAI from &#8220;openai&#8221;;
const client = new OpenAI();

const response = await client.responses.create({
    model: "gpt-5",
    tools: [
        { type: "web_search" },
    ],
    input: "What was a positive news story from today?",
});

console.log(response.output_text);</code></pre><p>An example MCP presentation from the <a href="https://github.com/modelcontextprotocol/typescript-sdk">docs</a>:</p><pre><code>...
// List resources
const resources = await client.listResources();

// Read a resource
const resource = await client.readResource({
    uri: 'file:///example.txt'
});

// Call a tool
const result = await client.callTool({
    name: 'example-tool',
    arguments: {
        arg1: 'value'
    }
});
...</code></pre><p>However, this is <strong>not even close</strong> to actual usage; it gets quite verbose, as you end up using a bunch of SDKs and potentially need to marry a third-party framework like LangChain to prep the MCP for use within your application.</p><h2>Keeping track of configured MCP servers</h2><p>To make things work well, my minimal Go struct for MCP configuration ended up as:</p><pre><code>type ToolMCPClientConfig struct {
&#9;Name          string            
&#9;Description   string            
&#9;Enabled       bool              
&#9;URL           string            
&#9;Headers       map[string]string
&#9;OAuthProvider string            
&#9;Tools         []mcp.Tool
}</code></pre><div class="pullquote"><p>This ends up stored in Postgres on a slightly larger struct that contains ID, user ID and agent info, but that&#8217;s not important for this example.</p></div><p>I chose to support only the SSE/streaming HTTP options, as socket-based servers are probably on the way out due to how limited they are, and it&#8217;s also just too painful to get them running on the server side. What each field is:</p><ul><li><p><em><strong>Name</strong></em> - this will be presented as the top-level tool name to the LLM</p></li><li><p><em><strong>Description</strong></em> - when to use this tool (e.g. if it&#8217;s a Postgres MCP); the user adding this MCP server should try to be descriptive here</p></li><li><p><em><strong>URL</strong></em> - where to find the server</p></li><li><p><em><strong>Headers</strong></em> <strong>and OAuthProvider</strong> are optional, but we can use them for authentication later on</p></li><li><p><em><strong>Tools</strong></em> - read-only; the user is not supposed to enter them, we grab the actions from the server during the initial handshake. Let&#8217;s touch on this in the next section :) </p></li></ul><h2>Initial MCP handshake</h2><p>When agents are in their agent cycle mode you don&#8217;t want to call the MCP server all the time, as you would be adding significant latency. </p><p>First we <a href="https://github.com/helixml/helix/blob/main/api/pkg/agent/skill/mcp/mcp_client.go#L37">construct the client</a>:</p><pre><code>...
// Initialize the MCP session
initRequest := mcp.InitializeRequest{
  Params: mcp.InitializeParams{
    ProtocolVersion: mcp.LATEST_PROTOCOL_VERSION,
    Capabilities:    mcp.ClientCapabilities{},
    ClientInfo: mcp.Implementation{
      Name:    "helix-http-client",
      Version: data.GetHelixVersion(),
    },
  },
}

_, err = mcpClient.Initialize(ctx, initRequest)
if err != nil {
  return nil, err
}
return mcpClient, nil</code></pre><p>Then connect, authenticate and list the tools:</p><pre><code>import (
  ...
  "github.com/mark3labs/mcp-go/mcp"
)

func InitializeMCPClientSkill(ctx context.Context, clientGetter ClientGetter, meta agent.Meta, oauthManager *oauth.Manager, cfg *types.AssistantMCP) (*types.ToolMCPClientConfig, error) {
  mcpClient, err := clientGetter.NewClient(ctx, meta, oauthManager, cfg)
  if err != nil {
    return nil, err
  }

  // List tools, server description
  toolsResp, err := mcpClient.ListTools(ctx, mcp.ListToolsRequest{})
  if err != nil { 
    return nil, err
  } 
  return &amp;types.ToolMCPClientConfig{
    Name:        cfg.Name,
    Description: cfg.Description,
    Tools:       toolsResp.Tools, 
  }, nil
}</code></pre><p>Here the retrieved tools will act as a cache for all the iterations the agent goes through. </p><h2>Presenting MCP Tools as OpenAI Tools</h2><p>The first thing that needs to happen for the LLM to be able to use a tool is seeing its name, description and parameters. </p><p>For each OpenAI tool within Helix we define an interface where one of the methods is:</p><pre><code>func (t *MCPClientTool) OpenAI() []openai.Tool {
  return []openai.Tool{
    {
      Type: openai.ToolTypeFunction,
      Function: &amp;openai.FunctionDefinition{
        Name:        "mcp_" + t.mcpTool.Name, // prefix to namespace tools from duplicate MCPs
        Description: t.mcpTool.Description,
        Parameters:  buildParameters(t.mcpTool.InputSchema),
      },
    },
  }
}</code></pre><p>Most of the useful work happens in buildParameters, which <a href="https://github.com/helixml/helix/blob/main/api/pkg/agent/skill/mcp/mcp_skill.go#L138-L265">can be found here</a>. To get an idea, we have to recursively convert a <em>map[string]any</em> into a <em>jsonschema.Definition</em>:</p><pre><code>func convertMapToDefinition(data map[string]any) jsonschema.Definition {
  def := jsonschema.Definition{}

  // Handle type - ensure we always have a valid type
  if typeVal, ok := data["type"].(string); ok &amp;&amp; typeVal != "" {
    switch typeVal {
    case "string":
      def.Type = jsonschema.String
    case "integer":
      def.Type = jsonschema.Integer
    ...
    ...
  // Handle properties (recursive)
  if props, ok := data["properties"].(map[string]any); ok {
    properties := make(map[string]jsonschema.Definition)
    for key, prop := range props {
      if propMap, ok := prop.(map[string]any); ok {
        properties[key] = convertMapToDefinition(propMap)
      }
    }
    if len(properties) &gt; 0 {
      def.Properties = properties
      def.Type = jsonschema.Object
    }
  ...
  // Handle items (for arrays)
  if items, ok := data["items"].(map[string]any); ok {
    itemsDef := convertMapToDefinition(items)
    def.Items = &amp;itemsDef
  }
  return def
}</code></pre><p>The goal here is to take whatever comes from the MCP side and convert it to the OpenAI tools format. Both are JSON in the end; however, the formats differ slightly.</p><blockquote><p>I have noticed that certain MCP tools work with some models but not with others. A good example is the HubSpot MCP not working with Google Gemini models but working fine with OpenAI ones.</p></blockquote><h2>Calling the MCP tools from your agent</h2><p>Thankfully, the MCP tool parameter conversion is the only hard part; once that&#8217;s done, the actual execution is pretty easy:</p><pre><code>import (
  ...
  "github.com/mark3labs/mcp-go/mcp"
)

func (t *MCPClientTool) Execute(ctx context.Context, meta agent.Meta, args map[string]any) (string, error) {
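  // Each tool execution builds a fresh MCP client; the clientGetter takes
  // care of OAuth tokens and custom headers for the target server.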
  client, err := t.clientGetter.NewClient(ctx, meta, t.oauthManager, &amp;types.AssistantMCP{
    URL:           t.cfg.URL,
    Headers:       t.cfg.Headers, 
    OAuthProvider: t.cfg.OAuthProvider,
    OAuthScopes:   t.cfg.OAuthScopes,
  })
  if err != nil {
    return "", err
  }

  req := mcp.CallToolRequest{}
  req.Params.Name = t.mcpTool.Name 
  req.Params.Arguments = args
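
  // For example (hypothetical tool), a weather skill would send
  //   Name: "get_weather", Arguments: map[string]any{"city": "London"}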

  res, err := client.CallTool(ctx, req)

  if err != nil {...}
  
  var results []string
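  // Collect only the text content parts from the response; MCP content
  // can also carry other types (images, resources).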
  for _, content := range res.Content {
    switch content := content.(type) {
    case mcp.TextContent:
      results = append(results, content.Text)
   ...</code></pre><p>That&#8217;s it, your agent can now talk to any MCP server directly. </p><h2>Ok, you are ready to face the world</h2><p>First thing to do is of course connecting your agent to the production database! </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QChQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68f41701-b300-4d96-8dcd-38c524381367_1280x720.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QChQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68f41701-b300-4d96-8dcd-38c524381367_1280x720.jpeg 424w, https://substackcdn.com/image/fetch/$s_!QChQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68f41701-b300-4d96-8dcd-38c524381367_1280x720.jpeg 848w, https://substackcdn.com/image/fetch/$s_!QChQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68f41701-b300-4d96-8dcd-38c524381367_1280x720.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!QChQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68f41701-b300-4d96-8dcd-38c524381367_1280x720.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QChQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68f41701-b300-4d96-8dcd-38c524381367_1280x720.jpeg" width="1280" height="720" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/68f41701-b300-4d96-8dcd-38c524381367_1280x720.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Monkey With AK-47 - Coub&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Monkey With AK-47 - Coub" title="Monkey With AK-47 - Coub" srcset="https://substackcdn.com/image/fetch/$s_!QChQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68f41701-b300-4d96-8dcd-38c524381367_1280x720.jpeg 424w, https://substackcdn.com/image/fetch/$s_!QChQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68f41701-b300-4d96-8dcd-38c524381367_1280x720.jpeg 848w, https://substackcdn.com/image/fetch/$s_!QChQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68f41701-b300-4d96-8dcd-38c524381367_1280x720.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!QChQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68f41701-b300-4d96-8dcd-38c524381367_1280x720.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Good luck! 
</p>]]></content:encoded></item><item><title><![CDATA[Technical Deep Dive on Streaming AI Agent Desktop Sandboxes: When Gaming Protocols Meet Multi-User Access]]></title><description><![CDATA[When we started building sandboxes for AI agents at Helix, we wanted to give each agent their own desktop environments that we could stream interactively to users&#8217; browsers.]]></description><link>https://blog.helix.ml/p/technical-deep-dive-on-streaming</link><guid isPermaLink="false">https://blog.helix.ml/p/technical-deep-dive-on-streaming</guid><dc:creator><![CDATA[Luke Marsden]]></dc:creator><pubDate>Thu, 30 Oct 2025 17:47:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!XcJA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2be9ee1-3b89-4c45-919e-a5bfd3a8013a_2710x1612.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When we started building sandboxes for AI agents at Helix, we wanted to give each agent their own desktop environments that we could stream interactively to users&#8217; browsers. Not just static screenshots - full interactive desktops where agents could browse the web, write code, and use tools, in collaboration with their human colleagues. We looked at VNC, RDP, and various browser-based solutions, but kept coming back to Moonlight.</p><p>Moonlight is a game streaming protocol, originally designed to stream PC games to your couch. It&#8217;s fast, efficient, and works beautifully over sketchy network connections. There was just one problem: it was built for single-player gaming, and we needed multi-user agent access.</p><p>This is the story of how we bent a gaming protocol to our will, and why we&#8217;re still working through the consequences.</p><h2>Why Stream Desktops for AI Agents?</h2><p>Most AI coding assistants live in your IDE or terminal. But what if your agent needs to actually see the screen, click buttons, navigate UIs? 
What if you want to watch your agent work in real-time, or collaborate with it in a proper IDE? And what if you want your agent to be able to run on the server while it does all of this, so it can benefit from a good network connection while you open and close your laptop in cafes, on the train, even on the beach?</p><p>That&#8217;s what we&#8217;re building with Helix Code. We run full Linux desktop environments in containers, each with a GPU attached. Inside each desktop runs an AI agent with access to development tools - Claude, code editors, browsers, terminals. Users connect to watch and interact with these agents as they work. And also get a 30,000-foot view of their fleet of agents, because we&#8217;re all going to become managers of coding agents whether we like it or not.</p><p>The challenge: how do you efficiently stream these GPU-accelerated desktops to browsers and native clients, with low latency, across variable network conditions?</p><h2>Enter Moonlight (and Wolf)</h2><p>The Moonlight protocol was originally created by NVIDIA for their GameStream technology. It&#8217;s designed to stream high-framerate, low-latency video from a gaming PC to another device. Think playing Cyberpunk on your iPad from your gaming rig upstairs.</p><p>We use <a href="https://github.com/games-on-whales/wolf">Wolf</a>, a C++ implementation of the Moonlight server that runs in containers. Wolf exposes the Moonlight protocol, and clients can connect using Moonlight-web in the browser or native Moonlight clients on Mac, Windows, Linux, Android, iOS.</p><p>The setup is elegant: Wolf manages Docker containers with GPU attachment, Moonlight handles the video streaming, and we get hardware-accelerated desktop streaming working smoothly over 4G.</p><p>There&#8217;s just one catch.</p><h2>The Protocol Mismatch</h2><p>Moonlight was designed around a simple mental model: one user, streaming one game at a time. You connect, you launch Steam, you play. 
You can disconnect and reconnect, and your game is still running. But each client gets its own instance.</p><p>Here&#8217;s where it breaks for us:</p><p><strong>Moonlight expects</strong>: Each client connects to start their own private game session<br><strong>We need</strong>: Multiple users connecting to the same shared agent session</p><p>In Moonlight&#8217;s world, if two clients try to start Steam, they each get separate Steam instances. That&#8217;s great for gaming - you wouldn&#8217;t want your roommate&#8217;s controller inputs affecting your game.</p><p>But for us, if two people connect to the same AI agent, we don&#8217;t want two separate agent instances. We want them both watching and potentially interacting with the same agent doing the same work. The agent has identity and state - it&#8217;s logged into services, it has files open, it&#8217;s in the middle of tasks.</p><p>The semantics just don&#8217;t match.</p><h2>Apps Mode: Our First Workaround</h2><p>In &#8220;apps mode&#8221; (standard Moonlight protocol), Wolf creates containers on-demand when the first client connects. This presents another problem: when does the agent actually start?</p><p>We want agents to start automatically when users drag tasks onto a Kanban board, or when the system kicks off autonomous work. We can&#8217;t wait for someone to connect with a browser before the agent starts running.</p><p>Our solution was a bit of a hack: the Helix API pretends to be a Moonlight client.</p><p>When Helix starts a new agent session, it makes a WebSocket connection to Moonlight-web, pretending to be a browser. It initiates a &#8220;kickoff session&#8221; that starts the container and establishes fixed video parameters (4K, 60fps). Then it immediately disconnects.</p><p>Now the agent is running, the desktop is up, and real users can connect to it.</p><p>But we still have the multi-client problem. 
If someone connects with an external Moonlight client and starts an agent, they get a completely separate container from the one running in the browser. You end up with multiple &#8220;Zed&#8221; IDE instances, all thinking they&#8217;re the same agent, all trying to stream back, treading on each other&#8217;s toes.</p><p>Apps mode is stable, but it&#8217;s fundamentally single-user.</p><h2>Lobbies Mode: The Real Solution</h2><p>Wolf recently added &#8220;lobbies mode&#8221; - a feature explicitly designed for multiplayer gaming scenarios. Split-screen gaming, multiple controllers, shared screens.</p><p>This is exactly what we need.</p><p>In lobbies mode:</p><ul><li><p>You start a lobby through Wolf&#8217;s API</p></li><li><p>The container starts immediately (no need for our kickoff hack)</p></li><li><p>Multiple clients can connect to the same lobby</p></li><li><p>Everyone sees the same screen</p></li><li><p>Screen resolution is pre-configured, not determined by the first connecting client</p></li></ul><p>We&#8217;re currently migrating to lobbies mode. It solves our fundamental architecture problems:</p><ul><li><p>Multiple users can connect to the same agent</p></li><li><p>Agents start without any client connection needed</p></li><li><p>Browser and native clients can connect to the same session</p></li><li><p>We can delete all the kickoff session complexity</p></li></ul><h2>The Current Reality (And Remaining Bugs)</h2><p>Lobbies mode is still being stabilized. A few weeks ago it had memory leaks and stability issues. The Wolf maintainer has done heroic work making it production-ready, but we&#8217;re still ironing out bugs:</p><p><strong>Input scaling is broken</strong>: When you connect with a different screen resolution than the lobby was configured for, Wolf rescales the video correctly, but mouse coordinates scale wrong. 
Click where you see a button, hit somewhere else entirely.</p><p><strong>Video corruption on some clients</strong>: Connecting from Mac sometimes results in corrupted video streams. Still debugging.</p><p><strong>Resolution flexibility</strong>: In apps mode, each client could negotiate its own optimal resolution. In lobbies mode, we pre-configure the resolution when creating the agent. We let users choose (including &#8220;iPhone 15 vertical&#8221; because streaming to phones would be cool), but it&#8217;s less dynamic.</p><p>We&#8217;re running apps mode for development right now because it&#8217;s stable, even with its limitations. But lobbies mode is the future.</p><h2>What This Looks Like In Practice</h2><p>Here&#8217;s the architecture:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XcJA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2be9ee1-3b89-4c45-919e-a5bfd3a8013a_2710x1612.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XcJA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2be9ee1-3b89-4c45-919e-a5bfd3a8013a_2710x1612.png 424w, https://substackcdn.com/image/fetch/$s_!XcJA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2be9ee1-3b89-4c45-919e-a5bfd3a8013a_2710x1612.png 848w, https://substackcdn.com/image/fetch/$s_!XcJA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2be9ee1-3b89-4c45-919e-a5bfd3a8013a_2710x1612.png 1272w, 
https://substackcdn.com/image/fetch/$s_!XcJA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2be9ee1-3b89-4c45-919e-a5bfd3a8013a_2710x1612.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XcJA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2be9ee1-3b89-4c45-919e-a5bfd3a8013a_2710x1612.png" width="1456" height="866" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d2be9ee1-3b89-4c45-919e-a5bfd3a8013a_2710x1612.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:866,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:576725,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.helix.ml/i/177309822?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2be9ee1-3b89-4c45-919e-a5bfd3a8013a_2710x1612.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XcJA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2be9ee1-3b89-4c45-919e-a5bfd3a8013a_2710x1612.png 424w, https://substackcdn.com/image/fetch/$s_!XcJA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2be9ee1-3b89-4c45-919e-a5bfd3a8013a_2710x1612.png 848w, 
https://substackcdn.com/image/fetch/$s_!XcJA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2be9ee1-3b89-4c45-919e-a5bfd3a8013a_2710x1612.png 1272w, https://substackcdn.com/image/fetch/$s_!XcJA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2be9ee1-3b89-4c45-919e-a5bfd3a8013a_2710x1612.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Helix API</strong>: Manages agent sessions, talks to Wolf to create/destroy containers<br><strong>Moonlight-web</strong>: 
WebRTC adapter that bridges browser clients to Moonlight protocol<br><strong>Wolf</strong>: Moonlight server running in Kubernetes, managing GPU-attached containers<br><strong>Desktop containers</strong>: Sway (Wayland compositor) running on gst-wayland-src, with full desktop environment<br><strong>External clients</strong>: Native Moonlight clients on Mac/Windows/Linux/iOS/Android</p><p>The video stream uses WebRTC from browser to Moonlight-web, then Moonlight protocol from there to Wolf. Control signals (connection &amp; encryption setup) flow through websockets. Wolf handles GPU attachment and video encoding. The desktop runs real GUI applications in GPU-accelerated Wayland, not VNC or RDP forwarding.</p><p>You can watch an AI agent browse the web, write code in a real IDE, run commands in a real terminal, all streamed to your browser with gaming-grade latency.</p><h2>Why This Matters</h2><p>Streaming protocols matter a lot when you&#8217;re building visual AI agents. The latency, video quality, and network resilience all affect the user experience. Moonlight gives us:</p><ul><li><p><strong>Low latency</strong>: 50-100ms typically, works over 4G</p></li><li><p><strong>Hardware encoding</strong>: GPU-accelerated H.264/H.265</p></li><li><p><strong>Network resilience</strong>: Designed for unreliable wireless</p></li><li><p><strong>Multi-platform</strong>: Works everywhere without custom apps</p></li><li><p><strong>Mature protocol</strong>: Battle-tested by millions of gamers</p></li></ul><p>But we had to work within constraints designed for different semantics. Gaming protocols assume private, single-user sessions. AI agents need shared, multi-user sessions. The impedance mismatch creates real engineering challenges.</p><h2>What We Learned</h2><p><strong>Protocol assumptions run deep</strong>: Even when a protocol is technically capable of what you need, the assumptions baked into the design can bite you. 
Moonlight&#8217;s one-app-per-client model is fundamental.</p><p><strong>Workarounds compound complexity</strong>: Our kickoff session hack worked, but added a whole layer of complexity. Sometimes you need to wait for the right feature (lobbies) rather than building around limitations.</p><p><strong>Multiplayer gaming has solved this</strong>: The gaming community has already solved shared-screen streaming. We just needed to find the right mode and wait for it to stabilize.</p><p><strong>Open source saves the day</strong>: Wolf&#8217;s maintainer added lobbies mode based on real user needs (ours included). Being able to work directly with the developer and contribute back is why we love open source infrastructure.</p><h2>What&#8217;s Next</h2><p>We&#8217;re actively migrating to lobbies mode. Once we fix the input scaling and video corruption bugs, we&#8217;ll have proper multi-user agent support. At that point, you&#8217;ll be able to:</p><ul><li><p>Connect with native Moonlight clients to watch agents work</p></li><li><p>Have multiple people viewing the same agent session</p></li><li><p>Remove all the kickoff session complexity from our codebase</p></li><li><p>Support mobile clients properly with pre-configured resolutions</p></li></ul><p>If you&#8217;re building anything involving desktop streaming, especially for non-gaming use cases, check out <a href="https://github.com/games-on-whales/wolf">Wolf</a>. 
And if you&#8217;re curious about Helix Code or want to try streaming AI agent desktops, join our private beta via our Discord.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://discord.gg/VJftd844GE&quot;,&quot;text&quot;:&quot;Join the Private Beta&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://discord.gg/VJftd844GE"><span>Join the Private Beta</span></a></p><p>Oh, and here&#8217;s proof it can stream 4K video way nicer than RDP or VNC!</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;cfbc708b-89b5-4437-9c0d-4770dbae147d&quot;,&quot;duration&quot;:null}"></div><p></p>]]></content:encoded></item><item><title><![CDATA[Is MCP authentication that complicated?]]></title><description><![CDATA[Let's take a closer look!]]></description><link>https://blog.helix.ml/p/is-mcp-authentication-that-complicated</link><guid isPermaLink="false">https://blog.helix.ml/p/is-mcp-authentication-that-complicated</guid><dc:creator><![CDATA[Chris Sterry]]></dc:creator><pubDate>Sat, 18 Oct 2025 12:57:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!HsL9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc23126f-a964-4cd0-88b5-683d7c1adf21_3745x2905.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>MCP started with a basic implementation using local socket to communicate. Nobody really liked it but then when the time came to add authentication lots of vibe coders started to imagine that it&#8217;s very complicated. 
We even saw these kind of presentations:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HsL9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc23126f-a964-4cd0-88b5-683d7c1adf21_3745x2905.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HsL9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc23126f-a964-4cd0-88b5-683d7c1adf21_3745x2905.jpeg 424w, https://substackcdn.com/image/fetch/$s_!HsL9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc23126f-a964-4cd0-88b5-683d7c1adf21_3745x2905.jpeg 848w, https://substackcdn.com/image/fetch/$s_!HsL9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc23126f-a964-4cd0-88b5-683d7c1adf21_3745x2905.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!HsL9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc23126f-a964-4cd0-88b5-683d7c1adf21_3745x2905.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HsL9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc23126f-a964-4cd0-88b5-683d7c1adf21_3745x2905.jpeg" width="1456" height="1129" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fc23126f-a964-4cd0-88b5-683d7c1adf21_3745x2905.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1129,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1879148,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.helix.ml/i/176078384?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc23126f-a964-4cd0-88b5-683d7c1adf21_3745x2905.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HsL9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc23126f-a964-4cd0-88b5-683d7c1adf21_3745x2905.jpeg 424w, https://substackcdn.com/image/fetch/$s_!HsL9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc23126f-a964-4cd0-88b5-683d7c1adf21_3745x2905.jpeg 848w, https://substackcdn.com/image/fetch/$s_!HsL9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc23126f-a964-4cd0-88b5-683d7c1adf21_3745x2905.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!HsL9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc23126f-a964-4cd0-88b5-683d7c1adf21_3745x2905.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.helix.ml/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading HelixML! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>but we have been doing this forever. 
We have been implementing auth flows with GitHub, Google, etc. for ages and it&#8217;s exactly the same flow. You authenticate the user and this time instead of just using the token to get person&#8217;s Google avatar or their GitHub repos we store the token so we can use it for MCP calls.</p><p>Time to try MCP auth for ourselves.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YTlQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F333d1b76-b5d8-47d9-a48c-039c20371a16_675x499.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YTlQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F333d1b76-b5d8-47d9-a48c-039c20371a16_675x499.jpeg 424w, https://substackcdn.com/image/fetch/$s_!YTlQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F333d1b76-b5d8-47d9-a48c-039c20371a16_675x499.jpeg 848w, https://substackcdn.com/image/fetch/$s_!YTlQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F333d1b76-b5d8-47d9-a48c-039c20371a16_675x499.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!YTlQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F333d1b76-b5d8-47d9-a48c-039c20371a16_675x499.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YTlQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F333d1b76-b5d8-47d9-a48c-039c20371a16_675x499.jpeg" width="675" height="499" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/333d1b76-b5d8-47d9-a48c-039c20371a16_675x499.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:499,&quot;width&quot;:675,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YTlQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F333d1b76-b5d8-47d9-a48c-039c20371a16_675x499.jpeg 424w, https://substackcdn.com/image/fetch/$s_!YTlQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F333d1b76-b5d8-47d9-a48c-039c20371a16_675x499.jpeg 848w, https://substackcdn.com/image/fetch/$s_!YTlQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F333d1b76-b5d8-47d9-a48c-039c20371a16_675x499.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!YTlQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F333d1b76-b5d8-47d9-a48c-039c20371a16_675x499.jpeg 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 
7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>How Helix OAuth works</h2><p>In Helix we first introduced OAuth support to be able to authenticate to third party APIs on behalf of the user. This enables users to automate various tasks that can be done by the agent. 
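</p><p>Under the hood there is no extra magic: the stored OAuth access token is simply replayed as a Bearer credential on the MCP calls made for the user. A minimal sketch of the idea (the helper names and framing are illustrative, not Helix source code):</p><pre><code>import json

def mcp_headers(access_token: str) -> dict:
    # The OAuth access token stored at connection time becomes
    # a standard Bearer credential on every MCP request.
    return {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json",
    }

def mcp_request(method: str, params: dict, request_id: int = 1) -> str:
    # HTTP-based MCP servers speak JSON-RPC 2.0; this builds one request body.
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": method,
        "params": params,
    })</code></pre><p>POSTing something like <code>mcp_request("tools/list", {})</code> with those headers to the server&#8217;s URL is essentially all an authenticated MCP call is.</p><p>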
It&#8217;s a two step process:</p><ol><li><p>Enabling the supported provider in the admin dashboard</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kGBf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54a9a199-0b0f-443d-9ec1-3f8a4b263705_1646x1076.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kGBf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54a9a199-0b0f-443d-9ec1-3f8a4b263705_1646x1076.png 424w, https://substackcdn.com/image/fetch/$s_!kGBf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54a9a199-0b0f-443d-9ec1-3f8a4b263705_1646x1076.png 848w, https://substackcdn.com/image/fetch/$s_!kGBf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54a9a199-0b0f-443d-9ec1-3f8a4b263705_1646x1076.png 1272w, https://substackcdn.com/image/fetch/$s_!kGBf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54a9a199-0b0f-443d-9ec1-3f8a4b263705_1646x1076.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kGBf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54a9a199-0b0f-443d-9ec1-3f8a4b263705_1646x1076.png" width="1456" height="952" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/54a9a199-0b0f-443d-9ec1-3f8a4b263705_1646x1076.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:952,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:164917,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.helix.ml/i/176078384?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54a9a199-0b0f-443d-9ec1-3f8a4b263705_1646x1076.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kGBf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54a9a199-0b0f-443d-9ec1-3f8a4b263705_1646x1076.png 424w, https://substackcdn.com/image/fetch/$s_!kGBf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54a9a199-0b0f-443d-9ec1-3f8a4b263705_1646x1076.png 848w, https://substackcdn.com/image/fetch/$s_!kGBf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54a9a199-0b0f-443d-9ec1-3f8a4b263705_1646x1076.png 1272w, https://substackcdn.com/image/fetch/$s_!kGBf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54a9a199-0b0f-443d-9ec1-3f8a4b263705_1646x1076.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ol start="2"><li><p>Connecting it in the user&#8217;s <a href="https://app.helix.ml/oauth-connections">OAuth connections page</a>:</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1WNY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5aec42a-ad5b-4342-942e-2d123c0a4f27_1869x1195.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1WNY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5aec42a-ad5b-4342-942e-2d123c0a4f27_1869x1195.png 424w, 
https://substackcdn.com/image/fetch/$s_!1WNY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5aec42a-ad5b-4342-942e-2d123c0a4f27_1869x1195.png 848w, https://substackcdn.com/image/fetch/$s_!1WNY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5aec42a-ad5b-4342-942e-2d123c0a4f27_1869x1195.png 1272w, https://substackcdn.com/image/fetch/$s_!1WNY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5aec42a-ad5b-4342-942e-2d123c0a4f27_1869x1195.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1WNY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5aec42a-ad5b-4342-942e-2d123c0a4f27_1869x1195.png" width="1456" height="931" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a5aec42a-ad5b-4342-942e-2d123c0a4f27_1869x1195.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:931,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:178307,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.helix.ml/i/176078384?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5aec42a-ad5b-4342-942e-2d123c0a4f27_1869x1195.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!1WNY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5aec42a-ad5b-4342-942e-2d123c0a4f27_1869x1195.png 424w, https://substackcdn.com/image/fetch/$s_!1WNY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5aec42a-ad5b-4342-942e-2d123c0a4f27_1869x1195.png 848w, https://substackcdn.com/image/fetch/$s_!1WNY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5aec42a-ad5b-4342-942e-2d123c0a4f27_1869x1195.png 1272w, https://substackcdn.com/image/fetch/$s_!1WNY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5aec42a-ad5b-4342-942e-2d123c0a4f27_1869x1195.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Using HubSpot MCP</h2><p>One fun thing that I thought of was trying out <a href="https://developers.hubspot.com/mcp">HubSpot&#8217;s MCP</a>. It allows LLMs to query HubSpot&#8217;s database so you can get information about various deals. You can create a new agent by visiting <a href="https://app.helix.ml/new-agent">https://app.helix.ml/new-agent</a>.</p><p>You can refine the agent&#8217;s system prompt endlessly, but something as simple as this will do the trick:</p><blockquote><p>You are a helpful AI assistant called Helix. Today is {{ .LocalDate }}, local time is {{ .LocalTime }}. You can access HubSpot CRM data through the MCP tools that are provided to you.</p></blockquote><p>Select the gpt-4o-mini model.</p><p>Then, open the Skills tab and add the HubSpot configuration under the MCP tab.
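</p><p>As a rough declarative sketch, the skill boils down to three fields (the YAML below is illustrative shorthand, not the exact Helix agent schema):</p><pre><code>skills:
  mcp:
    - name: hubspot
      url: https://mcp.hubspot.com/
      oauth_provider: hubspot  # must match the provider enabled in the admin dashboard</code></pre><p>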
Details:</p><ul><li><p>name: hubspot</p></li><li><p>MCP server URL: https://mcp.hubspot.com/</p></li><li><p>OAuth Configuration: select the HubSpot provider</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QOCY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850f3663-d146-4ccc-8446-9642ad664399_1955x1138.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QOCY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850f3663-d146-4ccc-8446-9642ad664399_1955x1138.png 424w, https://substackcdn.com/image/fetch/$s_!QOCY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850f3663-d146-4ccc-8446-9642ad664399_1955x1138.png 848w, https://substackcdn.com/image/fetch/$s_!QOCY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850f3663-d146-4ccc-8446-9642ad664399_1955x1138.png 1272w, https://substackcdn.com/image/fetch/$s_!QOCY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850f3663-d146-4ccc-8446-9642ad664399_1955x1138.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QOCY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850f3663-d146-4ccc-8446-9642ad664399_1955x1138.png" width="1456" height="848"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/850f3663-d146-4ccc-8446-9642ad664399_1955x1138.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:848,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:184414,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.helix.ml/i/176078384?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850f3663-d146-4ccc-8446-9642ad664399_1955x1138.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QOCY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850f3663-d146-4ccc-8446-9642ad664399_1955x1138.png 424w, https://substackcdn.com/image/fetch/$s_!QOCY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850f3663-d146-4ccc-8446-9642ad664399_1955x1138.png 848w, https://substackcdn.com/image/fetch/$s_!QOCY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850f3663-d146-4ccc-8446-9642ad664399_1955x1138.png 1272w, https://substackcdn.com/image/fetch/$s_!QOCY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850f3663-d146-4ccc-8446-9642ad664399_1955x1138.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Trying it out</h2><p>We can go to any other tab that has a preview side panel and try it out:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LuuE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea705d72-97ee-474e-91b1-b3923dc4efcd_1678x1082.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LuuE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea705d72-97ee-474e-91b1-b3923dc4efcd_1678x1082.png 424w, 
https://substackcdn.com/image/fetch/$s_!LuuE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea705d72-97ee-474e-91b1-b3923dc4efcd_1678x1082.png 848w, https://substackcdn.com/image/fetch/$s_!LuuE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea705d72-97ee-474e-91b1-b3923dc4efcd_1678x1082.png 1272w, https://substackcdn.com/image/fetch/$s_!LuuE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea705d72-97ee-474e-91b1-b3923dc4efcd_1678x1082.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LuuE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea705d72-97ee-474e-91b1-b3923dc4efcd_1678x1082.png" width="1456" height="939" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ea705d72-97ee-474e-91b1-b3923dc4efcd_1678x1082.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:939,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:484970,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.helix.ml/i/176078384?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea705d72-97ee-474e-91b1-b3923dc4efcd_1678x1082.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!LuuE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea705d72-97ee-474e-91b1-b3923dc4efcd_1678x1082.png 424w, https://substackcdn.com/image/fetch/$s_!LuuE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea705d72-97ee-474e-91b1-b3923dc4efcd_1678x1082.png 848w, https://substackcdn.com/image/fetch/$s_!LuuE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea705d72-97ee-474e-91b1-b3923dc4efcd_1678x1082.png 1272w, https://substackcdn.com/image/fetch/$s_!LuuE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea705d72-97ee-474e-91b1-b3923dc4efcd_1678x1082.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>You can also visit &#8220;Usage&#8221; tab to view how the agent approached the task:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Z24u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8aa7cd21-f4c5-4ecb-9d76-6cdc8b7740ee_1464x853.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Z24u!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8aa7cd21-f4c5-4ecb-9d76-6cdc8b7740ee_1464x853.png 424w, https://substackcdn.com/image/fetch/$s_!Z24u!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8aa7cd21-f4c5-4ecb-9d76-6cdc8b7740ee_1464x853.png 848w, https://substackcdn.com/image/fetch/$s_!Z24u!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8aa7cd21-f4c5-4ecb-9d76-6cdc8b7740ee_1464x853.png 1272w, https://substackcdn.com/image/fetch/$s_!Z24u!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8aa7cd21-f4c5-4ecb-9d76-6cdc8b7740ee_1464x853.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Z24u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8aa7cd21-f4c5-4ecb-9d76-6cdc8b7740ee_1464x853.png" width="1456" height="848" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8aa7cd21-f4c5-4ecb-9d76-6cdc8b7740ee_1464x853.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:848,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:149275,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.helix.ml/i/176078384?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8aa7cd21-f4c5-4ecb-9d76-6cdc8b7740ee_1464x853.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Z24u!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8aa7cd21-f4c5-4ecb-9d76-6cdc8b7740ee_1464x853.png 424w, https://substackcdn.com/image/fetch/$s_!Z24u!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8aa7cd21-f4c5-4ecb-9d76-6cdc8b7740ee_1464x853.png 848w, https://substackcdn.com/image/fetch/$s_!Z24u!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8aa7cd21-f4c5-4ecb-9d76-6cdc8b7740ee_1464x853.png 1272w, https://substackcdn.com/image/fetch/$s_!Z24u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8aa7cd21-f4c5-4ecb-9d76-6cdc8b7740ee_1464x853.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This tab is instrumental in building reliable agents. 
You can view all requests, responses, how long they took and how many tokens were consumed.</p><h2>Next steps:</h2><p>You can combine this with Helix Tasks that can run agents on a schedule:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zYPu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff18e1afb-5b5c-4f0f-96e7-521240b0dd24_1178x957.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zYPu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff18e1afb-5b5c-4f0f-96e7-521240b0dd24_1178x957.png 424w, https://substackcdn.com/image/fetch/$s_!zYPu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff18e1afb-5b5c-4f0f-96e7-521240b0dd24_1178x957.png 848w, https://substackcdn.com/image/fetch/$s_!zYPu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff18e1afb-5b5c-4f0f-96e7-521240b0dd24_1178x957.png 1272w, https://substackcdn.com/image/fetch/$s_!zYPu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff18e1afb-5b5c-4f0f-96e7-521240b0dd24_1178x957.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zYPu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff18e1afb-5b5c-4f0f-96e7-521240b0dd24_1178x957.png" width="1178" height="957" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f18e1afb-5b5c-4f0f-96e7-521240b0dd24_1178x957.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:957,&quot;width&quot;:1178,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:90361,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.helix.ml/i/176078384?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff18e1afb-5b5c-4f0f-96e7-521240b0dd24_1178x957.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zYPu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff18e1afb-5b5c-4f0f-96e7-521240b0dd24_1178x957.png 424w, https://substackcdn.com/image/fetch/$s_!zYPu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff18e1afb-5b5c-4f0f-96e7-521240b0dd24_1178x957.png 848w, https://substackcdn.com/image/fetch/$s_!zYPu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff18e1afb-5b5c-4f0f-96e7-521240b0dd24_1178x957.png 1272w, https://substackcdn.com/image/fetch/$s_!zYPu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff18e1afb-5b5c-4f0f-96e7-521240b0dd24_1178x957.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Also, feel free to iterate on the system prompt to improve the report structure. </p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.helix.ml/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading HelixML! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[GPU-Accelerated AI Agent Sandboxes: Rethinking How We Interact with Coding Agents]]></title><description><![CDATA[I got this working in a coffee shop a few hours ago, and I&#8217;m genuinely excited about it.]]></description><link>https://blog.helix.ml/p/gpu-accelerated-ai-agent-sandboxes</link><guid isPermaLink="false">https://blog.helix.ml/p/gpu-accelerated-ai-agent-sandboxes</guid><dc:creator><![CDATA[Chris Sterry]]></dc:creator><pubDate>Wed, 15 Oct 2025 16:02:47 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/wDFeCGwD_R0" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I got this working in a coffee shop a few hours ago, and I&#8217;m genuinely excited about it. Not because it&#8217;s fancy new tech for the sake of it, but because it solves some real pain points I&#8217;ve been hitting with AI coding agents.</p><p>Let me show you what I mean.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.helix.ml/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading HelixML! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2><strong>The Problem: Agents Need Better Infrastructure</strong></h2><p>Here&#8217;s where we are with AI coding in 2025: The LLMs themselves are plateauing. We&#8217;re not getting exponential intelligence gains anymore - we&#8217;re on more of an S-curve where things went up fast and now they&#8217;re leveling off. GPT-5 was... fine. Claude 4.5 is quite good. But they&#8217;re not going to magically solve all our problems.</p><p>This matters because current coding agents still make plenty of mistakes. And when you combine that with how most agents are architected - typically as JavaScript/TypeScript applications running on your laptop - you hit some fundamental limitations:</p><p><strong>Performance issues:</strong> My main dev machine at home is a 16-core CPU from 2018. It was state of the art back then. Cursor is basically unusable on it. Even Claude Code starts grinding to a halt when you have lots of threads or messages. And I&#8217;m not running some ancient potato - this is a machine with plenty of cores.</p><p><strong>Limited workflows:</strong> Background agents exist, but they&#8217;re either clunky separate UIs (looking at you, Cursor&#8217;s rushed implementation) or they require your laptop to stay open and connected.</p><p><strong>No fleet management:</strong> What if you want to manage 5 agents working on different tasks simultaneously? What if you want a 30,000-foot dashboard view of what your agents are doing?</p><p>The core insight here is that <strong>agents should run on servers, not laptops</strong>. 
When your agent is a long-running server process, you can close your laptop, get on a train with dodgy internet, and your agent keeps working. You can kick off background tasks from Slack. You can manage fleets of agents.</p><p>But how do you make that feel as smooth as a local IDE?</p><h2><strong>Enter: GPU-Accelerated Agent Sandboxes</strong></h2><p>Here&#8217;s what we built: Each agent gets its own dedicated desktop environment running on a GPU. Not a VNC session that feels like molasses. An actual GPU-accelerated Linux desktop that runs at 120fps and responds instantly to keystrokes.</p><p>The architecture looks like this:</p><ol><li><p><strong>Helix manages the control plane</strong> - You interact with agents through the Helix UI, which handles orchestration, knowledge sources, and conversation history</p></li><li><p><strong>Each agent spins up a containerized desktop</strong> - When you start a coding task, we launch a dedicated environment with Zed (the Rust-based IDE) and your choice of agent (Claude Code, Gemini CLI, or Qwen Code)</p></li><li><p><strong>Moonlight protocol for streaming</strong> - We expose the desktop via Moonlight, which the gaming community built for streaming games from home rigs to phones over 5G. Turns out it works great for streaming IDEs too.</p></li></ol><p>The result? You can work with your agent in the browser, getting full GPU-accelerated rendering. Or you can use the Moonlight client on your phone, tablet, or laptop and get the same smooth experience. The agent keeps running on the server whether you&#8217;re connected or not.</p><h2><strong>Why This Architecture Matters</strong></h2><p><strong>1. It works with any agent, any LLM</strong></p><p>The Zed team created this protocol called ACP (Agent Client Protocol) that standardizes how agents talk to IDEs.
This means we can plug in:</p><ul><li><p>Claude Code (running Anthropic&#8217;s models)</p></li><li><p>Gemini CLI (running Google&#8217;s models)</p></li><li><p>Qwen Code (fully open source, runs entirely on your infrastructure)</p></li></ul><p>We&#8217;re not betting on one agent framework or trying to build our own. We&#8217;re adopting the best tools the community builds and making them work together.</p><p><strong>2. Full context for agents</strong></p><p>When you configure knowledge sources, upload PDFs, integrate with Confluence or Jira, or add MCP servers - all of that gets mirrored into the agent&#8217;s environment. Your agent has the same context you would, but it&#8217;s running in a sandbox.</p><p><strong>3. RAG over your entire team&#8217;s work</strong></p><p>Here&#8217;s where it gets interesting: All conversation history from every agent flows back through Helix. That means you can RAG over your team&#8217;s coding sessions. Every time someone&#8217;s agent solves a problem, that solution becomes searchable for everyone else. It&#8217;s like having your whole team&#8217;s problem-solving experience in a searchable database.</p><p><strong>4. Spec coding by default</strong></p><p>I&#8217;m a big believer in spec coding as the antidote to &#8220;vibe coding.&#8221; The idea is simple: Instead of giving your agent vague instructions like &#8220;add OAuth support,&#8221; you:</p><ul><li><p>Have the agent analyze your codebase and generate a design document</p></li><li><p>Review the spec as a human (catch the stupid ideas before any code is written)</p></li><li><p>Only then implement the spec</p></li></ul><p>We&#8217;re building spec workflows directly into the infrastructure, including a Kanban board for managing agent tasks. Not for teams of humans - for fleets of agents.</p><h2><strong>The Technical Details (For Those Who Care)</strong></h2><p>The gaming community already solved most of the hard problems here. 
There&#8217;s this project called Games on Whales (whales = Docker containers) that lets you run GPU-accelerated gaming in containers using Wayland.</p><p>We&#8217;re building on top of that foundation:</p><ul><li><p><strong>Wayland desktop</strong>: Only uses a few MB of GPU memory, so you can run dozens of these on a single GPU</p></li><li><p><strong>Moonlight streaming</strong>: Battle-tested by gamers streaming over 5G networks</p></li><li><p><strong>Container isolation</strong>: Each agent gets its own filesystem, preventing agents from stepping on each other&#8217;s toes</p></li><li><p><strong>Zed for the IDE</strong>: Written entirely in Rust with a custom UI library that renders directly to the GPU. It&#8217;s fast. Like, actually fast - not &#8220;fast for an Electron app.&#8221;</p></li></ul><p>The beauty is that these don&#8217;t need fancy GPUs like LLMs do. You can run this on an old laptop with Intel integrated graphics and it works fine. For a production deployment, you can fit ~100 of these instances on a single 16GB GPU.</p><h2><strong>What This Enables</strong></h2><p>When agents only need your attention twice an hour instead of constantly, you can have a fundamentally different interaction mode:</p><ul><li><p><strong>Ambient computing</strong>: Get a WhatsApp message from your agent when it needs input, respond with a voice note</p></li><li><p><strong>Fleet management</strong>: See all your active agents working on different tasks, with visual thumbnails of what they&#8217;re doing</p></li><li><p><strong>Long-running personal environments</strong>: Not just task-based agents, but your daily driver development environment that happens to run in the cloud with GPU acceleration</p></li></ul><p>And here&#8217;s the part that gets me most excited: We can use this ourselves to make Helix better. The snake eating its own tail. 
Our development team using the product we&#8217;re building, using it to make itself better, faster and faster.</p><h2><strong>The Demo</strong></h2><div id="youtube2-wDFeCGwD_R0" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;wDFeCGwD_R0&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/wDFeCGwD_R0?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>In the video above, you can see:</p><ul><li><p>Spinning up an agent with dedicated desktop environment</p></li><li><p>The Moonlight connection (complete with PIN for security)</p></li><li><p>Claude Code 4.5 building a to-do list app in real-time</p></li><li><p>Updating the branding mid-stream</p></li><li><p>Smooth, GPU-accelerated UI throughout</p></li></ul><p>The agent has access to a full browser (Firefox), can run commands, and gets all the knowledge sources we configured in Helix.</p><h2><strong>Now Open for Private Beta</strong></h2><p>If you want early access:</p><ol><li><p>Join our Discord community and request an invite to be among the first to experience the future of software development.<br></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://discord.gg/VJftd844GE&quot;,&quot;text&quot;:&quot;Join the Private Beta&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://discord.gg/VJftd844GE"><span>Join the Private Beta</span></a></p></li><li><p><strong>Connect with me on LinkedIn</strong> - <a href="https://linkedin.com/in/luke-marsden-71b3789/">linkedin.com/in/luke-marsden-71b3789</a></p></li><li><p><strong>Try Helix</strong> - Even without the agent sandboxes, Helix is a complete private GenAI stack you 
can run on your infrastructure. Check it out at<a href="https://helix.ml"> helix.ml</a></p></li></ol><p>We&#8217;re especially interested in feedback from teams that:</p><ul><li><p>Run their own GPU infrastructure</p></li><li><p>Need to keep code and data on-prem</p></li><li><p>Want to manage fleets of agents working on multiple tasks</p></li><li><p>Are frustrated with current agent performance</p></li></ul><p>The gaming community figured out how to stream Call of Duty to a phone over 5G. Turns out the same tech makes coding agents feel smooth and responsive. Who knew?</p><div><hr></div><p><em>P.S. - If you&#8217;re wondering about the project name: My co-founder Phil called this a &#8220;massively abstracted distraction&#8221; when I first pitched it, hence MAD. We started by calling it the Helix Agentic Development Environment System - HADES. The god of the underworld is also the god of creating wealth from the earth, which feels appropriate for a bootstrapped company building infrastructure. But every time I tell people about it they say &#8220;isn&#8217;t that hell?&#8221; and I have to explain no, everyone goes to the underworld, but I feel like if you&#8217;re having that conversation then you&#8217;ve already lost, so we decided to be boring and call it Helix Code ;-)</em></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.helix.ml/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading HelixML! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Kodit 0.5: All Things Git]]></title><description><![CDATA[Kodit has a new domain model which unlocks future enrichments]]></description><link>https://blog.helix.ml/p/kodit-05-all-things-git</link><guid isPermaLink="false">https://blog.helix.ml/p/kodit-05-all-things-git</guid><dc:creator><![CDATA[Phil Winder]]></dc:creator><pubDate>Fri, 26 Sep 2025 11:42:20 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/02fe0156-713a-436c-8f7d-5c6055e41577_320x213.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Yesterday I released <a href="https://github.com/helixml/kodit/releases/tag/0.5.0">Kodit 0.5</a>. As these things often do, this release started off fairly benign. The original intention was to implement features that allowed Kodit to scale to index greater numbers of repositories. However, after attempting to tackle incremental indexing, I quickly realised that we should be mimicking the Git domain more than we currently were.</p><p>In code at 0.4 and before, everything was based upon a directory and files within that directory. But after considering that I wanted to index different versions of a repository (like a different tag or a different branch), that quickly became unsustainable.
So I took the decision to migrate everything to a git-based domain model and take advantage of the structure.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.helix.ml/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading HelixML! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2><strong>Breaking Changes Ahoy</strong></h2><p>Because I&#8217;ve changed the domain model, the existing database schema no longer matches it. I made the decision to restructure the database, which means that any old data you have in there will get deleted.</p><p>I also took the opportunity to remove the auto-indexing command. That was introduced as a stopgap before we had API-based indexing. Since we have API indexing now, this was no longer used, so I removed it.</p><h2>New Features</h2><p>With that out of the way, we can now talk about some exciting new features:</p><ol><li><p>The change to the Git domain model. This means that Kodit now has an internal representation of commits, tags, files, and everything else. This not only helps with incremental indexing, meaning you won&#8217;t have to reprocess commits; it also means that new commits where nothing much has changed will hardly require any processing at all. This also unlocks the next round of future enhancements we have planned.</p></li><li><p>Next on the list is LiteLLM integration.
The reason for this was that I wanted to incorporate different providers for enrichment and embedding. The simplest way to do that was to use LiteLLM, which supports more than a hundred external embedding providers. I&#8217;ve tested it with Helix, Ollama, vLLM, Azure, and OpenAI, but it should work with any provider. </p></li><li><p>In order to handle increased demand, I&#8217;ve completely refactored the indexing pipeline. Now, we have a queue-based system that also has status endpoints so that you can review the status of an indexing operation without having to look at the logs so much.</p></li><li><p>And finally, there&#8217;s been a bit of refactoring and improvement to the database reads and writes. I found that once we had large numbers of commits, the database read performance was quite slow because of the inefficiency in the way that things were structured. This has improved things, but there&#8217;s still more to do. </p></li></ol><h2>What&#8217;s Next?</h2><p>Now that we&#8217;ve got a good domain model, we have big plans for our next steps. First on the list is a wide range of new enrichments. These new enrichments are based around three key repository use cases: using, developing and reading.</p><p>Users of a repository need to know things like the public API and the examples that they can copy from. Developers of a repository need to know the system architecture, the database schema, the layers, and the ways of working. But the readers of a repository want to know the history, the status, a 10km view of the repository as a whole. I&#8217;m not entirely sure how this will be exposed to the MCP at this point in time, but I know that it is useful information. </p><p>The next step after that is to build a user interface to allow users to view all of this information in a pretty, user-friendly way.
People shouldn&#8217;t have to browse the API docs to get access to this information.</p><p>And finally, it&#8217;s still on my mind that I want to index more things. I want to index documentation, I want to index API documentation, I want to index all the things. At the moment, Kodit still only indexes code. And I&#8217;m confident that there is more to do in the front-end world as well.</p><p>That&#8217;s all for now, but of course if you have any ideas or any requests for new features, then please visit the <a href="https://github.com/helixml/kodit">Kodit repository</a>.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.helix.ml/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading HelixML! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Kodit 0.4: Hosting a SaaS, Smarter APIs, and Scaling the Future]]></title><description><![CDATA[Check out the public SaaS instance!
It's the future!]]></description><link>https://blog.helix.ml/p/kodit-04-hosting-a-saas-smarter-apis</link><guid isPermaLink="false">https://blog.helix.ml/p/kodit-04-hosting-a-saas-smarter-apis</guid><dc:creator><![CDATA[Phil Winder]]></dc:creator><pubDate>Sat, 09 Aug 2025 11:55:42 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9e7a17c3-7c34-4cce-bd7d-b12abe15b474_1000x774.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>What started as a side-note turned into one of the biggest leaps forward yet.</p><p>My vision for <a href="https://docs.helix.ml/kodit/">Kodit</a> was to help <strong>AI coding assistants to search for and provide relevant</strong> <strong>context</strong> from private repositories.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.helix.ml/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading HelixML! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>But while marvelling at the traction <a href="https://context7.com">Context7</a> has received on Reddit, I realised that there&#8217;s much more value up for grabs by indexing <em>public</em> repositories as well. 
So I planned a small feature in 0.3 to launch a hosted Kodit instance that <a href="https://docs.helixml.tech/kodit/reference/hosted-kodit/">users can connect to</a> <strong>without installing any MCP servers</strong>.</p><p>It turned out that the act of launching a public service highlighted a variety of scalability challenges in the current implementation. This is fantastic in that it helped me harden Kodit, but it meant that it&#8217;s been nearly a month since the last release!</p><p>But this does mean that 0.4 is chock-full of juicy features, so let me dive in&#8230;</p><h2>Highlights</h2><p>Let&#8217;s cut to the chase. Here&#8217;s an at-a-glance view of what you should take notice of in Kodit 0.4:</p><ul><li><p><a href="https://docs.helixml.tech/kodit/reference/hosted-kodit/">Kodit SaaS</a> - Pull in context from public repositories without installing anything</p></li><li><p>Incremental Indexing - Only changed files are reindexed</p></li><li><p><a href="https://docs.helixml.tech/kodit/reference/api/">Management API</a> - Full REST control over a Kodit server</p></li><li><p><a href="https://docs.helixml.tech/kodit/reference/mcp/">Streaming HTTP Support</a> - SSE has been deprecated by MCP</p></li><li><p>Program Slicing - Slightly more sophisticated way of indexing codebases</p></li><li><p><a href="https://docs.helixml.tech/kodit/reference/deployment/">Cron-based sync schedule &amp; CLI API integration</a></p></li></ul><h2><strong>Getting Started with Kodit SaaS &amp; HTTP Streaming</strong></h2><p>If you want to see Kodit 0.4 in action, just try it. The <a href="https://docs.helixml.tech/kodit/reference/hosted-kodit/">hosted version</a> makes it so simple to try that you almost don&#8217;t need any instructions to do it. But just in case, here&#8217;s a quick demo.</p><p>Browse to the <a href="https://kodit.helix.ml/docs">API docs</a> and try using the <a href="https://kodit.helix.ml/docs#/search/search_snippets_api_v1_search_post">/search API</a>.
Click on the &#8220;Try it out&#8221; button and paste something like this:</p><pre><code>{
  "data": {
    "type": "search",
    "attributes": {
      "text": "a mapper that maps an index domain object to a database object"
    }
  }
}</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!m5ds!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13000a7d-66ff-4b04-a82a-fbfe5d88b3e7_1473x846.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!m5ds!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13000a7d-66ff-4b04-a82a-fbfe5d88b3e7_1473x846.png 424w, https://substackcdn.com/image/fetch/$s_!m5ds!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13000a7d-66ff-4b04-a82a-fbfe5d88b3e7_1473x846.png 848w, https://substackcdn.com/image/fetch/$s_!m5ds!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13000a7d-66ff-4b04-a82a-fbfe5d88b3e7_1473x846.png 1272w, https://substackcdn.com/image/fetch/$s_!m5ds!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13000a7d-66ff-4b04-a82a-fbfe5d88b3e7_1473x846.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!m5ds!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13000a7d-66ff-4b04-a82a-fbfe5d88b3e7_1473x846.png" width="1456" height="836" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/13000a7d-66ff-4b04-a82a-fbfe5d88b3e7_1473x846.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:836,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:245517,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.helix.ml/i/170520624?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13000a7d-66ff-4b04-a82a-fbfe5d88b3e7_1473x846.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!m5ds!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13000a7d-66ff-4b04-a82a-fbfe5d88b3e7_1473x846.png 424w, https://substackcdn.com/image/fetch/$s_!m5ds!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13000a7d-66ff-4b04-a82a-fbfe5d88b3e7_1473x846.png 848w, https://substackcdn.com/image/fetch/$s_!m5ds!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13000a7d-66ff-4b04-a82a-fbfe5d88b3e7_1473x846.png 1272w, https://substackcdn.com/image/fetch/$s_!m5ds!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13000a7d-66ff-4b04-a82a-fbfe5d88b3e7_1473x846.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The results will list the content of the snippet, the relevancy score, a summary of the snippet (which is what you just searched) and some metadata related to where the file can be found. Take a look at the <a href="https://docs.helixml.tech/kodit/reference/api/">API docs</a> to learn more about how you can use the rest of the API.</p><p>Or you can <strong>add Kodit to your favourite AI coding assistant</strong> by connecting to the public MCP server: <code>https://kodit.helix.ml/mcp</code></p><p>For example, in Claude Code you can execute:</p><pre><code>claude mcp add --transport http kodit https://kodit.helix.ml/mcp</code></pre><p>Or in Cline, add the following settings:</p><pre><code>{
  "mcpServers": {
    "kodit": {
      "autoApprove": [],
      "disabled": false,
      "timeout": 60,
      "type": "streamableHttp",
      "url": "https://kodit.helix.ml/mcp"
    }
  }
}</code></pre><p><a href="https://docs.helixml.tech/kodit/reference/mcp/">Instructions for other AI coding assistants</a> are available in the documentation.</p><h2>Management API and Enterprise Features</h2><p>Kodit was initially designed to <strong>index the private repositories that exist throughout larger organisations</strong>. And our design partners suggested a variety of new features that would make it easier to operate Kodit at scale.</p><p>The new <a href="https://docs.helixml.tech/kodit/reference/api/">REST API</a> allows you to remotely manage a Kodit server from afar. Simple key-based authentication adds a rudimentary access control mechanism.</p><p>Plugging the CLI into the API allows users to continue to have the same <a href="https://docs.helixml.tech/kodit/reference/deployment/#remote-cli-access">CLI experience even when working with a remote instance.</a></p><p>And a new <a href="https://docs.helixml.tech/kodit/reference/sync/">cron-based scheduler </a>allows Kodit servers to keep indexes up-to-date.</p><h2>Core Features</h2><p>Slightly less exciting, but fundamental to the value of Kodit is the algorithm used to index repositories. Previously, Kodit used a simple query-based selection algorithm that basically just pulled out all methods.</p><p>The new program slicer takes this a step further and attempts to identify all dependencies of a method. In results you will see relevant imports, dependent functions and even examples of usage. 
It&#8217;s not perfect and quality might differ between languages because of different language implementations, but it&#8217;s a lot better than before.</p><p>Talking of languages, Kodit now officially supports the following:</p><ul><li><p>python</p></li><li><p>java</p></li><li><p>c</p></li><li><p>c++</p></li><li><p>rust</p></li><li><p>go</p></li><li><p>javascript</p></li><li><p>c#</p></li><li><p>html</p></li><li><p>css</p></li></ul><p>html and css are particularly interesting because they are obviously markup and design languages, not procedural ones. Defining exactly what constitutes a snippet in these languages is hard and I didn&#8217;t spend too much time on it. So if you have any suggestions I&#8217;d <a href="https://github.com/helixml/kodit/discussions">love to hear from you</a>.</p><h2>Initial Helix Integration</h2><p>The deployment of the <strong>new Kodit SaaS</strong> takes one step towards becoming a part of the Helix family. Since you&#8217;re reading this on the Helix blog, you probably already know about Helix.</p><p>The eventual goal is to have much tighter integration with Helix, but the first and most obvious integration point is to leverage Helix&#8217;s on-premise private architecture to provide embeddings and enrichment.</p><p>So you&#8217;ll be glad to hear that everything that exists within the Kodit database is powered by Helix. <strong>No information is shared with or delegated to third-party AI services. It&#8217;s all running on our own A100s.</strong></p><p>I did, however, start with Kodit&#8217;s parallelism set too high and temporarily both saturated the Helix SaaS and locked myself out due to violating rate limits.
To fix this I implemented a new dedicated, socket-based API to communicate directly with Helix and added greater configuration over the parallelism to give standard Helix SaaS users room to breathe.</p><h2><strong>Closing</strong></h2><p>Kodit 0.4 is the strongest yet, but it still hasn&#8217;t reached a scale that I&#8217;m happy with. To be truly valuable to public users, Kodit must index at least the top 1000 repositories on Github. This is at least 2 orders of magnitude greater than what exists today and I have no doubt there will be challenges achieving that scale. <strong>Kodit 0.5 will concentrate on enabling Github-scale.</strong></p><p>Together with Helix&#8217;s design partners, we&#8217;re also thinking towards Kodit 0.6, where <strong>we want to expose important information about fixes and features by indexing issues and pull requests</strong>. I also want to <strong>index documentation</strong> too, to unlock the indexing of private enterprise documentation. This is more challenging due to the different systems involved (Github, Gitlab, Azure DevOps, Jira, and so on). But I feel like it&#8217;s achievable.</p><h3>Help Me Help You</h3><p>Any open source project lives and dies through support. I&#8217;d really appreciate it if you <strong>give Kodit a try</strong> and let me know about your experience. Kodit&#8217;s not quite ready for prime-time public adoption yet, purely because of the lack of scale, but it will come soon.
In the meantime, now is the right time to address any issues.</p><p>Also, if you have any <strong>burning AI coding needs</strong> that are blocking you from doing what you want to do, that&#8217;s just the kind of feedback that would help shape Kodit.</p><p>You can reach out to me at <a href="mailto:phil@helix.ml">phil@helix.ml</a> or <a href="https://github.com/helixml/kodit/discussions">start a discussion</a>.</p><h2>Further Reading</h2><ul><li><p><a href="https://docs.helixml.tech/kodit/">Kodit Docs</a></p></li><li><p><a href="https://github.com/helixml/kodit">Kodit Repository</a></p></li><li><p><a href="https://helix.ml/">Helix 2.0</a></p></li></ul><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.helix.ml/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading HelixML! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Bootstrapped Private GenAI Startup Hits $1M Annual Revenue, Launches Helix 2.0]]></title><description><![CDATA[The people behind the story, how agentic AI is changing and why we don't want a sales call with you]]></description><link>https://blog.helix.ml/p/bootstrapped-private-genai-startup</link><guid isPermaLink="false">https://blog.helix.ml/p/bootstrapped-private-genai-startup</guid><dc:creator><![CDATA[Luke Marsden]]></dc:creator><pubDate>Thu, 31 Jul 2025 13:40:42 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!RxcO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6b8cf4b-c89e-4236-81d0-aa34a318382a.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RxcO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6b8cf4b-c89e-4236-81d0-aa34a318382a.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RxcO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6b8cf4b-c89e-4236-81d0-aa34a318382a.heic 424w, 
https://substackcdn.com/image/fetch/$s_!RxcO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6b8cf4b-c89e-4236-81d0-aa34a318382a.heic 848w, https://substackcdn.com/image/fetch/$s_!RxcO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6b8cf4b-c89e-4236-81d0-aa34a318382a.heic 1272w, https://substackcdn.com/image/fetch/$s_!RxcO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6b8cf4b-c89e-4236-81d0-aa34a318382a.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RxcO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6b8cf4b-c89e-4236-81d0-aa34a318382a.heic" width="375" height="281.25" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b6b8cf4b-c89e-4236-81d0-aa34a318382a.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1092,&quot;width&quot;:1456,&quot;resizeWidth&quot;:375,&quot;bytes&quot;:1577671,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.helix.ml/i/169743593?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6b8cf4b-c89e-4236-81d0-aa34a318382a.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RxcO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6b8cf4b-c89e-4236-81d0-aa34a318382a.heic 424w, 
https://substackcdn.com/image/fetch/$s_!RxcO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6b8cf4b-c89e-4236-81d0-aa34a318382a.heic 848w, https://substackcdn.com/image/fetch/$s_!RxcO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6b8cf4b-c89e-4236-81d0-aa34a318382a.heic 1272w, https://substackcdn.com/image/fetch/$s_!RxcO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6b8cf4b-c89e-4236-81d0-aa34a318382a.heic 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" 
y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Luke, Phil and Matt (advisor) at a conference in London</figcaption></figure></div><p>It was late 2023 when we decided to do startup #3. This time, having experienced first-hand the impact that the &#8220;ChatGPT moment&#8221; had on my consulting clients, I saw the opportunity for a private GenAI stack that you could run on your own infrastructure.</p><p>I was fortunate to have saved enough from consulting so that I was able to stop earning for a year and dive headfirst into building product again. Phil and Chris did the same, and we committed to bootstrapping this thing. (If you haven&#8217;t read &#8220;<a href="https://medium.com/signal-v-noise/reconsider-41adf356857f">Reconsider</a>&#8221;, I recommend it.)</p><p>Since then, we&#8217;ve had our share of ups and downs &#8211;&nbsp;but having intentionally not taken any external funding, I&#8217;m pleased &#8211;&nbsp;and significantly relieved &#8211; to be able to announce that we&#8217;ve just hit the milestone of $1M in annual enterprise revenue.</p><p>I can&#8217;t speak publicly about who the customers are yet, but suffice to say we are lucky to have some of the most innovative hedge funds, investment managers, and service providers in the world working with us.</p><p>Why are they investing in <em><strong>AI Agents on a Private GenAI Stack</strong></em>?</p><h1>Because AI Agents will (actually) change everything</h1><p>In the early days of Helix, I had a healthy degree of skepticism that AI might be more hype than substance. But it has been this year, 2025, that has convinced me that we&#8217;re on a trajectory for AI Agents to be a true new industrial revolution. What convinced me? It was using the agents in my own work. If you look at Cursor with Claude 4, it&#8217;s made me 30-40x more productive than I used to be armed merely with <code>vim</code>. 
If you look at the hours&#8217; worth of research you can do with Perplexity in minutes, I&#8217;m able to make informed business and technical decisions in a fraction of the time it took before.</p><p>So is that it? We all hand our data over to these AI SaaS companies, and they take the whole pie? Wait up &#8211;&nbsp;there&#8217;s a problem lurking in big businesses. With increasing geopolitical strife and threats from security breaches, enterprises are ever more sensitive to where they send their data, and how they protect their core IP. Turns out, lots of companies are not comfortable sending a lot of their data to OpenAI, or even Microsoft. Private cloud is back. So where&#8217;s the private cloud GenAI stack?</p><h1>Enter Helix 2.0: The Fastest Path to AI Agents on a Private GenAI Stack</h1><p>You can read more about it on our <a href="https://helix.ml">shiny new website</a>, but the goal here is to be the MacBook Pro of GenAI stacks.</p><p>Ever tried using Linux on the desktop? Spent time recompiling your kernel so you can get your webcam working? Wonder why people buy Apple products even though they&#8217;re expensive? It&#8217;s because they <em>just work</em>, from soup to nuts. 
That&#8217;s the goal with our GenAI stack for running on your own infrastructure.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!w10G!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb710de85-c10c-41f8-8473-c27db11f7ccc_1448x1320.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!w10G!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb710de85-c10c-41f8-8473-c27db11f7ccc_1448x1320.png 424w, https://substackcdn.com/image/fetch/$s_!w10G!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb710de85-c10c-41f8-8473-c27db11f7ccc_1448x1320.png 848w, https://substackcdn.com/image/fetch/$s_!w10G!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb710de85-c10c-41f8-8473-c27db11f7ccc_1448x1320.png 1272w, https://substackcdn.com/image/fetch/$s_!w10G!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb710de85-c10c-41f8-8473-c27db11f7ccc_1448x1320.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!w10G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb710de85-c10c-41f8-8473-c27db11f7ccc_1448x1320.png" width="491" height="447.5966850828729" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b710de85-c10c-41f8-8473-c27db11f7ccc_1448x1320.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1320,&quot;width&quot;:1448,&quot;resizeWidth&quot;:491,&quot;bytes&quot;:1373984,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.helix.ml/i/169743593?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb710de85-c10c-41f8-8473-c27db11f7ccc_1448x1320.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!w10G!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb710de85-c10c-41f8-8473-c27db11f7ccc_1448x1320.png 424w, https://substackcdn.com/image/fetch/$s_!w10G!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb710de85-c10c-41f8-8473-c27db11f7ccc_1448x1320.png 848w, https://substackcdn.com/image/fetch/$s_!w10G!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb710de85-c10c-41f8-8473-c27db11f7ccc_1448x1320.png 1272w, https://substackcdn.com/image/fetch/$s_!w10G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb710de85-c10c-41f8-8473-c27db11f7ccc_1448x1320.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We give you everything you need to run AI Agents, connected to your data and business systems, either on open source models like Qwen3 running on your own GPUs or proprietary LLMs that you can provision in your VPC, like Claude 4.</p><p>Here&#8217;s a demo of our latest stuff, check it out:</p><div id="youtube2-N9Fcas3w4xw" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;N9Fcas3w4xw&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/N9Fcas3w4xw?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h1>Wait, you don&#8217;t want a sales call with 
me?</h1><p>So sometime last year I was talking with my friend <a href="https://www.linkedin.com/in/merrells/">John Merrells</a> and he made the excellent point that most people under 40 don&#8217;t want a sales call to be able to buy something. So we put tons of effort into making Helix easy to self-serve, with transparent pricing.</p><p>It&#8217;s not that we don&#8217;t want to talk to you &#8211;&nbsp;we talk to our customers all the time, pair with them, fly to their offices to run workshops and co-develop agents with them &#8211; but instead of being forced to sit through a slide deck, you can <a href="https://www.helix.ml/docs">deploy Helix yourself</a> in minutes, in Docker, Kubernetes or any major cloud. If you have a question then just hit the chat box in the bottom right of the website (it will connect you to the real team, not an AI). You can evaluate Helix yourself and provision a license through our new <a href="https://helix.ml/home">self-service Launchpad system</a> where you can deploy:</p><ul><li><p><strong>Agents onto Helix Cloud</strong>, our SaaS demo environment - where you get a regular user account</p></li><li><p><strong>Trial VMs</strong> - which give you root access to a VM and full admin access to Helix, although they are configured to talk to external inference providers so are not fully private. 
We spin these up for you instantly by keeping a warm pool of them ready to go</p></li><li><p><strong>GPU Instances</strong> - we offer 2x A100 80GB GPU nodes at just $5/hour through our partner <a href="https://www.civo.com/newsroom/helixml-launches-helix-2-0-with-civo-as-gpu-provider">Civo</a>, who have been awesome to work with</p></li></ul><p>Everything you can do with Helix, from multi-turn agents integrated with business apps to vision RAG over complex document layouts, can run on a single A100 GPU, fully private.</p><h1>Cheers to $1M revenue!</h1><p>So cheers, here&#8217;s to the team who made this happen (including our secret co-founder), folks joining and helping out, the customers who put their trust in us, and all of you crazy AI labs out there building SOTA LLMs &#8211;&nbsp;thank you!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GND5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F227b81d8-9cc5-438e-b320-72d3c0affaf8_4032x3024.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GND5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F227b81d8-9cc5-438e-b320-72d3c0affaf8_4032x3024.heic 424w, https://substackcdn.com/image/fetch/$s_!GND5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F227b81d8-9cc5-438e-b320-72d3c0affaf8_4032x3024.heic 848w, https://substackcdn.com/image/fetch/$s_!GND5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F227b81d8-9cc5-438e-b320-72d3c0affaf8_4032x3024.heic 1272w, 
https://substackcdn.com/image/fetch/$s_!GND5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F227b81d8-9cc5-438e-b320-72d3c0affaf8_4032x3024.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GND5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F227b81d8-9cc5-438e-b320-72d3c0affaf8_4032x3024.heic" width="376" height="282" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/227b81d8-9cc5-438e-b320-72d3c0affaf8_4032x3024.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1092,&quot;width&quot;:1456,&quot;resizeWidth&quot;:376,&quot;bytes&quot;:1351236,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.helix.ml/i/169743593?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F227b81d8-9cc5-438e-b320-72d3c0affaf8_4032x3024.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GND5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F227b81d8-9cc5-438e-b320-72d3c0affaf8_4032x3024.heic 424w, https://substackcdn.com/image/fetch/$s_!GND5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F227b81d8-9cc5-438e-b320-72d3c0affaf8_4032x3024.heic 848w, 
https://substackcdn.com/image/fetch/$s_!GND5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F227b81d8-9cc5-438e-b320-72d3c0affaf8_4032x3024.heic 1272w, https://substackcdn.com/image/fetch/$s_!GND5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F227b81d8-9cc5-438e-b320-72d3c0affaf8_4032x3024.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Chris and Luke having a well-deserved whisky en route from swampUP in Austin to San 
Francisco</figcaption></figure></div><div><hr></div><p><em>Check out <a href="https://helix.ml">helix.ml</a> to build AI agents on Helix, deploy to your own infrastructure, and eliminate tedious work in your business.</em></p><p><em>For a slightly more formal take on the 2.0 release, and more information, check out the press release: <a href="https://www.businesswire.com/news/home/20250731430866/en/Helix-2.0-Gives-Global-Enterprises-the-Fastest-Path-to-AI-Agents-on-a-Private-GenAI-Stack">Helix 2.0 Gives Global Enterprises the Fastest Path to AI Agents on a Private GenAI Stack</a></em></p>]]></content:encoded></item><item><title><![CDATA[Kodit 0.3: 10x Faster Indexing and Enterprise-Grade New Features]]></title><description><![CDATA[Information about Kodit's latest 0.3 release]]></description><link>https://blog.helix.ml/p/kodit-03-10x-faster-indexing-and</link><guid isPermaLink="false">https://blog.helix.ml/p/kodit-03-10x-faster-indexing-and</guid><dc:creator><![CDATA[Phil Winder]]></dc:creator><pubDate>Fri, 27 Jun 2025 14:12:20 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!uVK-!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6ac6823-53fa-4485-b35d-65c2770f5cb8_1280x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://github.com/helixml/kodit">Kodit</a>, the MCP server that keeps your most important codebases searchable, has just reached its biggest milestone yet. 
Thanks to community feedback I&#8217;ve dramatically improved indexing throughput and delivered a raft of enterprise-focused enhancements.</p><ul><li><p><strong>10&#215; faster indexing</strong>: smarter batching + streaming generators</p></li><li><p><strong>Private Azure DevOps support</strong>: zero-config, secrets scrubbed</p></li><li><p><strong>Pre-filter searches</strong>: by language, author, timestamp or repo</p></li><li><p><strong>Auto-indexing</strong>: via environment variables (AI GitOps!)</p></li><li><p><strong>Slick CLI progress bars</strong>: for instant feedback</p></li></ul><p>Read on for the details or <a href="https://docs.helixml.tech/kodit/getting-started/">skip to the Quick Start</a> and give it a spin.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.helix.ml/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading HelixML! Subscribe for free to receive new posts and support our work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Improving Performance</h2><p>This version delivers a major throughput improvement to the indexing process. I started with a <a href="https://github.com/helixml/kodit/issues/118">GitHub issue</a> that rightly suggested that the indexing UX was poor. So I began by converting all heavy i/o loops to generators to reduce RAM usage by streaming results back to whatever needed it.</p><p>On the way I found a massive issue with the way I was batching data for embedding. 
Batching is required because most embedding APIs (local and remote) support sending batches of inputs to an endpoint out of the box. But they only support this up to a point. OpenAI, for example, <a href="https://platform.openai.com/docs/api-reference/embeddings">only supports batches of up to 8192 tokens</a>, otherwise you get an HTTP 400 error. That means you need to a) calculate the number of tokens in your data, and b) only batch them up to the point where they fit.</p><p>What&#8217;s worse, sometimes people like to write massive functions, which means that I have seen snippets longer than 8192 tokens, in which case you need to truncate. But because tokens != words, you need to use the tokeniser to figure out where to truncate.</p><p>I found, however, that I had a <a href="https://github.com/helixml/kodit/pull/141/files#diff-d85f95a5d5f5c9608ae5af58ca1008a025a2c8d8ce09a7a235b067ca9922ddf9L49">while loop that iteratively reduced the character count and recalculated the number of tokens for </a><em><a href="https://github.com/helixml/kodit/pull/141/files#diff-d85f95a5d5f5c9608ae5af58ca1008a025a2c8d8ce09a7a235b067ca9922ddf9L49">every</a></em><a href="https://github.com/helixml/kodit/pull/141/files#diff-d85f95a5d5f5c9608ae5af58ca1008a025a2c8d8ce09a7a235b067ca9922ddf9L49"> character</a>. Stupid, I know. I replaced this with a version that uses the raw token array to truncate the data in one go. This alone provided a 10x improvement.</p><p>After that, and after a brief quest to make the codebase more domain-driven, I implemented an observer pattern with callbacks to the CLI code to display nice progress bars for all operations. UX win for everyone!</p><blockquote><p><em>Indexing will still crawl if you try to index large repositories on your laptop using local models. 
Use <a href="https://docs.helixml.tech/kodit/reference/configuration/#default-indexing-provider">an external AI provider</a> like OpenAI or <a href="https://Helix.ML">Helix.ML</a> to make it really snappy!</em></p></blockquote><h2>Indexing Private Repositories</h2><p>I had an important enterprise request to be able to index private Azure DevOps repositories. Thankfully it turned out that the Git URI schema happily accepts personal access tokens, including for Azure DevOps repositories. The only thing I needed to do was <a href="https://github.com/helixml/kodit/pull/152">sanitise the URI</a> so that secrets didn&#8217;t end up in the database or the logs.</p><p><a href="https://docs.helixml.tech/kodit/reference/indexing/#indexing-a-private-azure-devops-repository">Check out the documentation for more details</a>.</p><h2>Filtering Searches By X</h2><p>Another enterprise feature that is also useful to power users is the ability to pre-filter search results in the MCP or CLI interfaces. Previously, if you had a large number of repositories, it was hard for the agent to find canonical results. There are a variety of reasons for this, but the main one is that much of the index isn&#8217;t relevant to the user&#8217;s current workspace. For example, it&#8217;s quite likely that the user doesn&#8217;t need Java snippets when they are writing a Python application.</p><p>So Kodit 0.3 introduces filters that allow you to restrict the search by source, language, author, or timestamp. Of course, in most usage it&#8217;s the AI agent that makes this decision, but you can influence which filters it predicts <a href="https://docs.helixml.tech/kodit/reference/mcp/#filtering-capabilities">with good prompting</a>.</p><h2>Auto-Indexing</h2><p>Aside from <a href="https://docs.helixml.tech/kodit/reference/deployment/">improving the deployment documentation</a>, we also had an enterprise request to make it possible to index via configuration; AI GitOps, if you will. 
I achieved this by <a href="https://docs.helixml.tech/kodit/reference/indexing/#auto-indexing">exposing some new environment variables</a> that allow you to specify what gets indexed at configuration time. I call this &#8220;auto-indexing.&#8221;</p><p>In the future I envisage requests for the ability to specify configuration options per index, or even an external API to update the index remotely. If you&#8217;re interested in any of this, please <a href="https://github.com/helixml/kodit/discussions">raise a feature request</a>.</p><h2>What&#8217;s Next?</h2><p>I have lots more planned for <a href="https://github.com/helixml/kodit/milestone/4">the next milestone</a>, though I&#8217;d love to hear your thoughts. If you have a great idea, don&#8217;t keep it to yourself. <a href="https://github.com/helixml/kodit/discussions">Let me know!</a> I&#8217;d love to include it in a future milestone. The next one will include the following major features:</p><ul><li><p><strong>better CLI tools</strong> to manage indexes</p></li><li><p>ability to <strong>keep indexes synchronised</strong> with their source</p></li><li><p><strong>full MCP protocol coverage</strong> to make it easier to use and install (especially streaming HTTP, to get OAuth support)</p></li><li><p>a <strong>Helix-hosted SaaS</strong> version of Kodit to make it even easier to get started and open the door to federated indexing</p></li></ul><h2>Try Kodit Now</h2><p>Now&#8217;s your chance to try Kodit if you haven&#8217;t already. 
I think it&#8217;s fast becoming <em>the</em> way to ensure your AI coding assistant has the context it needs to work with obscure libraries, <a href="https://docs.helixml.tech/kodit/demos/knock-knock-auth/">private enterprise repositories</a>, or even code within <a href="https://docs.helixml.tech/kodit/demos/go-simple-microservice/">a microservices architecture</a>!</p><p>Try it now and <a href="https://github.com/helixml/kodit/discussions">let me know how it goes</a>!</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.helix.ml/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading HelixML! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Helix Kodit: Open Source MCP Server to Index External Repositories]]></title><description><![CDATA[Early adopter edition of Helix Kodit - get the best out of your AI coding assistant]]></description><link>https://blog.helix.ml/p/helix-kodit-open-source-mcp-server</link><guid isPermaLink="false">https://blog.helix.ml/p/helix-kodit-open-source-mcp-server</guid><dc:creator><![CDATA[Phil Winder]]></dc:creator><pubDate>Mon, 09 Jun 2025 14:16:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!6bAW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90997a05-425b-453b-b02a-daaaf7401a62_1000x774.png" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<p>AI coding assistants have emerged as one of the best use cases for generative AI. VCs and companies are <a href="https://tracxn.com/d/trending-business-models/startups-in-ai-coding-assistants/__3QagxTE--TJ84LXFHTH9fsuMWPV16HtCvsvtOGaaNBs">investing billions</a> in developing this use case because the value proposition is clear: they help you develop faster.</p><h2>The Problem With AI Coding Assistants</h2><p>If you&#8217;re anything like me, you&#8217;ve been using AI coding assistants to speed up your development. You&#8217;ve probably had a lot of success, but there are still many areas for improvement.</p><p>The key problem that hinders me most is that AI models have large data blind spots. Some of these are quite obvious.</p><p>Foundation models have been trained on data up to a certain point in time. Whenever you use a new model, it&#8217;s likely that the data cutoff was, at best, approximately one year before. This means that the model is incapable of accurately predicting code for new versions of a language or library.</p><p>The next problem is that the capability of a model is directly related to how much relevant data was included in the training data.
Esoteric libraries with few public examples, even long-established ones, might be so under-sampled that, again, the model can&#8217;t infer what the code should look like.</p><p>The third, and possibly most important, example is when codebases are private. Private codebases are restricted, and it&#8217;s unlikely (though not impossible!) that this data has made it into the model&#8217;s training data. This means that the model is again incapable of generating code directly related to your private, enterprise code.</p><p>There are more situations where models perform poorly due to lack of awareness, but these are the main three. So what&#8217;s the best way of overcoming this?</p><h2>Can RAG Help?</h2><p>One pattern that has <a href="https://winder.ai/llm-architecture-rag-implementation-design-patterns/">proven itself is retrieval</a>. Retrieval-augmented generation (RAG) incorporates extra context like examples, documentation, data, and <a href="https://winder.ai/practical-use-cases-for-retrieval-augmented-generation-rag/">anything else related to the problem at hand</a>. This gives the model extra information with which to make a prediction, and the results are often much better.</p><p>This led me to an idea: build a tool that allows you to &#8220;include&#8221; codebases and related information to help overcome the previous three issues.
You can include codebases for new libraries, codebases with accepted enterprise patterns, and in the future, much much more.</p><p>You ingest codebases and pass relevant information to the AI assistant to help it write better code.</p><h2>Introducing Kodit - Early Adopter Edition</h2><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6bAW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90997a05-425b-453b-b02a-daaaf7401a62_1000x774.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6bAW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90997a05-425b-453b-b02a-daaaf7401a62_1000x774.png 424w, https://substackcdn.com/image/fetch/$s_!6bAW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90997a05-425b-453b-b02a-daaaf7401a62_1000x774.png 848w, https://substackcdn.com/image/fetch/$s_!6bAW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90997a05-425b-453b-b02a-daaaf7401a62_1000x774.png 1272w, https://substackcdn.com/image/fetch/$s_!6bAW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90997a05-425b-453b-b02a-daaaf7401a62_1000x774.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6bAW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90997a05-425b-453b-b02a-daaaf7401a62_1000x774.png" width="124" height="95.976" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/90997a05-425b-453b-b02a-daaaf7401a62_1000x774.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:774,&quot;width&quot;:1000,&quot;resizeWidth&quot;:124,&quot;bytes&quot;:865912,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.helix.ml/i/165539615?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90997a05-425b-453b-b02a-daaaf7401a62_1000x774.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6bAW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90997a05-425b-453b-b02a-daaaf7401a62_1000x774.png 424w, https://substackcdn.com/image/fetch/$s_!6bAW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90997a05-425b-453b-b02a-daaaf7401a62_1000x774.png 848w, https://substackcdn.com/image/fetch/$s_!6bAW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90997a05-425b-453b-b02a-daaaf7401a62_1000x774.png 1272w, https://substackcdn.com/image/fetch/$s_!6bAW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90997a05-425b-453b-b02a-daaaf7401a62_1000x774.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>I&#8217;m pleased to announce the early adopter edition of Kodit. 
Kodit is an MCP server that indexes codebases and offers relevant snippets of code to your coding assistant.</p><p>I chose to expose Kodit as an MCP server because the vast majority of coding assistants can now integrate with tools via MCP. So all you need to do is index your codebases, connect Kodit to your coding assistant, and let your assistant query Kodit for relevant examples.</p><p>In this early adopter version, you can index local and remote codebases, search using keyword and semantic search, and scale by using an external database and AI providers.</p><p>I&#8217;ve focused on providing a strong local experience, so out of the box it will use a local database and local models. Performance won&#8217;t be great, but you won&#8217;t need to add any API keys. For more advanced, daily enterprise users, you can start Kodit as a container, use specialised search-optimised databases, and connect to external (or on-premises!) AI providers. Learn how to do this in the <a href="https://docs.helixml.tech/kodit/reference/">reference documentation</a>.</p><p>In my experience, I&#8217;ve had much better results on a variety of tasks when using Kodit. But I&#8217;m launching this early adopter edition to <a href="https://github.com/helixml/kodit/discussions/new/choose">gather feedback</a> from your experience with it.</p><p>This early feedback will help ensure that the roadmap represents ideas that really help you.
So please help by trying Kodit, giving feedback on what does and doesn&#8217;t work, and sharing what you&#8217;d like to see moving forward.</p><p>More information:</p><ul><li><p><a href="https://github.com/helixml/kodit/">Repository</a></p></li><li><p><a href="https://docs.helix.ml/kodit/">Documentation</a> and <a href="https://docs.helixml.tech/kodit/getting-started/">Getting Started Guide</a></p></li><li><p><a href="https://github.com/helixml/kodit/discussions/new/choose">Provide feedback</a>, good or bad!</p></li></ul>]]></content:encoded></item></channel></rss>