YOLO Prompt Engineering Is Not OK
Recently I’ve started building teams of AI Agents. It’s “just” prompt engineering, but it’s taught me a hard lesson about how much the prompt matters. I had my agents working seamlessly together to devise a QA checklist, and then I tried to “improve” them. You can guess where this story ends from the title of this post… the agents no longer worked seamlessly together. In fact, they were just passing the same list between themselves. Silly agents.
I had no way to recover the working versions of my system prompts. They were gone. The elegant choreography I’d briefly created was trashed. By me.
I had to rely on my stupid, fallible human memory to try to wind back the changes.
Source Control & Versioning For Your Prompts
Versioning and source control aren’t just about keeping tabs on your code; we use them for our configuration too. We give it snazzy names like “IaC” or “GitOps”, but it’s just common sense really. We should always be thinking about what we’ll do when we accidentally break something, and make it easy to fix! These are table stakes when we’re writing code, so why not for prompts?
By storing all GenAI configuration artefacts in a centralised repository, you can roll back to the previous stable version when something breaks. Once a version is released, you can also track exactly which version is running in each of your environments, which is important even when things are going well!
Remember the days of configuration drift in traditional applications? When every environment was subtly different and things blew up in production in spectacularly unpredictable ways? Let’s not recreate that nightmare in the GenAI space. We learnt that lesson, and we already have tools in our toolbox that solve this problem.
Experimentation is Good. So is Reliability
In our rush to innovate with GenAI, many of us have convinced ourselves that the rules of software engineering don't apply to prompt engineering.
Prompts are written in natural language and they are frequently authored by people without a development background. It’s not always intuitive to think of them as software development artefacts. But just because we're "programming" in natural language doesn't make this any less of a software engineering challenge.
This is especially true when you’re integrating Large Language Models into applications used by your customers and colleagues: those applications have to be reliable.
When properly managed, these GenAI-specific artefacts and configuration enable automated testing and continuous integration, because they can be easily retrieved and used in build and test pipelines. This automation improves the long-term reliability and quality of your application, as you can test each version before it’s deployed to production and prevent bugs and regressions from impacting users. Recently we ran a workshop on CI for GenAI Apps which you can watch here.
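As a rough sketch of what that pipeline can look like, here’s a hypothetical GitHub Actions workflow that runs a prompt test suite whenever the versioned configuration changes. The paths and the run-prompt-tests.sh script are illustrative placeholders for your own tooling, not a specific Helix command:

```yaml
# .github/workflows/prompt-ci.yml (illustrative sketch, not a Helix-specific pipeline)
name: prompt-ci
on:
  pull_request:
    paths:
      - "genai/**/*.yaml"   # the versioned prompt/app configuration
jobs:
  test-prompts:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Placeholder step: run whatever harness exercises your prompts against their tests.
      - name: Run prompt regression tests
        run: ./scripts/run-prompt-tests.sh genai/
```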
How We Solved This Problem At Helix
At Helix we store all GenAI application configuration as YAML in our source control, alongside the tests for that application. It seems so simple, but it’s not yet the norm!
(Example from helix-github-assistant, which we wrote about here)
The YAML spec includes all the configurable aspects of your GenAI application, such as name, description, model, system prompt, integrations, and tests. We’ve found this model extremely useful, so we shared it at https://aispec.org.
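To give a feel for the shape of such a file, here’s a minimal sketch based on the fields listed above. The field names and values are illustrative rather than the actual AISpec schema, so check https://aispec.org for the real format:

```yaml
# Illustrative sketch only; field names mirror the list above, not necessarily the AISpec schema.
name: github-assistant
description: Answers questions about issues and pull requests in our repositories.
model: gpt-4o                # swap for a self-hosted model later without touching anything else
system_prompt: |
  You are a helpful assistant for our engineering team.
  Answer questions using the connected GitHub integration.
integrations:
  - type: github
    repositories:
      - our-org/our-repo
tests:
  - name: lists open pull requests
    prompt: "Which pull requests are waiting for review?"
    expected: mentions at least one open pull request
```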
To make the spec as versatile as possible, we’ve added the ability to expose AISpec-compliant applications via an OpenAI-compatible API, so you don’t need to re-engineer all your interfaces if you want to switch from OpenAI to a self-hosted LLM at a later date.
How Are You Solving This Problem?
We’re open to feedback and collaboration! How have you approached this problem? Come and find us at KubeCon this week or join the community and help us improve the AI Spec!
Ready to bring proper engineering practices to your GenAI applications? Helix is an enterprise-ready private GenAI platform that helps you manage and version control all your AI artefacts. Because "YOLO prompting" should never be your deployment strategy.