How well can our AI remediation tool fix insecure code?
That’s the question we set out to answer at Symbiotic Security, not in theory but through structured experimentation. We built and tested our AI-powered remediation engine, then evaluated its output at scale using multiple large language models (LLMs) acting as independent judges. We ran hundreds of remediation scenarios across seven programming languages, applying rigorous evaluation criteria and model benchmarking.
This article details what we built, how we evaluated it, and what we learned about the strengths, limits, and measurable performance of AI-driven remediation.
What we'll cover:
Before diving into our experiments, it’s important to distinguish between two domains of AI remediation:
Of course, there are some exceptions you need to be able to handle. For example, when a Terraform module is used, you may have to look both at the resource inside the module and at the way the module is instantiated.
We are working on two different approaches at the same time:
Runnable tests are especially valuable for full-project remediation.
Full-project remediation involves not only fixing the vulnerable code, but also guiding the developer through additional tasks such as performing migrations, writing unit or end-to-end tests, updating documentation, or making changes beyond the vulnerable function itself. When you update the vulnerable portion of the code and hand the developer a checklist of tasks to wrap up, runnable unit tests may not be the best option.
That’s where the LLM-as-a-Judge paradigm came in.
Originally popularized for evaluating chatbots, agents, and other LLM-based systems, LLM-as-a-Judge offers a practical alternative to human review, especially when evaluating open-ended outputs. It works by prompting another language model (often a different one) to score the output based on predefined criteria.
We adapted this idea to SAST remediation.
The concept is simple: use an LLM to “judge” the remediation of a vulnerable code snippet and its accompanying recommendations based on guidelines you define.
In our case, we ask LLMs to assess the quality of each AI-generated fix across six key dimensions:
The evaluation input included:
To ensure a well-rounded evaluation, each AI-generated remediation is scored across six key dimensions, each rated from 0 to 20. Here’s what we’re assessing:
These dimensions reflect both the technical correctness of the fix and the practical usefulness of the overall remediation recommendations provided to developers.
The LLM then returns a score, a label, or even a detailed explanation, guided entirely by the evaluation prompt.
This method allows us to systematically evaluate AI remediations even in complex, multi-language projects without requiring runnable code.
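To make the mechanics concrete, here is a minimal sketch of what such a judge could look like. The dimension names, the prompt wording, and the `call_llm` helper are illustrative assumptions, not our production criteria or code.

```python
import json

# Illustrative dimension names only; the actual six criteria are defined
# in our internal evaluation guidelines.
DIMENSIONS = [
    "vulnerability_removed",
    "functional_correctness",
    "code_quality",
    "recommendation_clarity",
    "scope_of_changes",
    "documentation_and_follow_up",
]

JUDGE_PROMPT = """You are a strict security code reviewer.
Score the proposed remediation on each dimension from 0 to 20.
Respond with JSON only: {{"scores": {{"<dimension>": <int>, ...}}, "explanation": "<str>"}}

Dimensions: {dimensions}

Vulnerability report:
{finding}

Original code:
{original}

Proposed fix:
{fixed}

Remediation recommendations:
{recommendations}
"""


def judge_remediation(call_llm, model, finding, original, fixed, recommendations):
    """Ask one LLM to act as the judge and return per-dimension scores.

    `call_llm(model, prompt) -> str` is a hypothetical wrapper around
    whichever provider SDK you use (OpenAI, Anthropic, Gemini, ...).
    """
    prompt = JUDGE_PROMPT.format(
        dimensions=", ".join(DIMENSIONS),
        finding=finding,
        original=original,
        fixed=fixed,
        recommendations=recommendations,
    )
    result = json.loads(call_llm(model, prompt))
    # Keep only known dimensions and clamp scores to the 0-20 scale.
    return {d: max(0, min(20, int(result["scores"].get(d, 0)))) for d in DIMENSIONS}
```

Running the same evaluation with several judge models on the same fix is what enables the cross-model comparison described below.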
To validate the robustness of this “LLM-as-a-Judge” approach, we ran it across:
All models evaluated 190 vulnerable projects across 7 languages.
A couple of keys for reading the results that follow:
Claude-4 consistently gave the lowest scores, especially in:
GPT-4.1 showed the least score variance, making it our most consistent evaluator.
Gemini was the most critical on functional execution and had the most inconsistent evaluations (σ = 20.93).
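The comparisons above reduce to descriptive statistics over the judges' scores. Here is a rough sketch of that aggregation; the record fields are assumptions about how such results might be stored, not our actual schema.

```python
from collections import defaultdict
from statistics import mean, stdev

def summarize_scores(records):
    """records: iterable of dicts such as
    {"judge": "gpt-4.1", "dimension": "functional_correctness", "score": 17}.
    Returns mean, standard deviation, and count per (judge, dimension)."""
    grouped = defaultdict(list)
    for r in records:
        grouped[(r["judge"], r["dimension"])].append(r["score"])

    return {
        key: {
            "mean": round(mean(scores), 2),
            # stdev needs at least two data points
            "sigma": round(stdev(scores), 2) if len(scores) > 1 else 0.0,
            "n": len(scores),
        }
        for key, scores in grouped.items()
    }
```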
Now that we had a trusted and consistent evaluation setup, we conducted a series of experiments to measure how different additions to the AI remediation pipeline affect performance.
One of the first improvements we tested was including the project hierarchy (i.e., file and folder structure) in the evaluation prompt. This simple context addition led to notable score improvements ranging from 2 to 4 points across models, particularly in the following dimensions:
These results validate the importance of structured context in helping LLMs reason about code, even when editing a single file. But they also highlight a deeper set of trade-offs we’re actively exploring.
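As an illustration of what "project hierarchy" means in practice, the sketch below (a generic helper, not our actual implementation) walks a repository and renders a compact indented tree that can be prepended to a prompt.

```python
import os

def project_tree(root, max_depth=4, skip=(".git", "node_modules", "__pycache__")):
    """Render the file and folder structure of `root` as an indented text tree,
    suitable for pasting into an LLM prompt as lightweight context."""
    lines = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune noisy directories in place so os.walk does not descend into them.
        dirnames[:] = sorted(d for d in dirnames if d not in skip)
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        if depth >= max_depth:
            dirnames[:] = []  # stop descending past max_depth
            continue
        lines.append("  " * depth + os.path.basename(os.path.abspath(dirpath)) + "/")
        for name in sorted(filenames):
            lines.append("  " * (depth + 1) + name)
    return "\n".join(lines)

# Example usage: prepend the tree to the prompt
# prompt = f"Project structure:\n{project_tree('.')}\n\n{base_prompt}"
```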
To build on that, we introduced additional techniques to make context more relevant and focused:
These are just a few of the strategies we’re exploring, and they reflect the trade-offs we’re actively managing between:
Another challenge lies in deciding how far to push automation: should the system fix only the vulnerable file and leave the rest to the developer, or aim for full-project, multi-file changes? The former is simpler and safer; the latter is powerful but more complex.
We're experimenting across this spectrum - carefully tuning what context we provide and how we guide the model - without disclosing the exact mechanics. What’s clear is that thoughtful context engineering can move the needle, and we’re continuously refining that balance as the product evolves.
These experiments affirm that:
At first glance, using an LLM to judge the output of another LLM might seem circular or even unreliable. But the key insight is this: we're not asking the model to regenerate the fix - we're asking it to evaluate it. By framing the task differently and using a dedicated evaluation prompt, we tap into a distinct capability of the model - its ability to analyze, classify, and critique text based on structured criteria. This turns the LLM into a focused evaluator rather than a creative generator.
To further improve reliability, we don't rely on a single model. Instead, we run the same evaluation across multiple LLMs, such as GPT-4.1, Claude, and Gemini. Despite their architectural differences, we consistently observed strong inter-model agreement on the quality scores across each dimension we assess. This convergence gives us confidence that the evaluation results are not only scalable, but also trustworthy.
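That inter-model agreement can be checked with something as simple as pairwise correlation between judges scoring the same remediations. A minimal sketch, with toy numbers rather than our real data:

```python
from itertools import combinations
from statistics import correlation  # Python 3.10+

def inter_judge_agreement(scores_by_judge):
    """scores_by_judge maps a judge name to a list of scores, where index i
    refers to the same (remediation, dimension) pair for every judge.
    Returns the Pearson correlation for each pair of judges."""
    return {
        (a, b): round(correlation(scores_by_judge[a], scores_by_judge[b]), 3)
        for a, b in combinations(sorted(scores_by_judge), 2)
    }

# Toy example (illustrative values only):
# inter_judge_agreement({
#     "gpt-4.1":  [18, 12, 15, 9],
#     "claude-4": [17, 10, 14, 8],
#     "gemini":   [19, 11, 16, 7],
# })
```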
AI-powered remediation is not about chasing automation for its own sake - it's about delivering reliable, developer-centered solutions. Through deliberate scoping, prompt and context engineering, and systematic evaluation, we've found a pragmatic path to meaningful impact: helping developers fix security issues faster, with confidence.
With a trusted evaluation framework in place, we're now able to measure progress, test new ideas, and make informed improvements to the remediation pipeline. There's more to explore—and we're just getting started.