
LLM-as-a-judge

September 11, 2025
Insights

How well can our AI remediation tool fix insecure code?

That’s the question we set out to answer at Symbiotic Security, not in theory but through structured experimentation. We built and tested our AI-powered remediation engine, then evaluated its output at scale using multiple large language models (LLMs) acting as independent judges. We ran hundreds of remediation scenarios across seven programming languages, applying rigorous evaluation criteria and model benchmarking.

This article details what we built, how we evaluated it, and what we learned about the strengths, limits, and measurable performance of AI-driven remediation.


🧱 IaC vs. SAST: Two Worlds of Remediation

Before diving into our experiments, it’s important to distinguish between two domains of AI remediation:

🛠️ Infrastructure as Code (IaC)

Of course, there are some exceptions that you need to be able to handle. For example, when a Terraform module is used, you may have to look both at the resource inside the module and at the way the module is instantiated.

🧠 Application Code / classical languages

🎯 File-level remediation vs. agentic autonomous remediation

We are working on two different approaches at the same time:

🔍 LLM-as-a-Judge: An Evaluation Framework for AI-Generated Fixes

Runnable tests are especially valuable for full-project remediation.

Full-project remediation involves not only fixing the vulnerable code, but also guiding the developer through additional tasks such as performing migrations, writing unit or end-to-end tests, updating documentation, or making changes beyond the vulnerable function itself. When you only update the vulnerable portion of the code and hand the developer a checklist to wrap up, runnable unit tests may not be the best option.

That’s where the LLM-as-a-Judge paradigm came in.

Originally popularized for evaluating chatbots, agents, and other LLM-based systems, LLM-as-a-Judge offers a practical alternative to human reviews, especially when evaluating open-ended outputs. It works by prompting another language model (often a different one) to score the output based on predefined criteria.

We adapted this idea to SAST remediation.

The concept is simple: use an LLM to “judge” the remediation of a vulnerable code snippet and its accompanying recommendations based on guidelines you define.

In our case, we ask LLMs to assess the quality of each AI-generated fix across six key dimensions:

  1. Remediation Strategy Adherence
  2. Functional Integrity
  3. Code Quality & Executability
  4. Recommendation Quality
  5. Security Completeness
  6. Implementation Robustness

The evaluation input included:

🧮 What Do These Dimensions Actually Mean?

To ensure a well-rounded evaluation, each AI-generated remediation is scored across six key dimensions, each rated from 0 to 20. Here’s what we’re assessing:

Dimension | What It Measures
Remediation Strategy Adherence | How well the fix follows documented best practices and the recommended remediation strategy.
Functional Integrity | Whether the original behavior and functionality of the application are preserved post-remediation.
Code Quality & Executability | The syntactic correctness, cleanliness, and executability of the remediated code.
Recommendation Quality | The relevance, clarity, and completeness of the developer-facing suggestions that accompany the fix.
Security Completeness | Whether the vulnerability has been fully addressed, including edge cases or related exposures.
Implementation Robustness | The durability and maintainability of the fix: does it prevent recurrence and reflect sound security design?

These dimensions reflect both the technical correctness of the fix and the practical usefulness of the overall remediation recommendations provided to developers.

The LLM then returns a score, label, or even a detailed explanation guided entirely by the evaluation prompt.

This method allows us to systematically evaluate AI remediations even in complex, multi-language projects without requiring runnable code.
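
To make this concrete, here is a minimal sketch of what such a judge call can look like, assuming the OpenAI Python SDK; the prompt wording, dimension identifiers, and helper function are simplified illustrations rather than our production implementation.

```python
# Minimal LLM-as-a-Judge sketch (illustrative only; assumes the OpenAI Python SDK).
# The dimension names mirror the six criteria described above; the prompt is simplified.
import json
from openai import OpenAI

DIMENSIONS = [
    "remediation_strategy_adherence",
    "functional_integrity",
    "code_quality_and_executability",
    "recommendation_quality",
    "security_completeness",
    "implementation_robustness",
]

def judge_remediation(vulnerable_code: str, fixed_code: str, guidelines: str,
                      model: str = "gpt-4.1") -> dict:
    """Ask a judge model to score one remediation on each dimension (0-20) as JSON."""
    client = OpenAI()
    prompt = (
        "You are a security code reviewer. Score the remediation below on each "
        f"of these dimensions from 0 to 20: {', '.join(DIMENSIONS)}.\n"
        "Return a JSON object mapping each dimension to a score, plus a short 'rationale'.\n\n"
        f"Remediation guidelines:\n{guidelines}\n\n"
        f"Vulnerable code:\n{vulnerable_code}\n\n"
        f"Proposed fix:\n{fixed_code}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # force machine-readable scores
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)
```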

🧪 Benchmarking: 5 Judge LLMs, 190 Projects, 7 Languages

To validate the robustness of this “LLM-as-a-Judge” approach, we ran it across five judge LLMs, 190 vulnerable projects, and seven programming languages.

📊 Results: Consistency, Insight, and a Few Surprises

Score Range Grade
90-100 Excellent
80-89 Good
70-79 Satisfactory
60-69 Acceptable
50-59 Needs Improvement
0-49 Poor
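
Mapping an aggregate 0-100 score to these grade bands is a simple lookup; the sketch below just mirrors the table above.

```python
def grade(score: float) -> str:
    """Map an aggregate 0-100 score to the grade bands used in this article."""
    if score >= 90: return "Excellent"
    if score >= 80: return "Good"
    if score >= 70: return "Satisfactory"
    if score >= 60: return "Acceptable"
    if score >= 50: return "Needs Improvement"
    return "Poor"
```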

Top Models

All models evaluated the same 190 vulnerable projects across 7 languages.

A couple of keys to help read the following results:

Rank | Model | Avg. Score | Grade | Consistency (σ)
1 | GPT-4.1 | 91.4 | A – Excellent | 15.20 (most consistent)
2 | Claude-3.7-Sonnet | 91.1 | A – Excellent | 15.62
3 | GPT-4.1-Mini | 89.4 | B – Good | 15.61
4 | Gemini-2.5-Pro | 87.4 | B – Good | 20.93 (least consistent)
5 | Claude-4 (Sonnet) | 84.5 | B – Good | 17.30
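
The averages and consistency figures above are plain descriptive statistics over each judge's per-project scores. A minimal sketch of that aggregation, assuming the scores have already been collected in memory, could look like this:

```python
from statistics import mean, stdev

def summarize_judges(scores_by_model: dict[str, list[float]]) -> list[tuple[str, float, float]]:
    """Return (model, average score, standard deviation), sorted best average first.

    `scores_by_model` maps a judge model name to its aggregate 0-100 score
    for each evaluated project.
    """
    summary = [
        (model, mean(scores), stdev(scores))
        for model, scores in scores_by_model.items()
    ]
    return sorted(summary, key=lambda row: row[1], reverse=True)
```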

🎭 Model "Personalities"

🔥 Claude-4: The Perfectionist

Claude-4 consistently gave the lowest scores, especially in:

GPT-4.1: The Balanced Judge

GPT-4.1 showed the least score variance, making it our most consistent evaluator.

Claude-3.7: The Optimistic Expert

Gemini-2.5: The Unstable Judge

Gemini was the most critical around functional execution and produced the most inconsistent evaluations (σ = 20.93).

🧠 Prompt Engineering: Context is key

Now that we had a trusted and consistent evaluation setup, we conducted a series of experiments to measure how different additions to the AI remediation pipeline affect performance.

One of the first improvements we tested was including the project hierarchy (i.e., file and folder structure) in the evaluation prompt. This simple context addition led to notable score improvements ranging from 2 to 4 points across models, particularly in the following dimensions:

Model | Without Hierarchy | With Hierarchy
GPT-4.1 | 91.4 | 93.5
Gemini | 87.4 | 91.2
Claude-4 | 84.5 | 87.7

These results validate the importance of structured context in helping LLMs better reason about code, even when editing a single file. But they also highlight a deeper set of trade-offs we’re actively exploring.
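
As an illustration of this kind of context addition, a project hierarchy can be serialized into a compact tree string and prepended to the remediation prompt. The sketch below uses a straightforward os.walk traversal and is a simplification, not our production implementation:

```python
import os

def project_hierarchy(root: str, max_depth: int = 4) -> str:
    """Render a project's file/folder tree as an indented string suitable for a prompt."""
    lines = []
    root = os.path.abspath(root)
    for dirpath, dirnames, filenames in os.walk(root):
        depth = dirpath[len(root):].count(os.sep)
        if depth >= max_depth:
            dirnames[:] = []  # stop descending past max_depth to keep the context compact
            continue
        indent = "  " * depth
        lines.append(f"{indent}{os.path.basename(dirpath) or dirpath}/")
        for name in sorted(filenames):
            lines.append(f"{indent}  {name}")
    return "\n".join(lines)
```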

To build on that, we introduced additional techniques to make context more relevant and focused:

These are just a few of the strategies we’re exploring, and they reflect the trade-offs we’re actively managing between:

Another challenge lies in deciding how far to push automation: should the system fix only the vulnerable file and leave the rest to the developer, or aim for full-project, multi-file changes? The former is simpler and safer; the latter is powerful but more complex.

We're experimenting across this spectrum - carefully tuning what context we provide and how we guide the model - without disclosing the exact mechanics. What’s clear is that thoughtful context engineering can move the needle, and we’re continuously refining that balance as the product evolves.

🧭 Where We're Headed Next

These experiments affirm that:

🤔 Why Does It Work?

At first glance, using an LLM to judge the output of another LLM might seem circular or even unreliable. But the key insight is this: we're not asking the model to regenerate the fix - we're asking it to evaluate it. By framing the task differently and using a dedicated evaluation prompt, we tap into a distinct capability of the model - its ability to analyze, classify, and critique text based on structured criteria. This turns the LLM into a focused evaluator rather than a creative generator.

To further improve reliability, we don't rely on a single model. Instead, we run the same evaluation across multiple LLMs, such as GPT-4.1, Claude, and Gemini. Despite their architectural differences, we consistently observed strong inter-model agreement on the quality scores across each dimension we assess. This convergence gives us confidence that the evaluation results are not only scalable, but also trustworthy.
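
One simple way to quantify that inter-model agreement is to correlate the per-project scores of each pair of judges. The sketch below uses Pearson correlation from Python's statistics module (3.10+); the data layout is an illustrative assumption, not our exact methodology:

```python
from itertools import combinations
from statistics import correlation

def judge_agreement(scores_by_model: dict[str, list[float]]) -> dict[tuple[str, str], float]:
    """Pearson correlation of per-project scores for every pair of judge models.

    Assumes each score list is aligned (index i = project i) and that
    Python 3.10+ is available for statistics.correlation.
    """
    return {
        (a, b): correlation(scores_by_model[a], scores_by_model[b])
        for a, b in combinations(sorted(scores_by_model), 2)
    }
```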

💡 Final Thoughts

AI-powered remediation is not about chasing automation for its own sake - it's about delivering reliable, developer-centered solutions. Through deliberate scoping, prompt and context engineering, and systematic evaluation, we've found a pragmatic path to meaningful impact: helping developers fix security issues faster, with confidence.

With a trusted evaluation framework in place, we're now able to measure progress, test new ideas, and make informed improvements to the remediation pipeline. There's more to explore—and we're just getting started.

About the author
Edouard Viot
CTO - Chief Technology Officer
With over 16 years of experience across the cybersecurity spectrum and 6 years in executive roles, Édouard is a seasoned expert in the field. He has led the design and development of innovative products in Application Security (GitGuardian), Web Application Firewalls (DenyAll), and Endpoint Detection and Response (Stormshield). A hacker at heart, Édouard is also a respected team leader, known for his ability to inspire and guide high-performance teams to success.