
LLM-as-a-judge

September 11, 2025
Insights

How well can our AI remediation tool fix insecure code?

That’s the question we set out to answer at Symbiotic Security, not in theory but through structured experimentation. We built and tested our AI-powered remediation engine, then evaluated its output at scale using multiple large language models (LLMs) acting as independent judges. We ran hundreds of remediation scenarios across seven programming languages, applying rigorous evaluation criteria and model benchmarking.

This article details what we built, how we evaluated it, and what we learned about the strengths, limits, and measurable performance of AI-driven remediation.


🧱 IaC vs. SAST: Two Worlds of Remediation

Before diving into our experiments, it’s important to distinguish between two domains of AI remediation:

🛠️ Infrastructure as Code (IaC)

Of course, there are some exceptions that you need to be able to handle. For example, when a Terraform module is used, you may have to look both at the resource inside the module and at the way the module is instantiated.

🧠 Application Code / classical languages

🎯 File-level remediation vs. agentic autonomous remediation

We are working on two different approaches at the same time:

🔍 LLM-as-a-Judge: An Evaluation Framework for AI-Generated Fixes

Runnable tests are especially valuable for full-project remediation.

Full-project remediation involves not only fixing the vulnerable code, but also guiding the developer through additional tasks such as performing migrations, writing unit or end-to-end tests, updating documentation, or making changes beyond the vulnerable function itself. When you only update the vulnerable portion of the code and hand the developer a checklist to wrap up, runnable unit tests may not be the best option.

That’s where the LLM-as-a-Judge paradigm came in.

Originally popularized for evaluating chatbots, agents, and other LLM-based systems, LLM-as-a-Judge offers a practical alternative to human reviews, especially when evaluating open-ended outputs. It works by prompting another language model (often a different one) to score the output based on predefined criteria.

We adapted this idea to SAST remediation.

The concept is simple: use an LLM to “judge” the remediation of a vulnerable code snippet and its accompanying recommendations based on guidelines you define.

In our case, we ask LLMs to assess the quality of each AI-generated fix across six key dimensions:

  1. Remediation Strategy Adherence
  2. Functional Integrity
  3. Code Quality & Executability
  4. Recommendation Quality
  5. Security Completeness
  6. Implementation Robustness

The evaluation input included:

🧮 What Do These Dimensions Actually Mean?

To ensure a well-rounded evaluation, each AI-generated remediation is scored across six key dimensions, each rated from 0 to 20. Here’s what we’re assessing:

Dimension | What It Measures
Remediation Strategy Adherence | How well the fix follows documented best practices and the recommended remediation strategy.
Functional Integrity | Whether the original behavior and functionality of the application are preserved post-remediation.
Code Quality & Executability | The syntactic correctness, cleanliness, and executability of the remediated code.
Recommendation Quality | The relevance, clarity, and completeness of the developer-facing suggestions that accompany the fix.
Security Completeness | Whether the vulnerability has been fully addressed, including edge cases or related exposures.
Implementation Robustness | The durability and maintainability of the fix: does it prevent recurrence and reflect sound security design?

These dimensions reflect both the technical correctness of the fix and the practical usefulness of the overall remediation recommendations provided to developers.

The LLM then returns a score, label, or even a detailed explanation guided entirely by the evaluation prompt.

This method allows us to systematically evaluate AI remediations even in complex, multi-language projects without requiring runnable code.
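
To make this concrete, here is a minimal sketch of what such a judge call can look like, assuming the OpenAI Python SDK; the prompt wording, dimension identifiers, and helper function are simplified illustrations rather than our production implementation.

```python
# Minimal LLM-as-a-Judge sketch (illustrative only; assumes the OpenAI Python SDK).
# The dimension names mirror the six criteria described above; the prompt is simplified.
import json
from openai import OpenAI

DIMENSIONS = [
    "remediation_strategy_adherence",
    "functional_integrity",
    "code_quality_and_executability",
    "recommendation_quality",
    "security_completeness",
    "implementation_robustness",
]

def judge_remediation(vulnerable_code: str, fixed_code: str, guidelines: str,
                      model: str = "gpt-4.1") -> dict:
    """Ask a judge model to score one remediation on each dimension (0-20) as JSON."""
    client = OpenAI()
    prompt = (
        "You are a security code reviewer. Score the remediation below on each "
        f"of these dimensions from 0 to 20: {', '.join(DIMENSIONS)}.\n"
        "Return a JSON object mapping each dimension to a score, plus a short 'rationale'.\n\n"
        f"Remediation guidelines:\n{guidelines}\n\n"
        f"Vulnerable code:\n{vulnerable_code}\n\n"
        f"Proposed fix:\n{fixed_code}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # force machine-readable scores
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)
```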

🧪 Benchmarking: 5 Judge LLMs, 190 Projects, 7 Languages

To validate the robustness of this “LLM-as-a-Judge” approach, we ran it across five judge LLMs, 190 vulnerable projects, and seven programming languages.

📊 Results: Consistency, Insight, and a Few Surprises

Score Range Grade
90-100 Excellent
80-89 Good
70-79 Satisfactory
60-69 Acceptable
50-59 Needs Improvement
0-49 Poor
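
Mapping an aggregate 0-100 score to these grade bands is a simple lookup; the sketch below just mirrors the table above.

```python
def grade(score: float) -> str:
    """Map an aggregate 0-100 score to the grade bands used in this article."""
    if score >= 90: return "Excellent"
    if score >= 80: return "Good"
    if score >= 70: return "Satisfactory"
    if score >= 60: return "Acceptable"
    if score >= 50: return "Needs Improvement"
    return "Poor"
```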

Top Models

All models evaluated the same 190 vulnerable projects across 7 languages.

A couple of keys to help read the following results:

Rank | Model | Avg. Score | Grade | Consistency (σ)
1 | GPT-4.1 | 91.4 | A – Excellent | 15.20 (most consistent)
2 | Claude-3.7-Sonnet | 91.1 | A – Excellent | 15.62
3 | GPT-4.1-Mini | 89.4 | B – Good | 15.61
4 | Gemini-2.5-Pro | 87.4 | B – Good | 20.93 (least consistent)
5 | Claude-4 (Sonnet) | 84.5 | B – Good | 17.30
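
The averages and consistency figures above are plain descriptive statistics over each judge's per-project scores. A minimal sketch of that aggregation, assuming the scores have already been collected in memory, could look like this:

```python
from statistics import mean, stdev

def summarize_judges(scores_by_model: dict[str, list[float]]) -> list[tuple[str, float, float]]:
    """Return (model, average score, standard deviation), sorted best average first.

    `scores_by_model` maps a judge model name to its aggregate 0-100 score
    for each evaluated project.
    """
    summary = [
        (model, mean(scores), stdev(scores))
        for model, scores in scores_by_model.items()
    ]
    return sorted(summary, key=lambda row: row[1], reverse=True)
```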

🎭 Model "Personalities"

🔥 Claude-4: The Perfectionist

Claude-4 consistently gave the lowest scores, especially in:

GPT-4.1: The Balanced Judge

GPT-4.1 showed the least score variance, making it our most consistent evaluator.

Claude-3.7: The Optimistic Expert

Gemini-2.5: The Unstable Judge

Gemini was the most critical around functional execution and produced the most inconsistent evaluations (σ = 20.93).

🧠 Prompt Engineering: Context is key

Now that we had a trusted and consistent evaluation setup, we conducted a series of experiments to measure how different additions to the AI remediation pipeline affect performance.

One of the first improvements we tested was including the project hierarchy (i.e., file and folder structure) in the evaluation prompt. This simple context addition led to notable score improvements ranging from 2 to 4 points across models, particularly in the following dimensions:

Model | Without Hierarchy | With Hierarchy
GPT-4.1 | 91.4 | 93.5
Gemini | 87.4 | 91.2
Claude-4 | 84.5 | 87.7

These results validate the importance of structured context in helping LLMs better reason about code, even when editing a single file. But they also highlight a deeper set of trade-offs we’re actively exploring.
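
As an illustration of this kind of context addition, a project hierarchy can be serialized into a compact tree string and prepended to the remediation prompt. The sketch below uses a straightforward os.walk traversal and is a simplification, not our production implementation:

```python
import os

def project_hierarchy(root: str, max_depth: int = 4) -> str:
    """Render a project's file/folder tree as an indented string suitable for a prompt."""
    lines = []
    root = os.path.abspath(root)
    for dirpath, dirnames, filenames in os.walk(root):
        depth = dirpath[len(root):].count(os.sep)
        if depth >= max_depth:
            dirnames[:] = []  # stop descending past max_depth to keep the context compact
            continue
        indent = "  " * depth
        lines.append(f"{indent}{os.path.basename(dirpath) or dirpath}/")
        for name in sorted(filenames):
            lines.append(f"{indent}  {name}")
    return "\n".join(lines)
```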

To build on that, we introduced additional techniques to make context more relevant and focused:

These are just a few of the strategies we’re exploring, and they reflect the trade-offs we’re actively managing between:

Another challenge lies in deciding how far to push automation: should the system fix only the vulnerable file and leave the rest to the developer, or aim for full-project, multi-file changes? The former is simpler and safer; the latter is powerful but more complex.

We're experimenting across this spectrum - carefully tuning what context we provide and how we guide the model - without disclosing the exact mechanics. What’s clear is that thoughtful context engineering can move the needle, and we’re continuously refining that balance as the product evolves.

🧭 Where We're Headed Next

These experiments affirm that:

🤔 Why Does It Work?

At first glance, using an LLM to judge the output of another LLM might seem circular or even unreliable. But the key insight is this: we're not asking the model to regenerate the fix - we're asking it to evaluate it. By framing the task differently and using a dedicated evaluation prompt, we tap into a distinct capability of the model - its ability to analyze, classify, and critique text based on structured criteria. This turns the LLM into a focused evaluator rather than a creative generator.

To further improve reliability, we don't rely on a single model. Instead, we run the same evaluation across multiple LLMs, such as GPT-4.1, Claude, and Gemini. Despite their architectural differences, we consistently observed strong inter-model agreement on the quality scores across each dimension we assess. This convergence gives us confidence that the evaluation results are not only scalable, but also trustworthy.
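
One simple way to quantify that inter-model agreement is to correlate the per-project scores of each pair of judges. The sketch below uses Pearson correlation from Python's statistics module (3.10+); the data layout is an illustrative assumption, not our exact methodology:

```python
from itertools import combinations
from statistics import correlation

def judge_agreement(scores_by_model: dict[str, list[float]]) -> dict[tuple[str, str], float]:
    """Pearson correlation of per-project scores for every pair of judge models.

    Assumes each score list is aligned (index i = project i) and that
    Python 3.10+ is available for statistics.correlation.
    """
    return {
        (a, b): correlation(scores_by_model[a], scores_by_model[b])
        for a, b in combinations(sorted(scores_by_model), 2)
    }
```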

💡 Final Thoughts

AI-powered remediation is not about chasing automation for its own sake - it's about delivering reliable, developer-centered solutions. Through deliberate scoping, prompt and context engineering, and systematic evaluation, we've found a pragmatic path to meaningful impact: helping developers fix security issues faster, with confidence.

With a trusted evaluation framework in place, we're now able to measure progress, test new ideas, and make informed improvements to the remediation pipeline. There's more to explore—and we're just getting started.

About the author
Edouard Viot
CTO - Chief Technology Officer
With over 16 years of experience across the cybersecurity spectrum and 6 years in executive roles, Édouard is a seasoned expert in the field. He has led the design and development of innovative products in Application Security (GitGuardian), Web Application Firewalls (DenyAll), and Endpoint Detection and Response (Stormshield). A hacker at heart, Édouard is also a respected team leader, known for his ability to inspire and guide high-performance teams to success.