AI coding agents are changing how software gets built. You describe a feature, the agent implements it, and in minutes you have working code. The promise is compelling: faster shipping, less boilerplate, more time for the interesting problems.

But there is a question most benchmarks do not ask: is the code actually secure?

We ran a controlled study comparing two AI coding agents across 100 generated projects, and the results surprised us even as the team that built the security-focused agent.

The Two Agents

Claude Code is Anthropic's coding agent. It is a capable, general-purpose agent that focuses on implementing what you ask for. It does offer a /security-review command, but security is not embedded in the development flow: it is an on-demand, post-implementation step that the developer has to manually trigger.

Symbiotic Code is our coding agent. It supports multiple Large Language Models, including Anthropic's Claude models. For this benchmark, both agents ran on the same model (Claude Sonnet 4.6) to ensure a fair comparison. The key difference is not the model: it is what the agent does before, during, and after writing code.

Symbiotic Code:

Views every feature through a security lens before planning anything, identifying threat vectors and security-sensitive components upfront
Injects security design tasks into the todo list before writing the first line of code
Runs a deterministic SAST scanner (powered by Opengrep and Trivy) after implementation to catch what the LLM missed
Performs an OWASP Top 10 audit on every feature, grounded in the scanner findings
Applies your organization's security guardrails throughout code generation, not as a post-processing step
Triages scanner findings for false positives using a dedicated security subagent that reads the code for context

The same model, different instructions, different tools, different results.

The Study Design

We wanted to answer one question: when given the same prompt, does a security-first coding agent produce meaningfully more secure code than a baseline agent?

Prompts

We wrote 50 prompts spanning 5 languages (10 per language): Java, Python, Go, JavaScript, and PHP. Each prompt asks for a small, functional feature or app in at most 15 files, covering a wide range of security-sensitive categories: SQL queries, file I/O, authentication, serialization, template rendering, command execution, SSRF, and more.

Critically, the prompts are neutral. They describe what to build without naming any specific implementation. No "build SQL by string concatenation." No "use eval()." Just the feature. This is the natural condition under which someone vibe-codes: they describe a goal and let the agent decide how to build it.

Protocol

Each agent ran headlessly against each prompt in an isolated, empty folder. Both agents used the same model: Claude Sonnet 4.6. Neither agent had access to the internet during generation.

We observed a specific stopping condition: we let the agent finish without intervention, and recorded the state of the project at the moment the agent handed control back to the user without making further suggestions. If an agent flagged security concerns and offered to implement fixes, we asked it to proceed. We stopped when the agent said something like "all done" or "the implementation is complete" and stopped acting. That final project state is what we scanned.

Evaluation

We used two independent signals:

1. Deterministic SAST scan via the Symbiotic CLI scanner (Opengrep + Trivy). This is objective and reproducible: the same scanner, the same rules, run on every project. It counts active findings by severity (Critical, High, Medium, Low), broken down by language, vulnerability category, and scanner confidence.

2. LLM-as-a-judge using a large language model that reads every source file in the project and scores it against 16 metrics on a 0 to 10 scale. The metrics cover both security quality (injection resistance, defense in depth, secrets management, input validation, etc.) and code quality (code completeness, project structure, error handling completeness). Scores are based only on what the judge actually reads in the code, not on assumptions.

The graphs referenced in this article are available in our public repository.

Results

SAST Scanner: 100% Fewer Vulnerabilities

Agent	Projects Scanned	Active Findings	High Severity	Medium Severity
Claude Code	50	45	13	32
Symbiotic Code	50	0	0	0

‍

The deterministic scanner found 45 active vulnerabilities in Claude Code generated projects. It found zero in Symbiotic Code projects.

This is a 100% reduction across all five languages.

The breakdown by language tells the same story everywhere:

Language	Claude Code	Reduction
JavaScript	18	100%
Go	12	100%
Python	11	100%
PHP	3	100%
Java	1	100%

‍

The most common finding categories in Claude Code generated projects were CSRF (10 findings), cleartext data transmission in Go (9 findings), XSS in Python and PHP, and path traversal. Every one of these categories showed zero findings in Symbiotic Code.

*Total active SAST findings and breakdown by severity*

‍

*Active SAST findings per project for all 50 prompts*

One thing worth noting: Symbiotic Code had 26 findings suppressed via nosymbiotic inline comments. These are cases where the agent read the code, assessed the finding in context, and determined it was a false positive, then documented the reasoning in a comment. Claude Code had zero suppressions, not because it was cleaner, but because it never ran the scanner at all.

LLM Judge: +3.04 Overall Security Score

The LLM judge scored each project across 16 metrics. Symbiotic Code outperformed Claude Code on every single one.

Metric	Claude Code	Symbiotic Code	Delta
Defense in Depth	3.94	9.34	+5.40
Input Validation	4.88	8.76	+3.88
Logging Security	4.32	8.06	+3.74
Configuration Security	5.42	8.98	+3.56
Attack Surface	5.58	8.78	+3.20
Authentication Quality	4.82	8.06	+3.24
Error Handling Completeness	5.88	8.70	+2.82
Injection Resistance	6.58	9.16	+2.58
Secrets Management	7.72	9.46	+1.74
Overall Score	5.58	8.62	+3.04

‍

The largest gaps are in defense in depth (+5.40) and input validation (+3.88). These are not about catching a single obvious bug. They reflect whether the code was architecturally designed with security in mind from the start, with multiple layers of protection rather than a single check that can be bypassed.

The smallest gaps are in problem adherence (+0.24) and dependency hygiene (+0.18). Both agents implement what was asked and choose reasonable dependencies. The difference is entirely in how they approach the security surface of what they build.

*Overall security score comparison (LLM Judge)*

*All 16 security and code quality metrics side by side*

What Explains the Difference

The model is identical. The prompts are identical. The starting conditions (an empty folder) are identical. The difference is entirely in the agent harness.

Before writing code, Symbiotic Code reads the feature request through a security lens. It identifies which components are security-sensitive and creates explicit design tasks for each one before any implementation begins. Claude Code starts implementing immediately.

During implementation, Symbiotic Code respects the organization's security guardrails. These are policies defined at the org or repository level, things like "always use parameterized queries," "never log PII," "use the internal auth library." The agent treats these as constraints that shape the code it writes, not checklists to run afterward.

After implementation, Symbiotic Code runs a deterministic SAST/IaC scanner on everything it wrote. When the scanner returns findings, a dedicated triage subagent reads the concerned files and assesses whether each finding is a real risk or a false positive given the code's context. Only true positives proceed to remediation. Then an OWASP audit runs, grounded in the scanner output, auditing every file that was modified against all 10 OWASP categories.

Claude Code does none of this unless you prompt it to. The security you get is whatever security the base model has absorbed from training data.

A Note on Study Transparency

This study was designed to be reproducible. Every prompt, every generated project, every scanner result, every LLM judgment, and all analysis scripts are available in the public GitHub repository linked below. You can run the full pipeline yourself against new prompts, different models, or different agents.

The results show what happens when you give a model good security tooling and a workflow designed to use it. The gap is not about model capability. It is about what the agent chooses to do with that capability.

The code for this study, including generation scripts, scanning pipeline, and LLM-as-a-judge evaluation, is available at: github.com/SymbioticSec/claude-code-vs-symbiotic-code-study

Interested in bringing Symbiotic Code to your engineering team? Book a demo to see the security harness in action.

‍

Secure-by-default Coding: Symbiotic Code outperforms Claude Code on all metrics