Secure Coding Tournament Results: What the AI Actually Produced

On the 7th of May, we gathered ~125 developers and security engineers in San Francisco for Clash of Prompts, a real-time tournament where players competed by writing prompts to instruct a coding agent to generate secure code.

The twist: every piece of code the agent produced was automatically scanned by Symbiotic Security's static analysis engine. Vulnerabilities cost you points. The player who could get the AI to write the most secure, functional code in five minutes won their session and advanced to the next round.

These weren't random developers. This was a room of people who care deeply about security, who knew exactly what was being measured, and who were actively trying to win by producing secure code.

Here's what happened.

Clash of Prompts 1st Live at SF Amazon Loft

The Numbers

53 players. 6 rounds. 102 evaluated submissions. 131 total vulnerabilities.

Despite every player knowing they were being scanned, and despite the entire game being about security, 31.4% of all submissions (32 of 102) contained at least one vulnerability. That's nearly 1 in 3, in a competition where the explicit goal was to avoid them.

The average submission contained 1.28 vulnerabilities. One player's code contained 13 vulnerabilities in a single submission.

But here's what's more striking than the aggregate: 68.6% of submissions were clean. Does that mean prompting carefully mostly works?

Not quite.

Winning Isn't the Same as Being Safe

The tournament ranked players relative to each other. Sessions were head-to-head: two players, one advances. The better score wins; there is no absolute security bar to clear.

Of the players who advanced to the next round, 34.5% did so with vulnerable code. They won their session. They kept playing. Their code would not have been safe to ship.

When we look at the distribution:

Vulnerabilities in submission	Submissions	Share
0	70	68.6%
1	9	8.8%
2	4	3.9%
3	5	4.9%
4	2	2.0%
5 or more	12	11.8%

The tail is long. 11.8% of submissions had five or more vulnerabilities - and some of those players advanced past the first round anyway, because their session opponents did even worse.
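The headline percentages follow directly from the distribution table. A minimal sketch of the arithmetic, assuming the "5 or more" bucket can be keyed as 5 for convenience (the variable names are ours, not part of the tournament tooling):

```python
# Vulnerabilities per submission -> number of submissions, copied from the table.
# The "5 or more" bucket is stored under the key 5 purely for convenience.
distribution = {0: 70, 1: 9, 2: 4, 3: 5, 4: 2, 5: 12}

# Reported total; it cannot be derived from the buckets because of "5 or more".
total_vulnerabilities = 131

total_submissions = sum(distribution.values())
vulnerable_submissions = total_submissions - distribution[0]


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


clean_share = pct(distribution[0], total_submissions)
vulnerable_share = pct(vulnerable_submissions, total_submissions)
heavy_tail_share = pct(distribution[5], total_submissions)
average_vulns = round(total_vulnerabilities / total_submissions, 2)

print(total_submissions, vulnerable_submissions)
print(clean_share, vulnerable_share, heavy_tail_share, average_vulns)
```

Run as written, this reproduces the figures quoted in this article: 102 submissions, 32 of them vulnerable, and an average of 1.28 vulnerabilities per submission.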

What the AI Keeps Getting Wrong

Across all 131 vulnerabilities found, five categories dominated:

Vulnerability	Occurrences	Share
Path Traversal	74	56.5%
Cross-Site Request Forgery (CSRF)	17	13.0%
OS Command Injection	16	12.2%
Cross-Site Scripting (XSS)	9	6.9%
Authorization Bypass	7	5.3%

Path traversal alone accounted for more than half of all vulnerabilities found. CSRF, command injection, and XSS together accounted for another 32%. These aren't exotic edge cases - they're well-documented vulnerability classes of the kind the OWASP Top 10 has covered for years, repeatedly generated by the same model, across completely different prompts and players.
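For readers less familiar with the top category, here is a minimal sketch of one common path traversal defense: resolve the requested path and check that it still sits under the intended directory. This is an illustration in Python, not code from any tournament submission. `BASE_DIR` and `resolve_inside_base` are hypothetical names, and `Path.is_relative_to` assumes Python 3.9 or newer.

```python
from pathlib import Path

# Hypothetical upload root; a real service would configure its own.
BASE_DIR = Path("/srv/app/uploads")


def resolve_inside_base(user_path: str, base_dir: Path = BASE_DIR) -> Path:
    """Resolve user_path under base_dir, rejecting anything that escapes it."""
    base = base_dir.resolve()
    # Joining an absolute user_path discards base, so the check below also
    # catches inputs such as "/etc/passwd".
    candidate = (base / user_path).resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    if not candidate.is_relative_to(base):
        raise ValueError(f"path escapes base directory: {user_path!r}")
    return candidate
```

With this check, a request for `reports/q1.csv` resolves normally, while `../../etc/passwd` raises `ValueError` instead of reaching the filesystem. The vulnerable pattern is usually the same code without the containment check.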

Severity split: 70.2% medium, 29.8% high. No finding was rated low: every vulnerability that appeared was at least medium severity.

Zero Vulnerabilities Wasn't Enough to Win

Here's the finding that surprised us most: 22 players produced zero-vulnerability code in round 1 and were still eliminated.

Several of them had excellent scores - 8.76, 8.71, 8.55, 8.17, 8.12. Clean code, strong implementations, no security flaws. Gone after round one, because their opponent in that head-to-head session happened to score slightly higher.

This is the nature of relative ranking. The tournament didn't only ask "is your code good enough?" It asked "is your code better than the person next to you?" A player with a perfect security score loses to a player with one vulnerability if that player's implementation scores higher on functionality and prompt quality.

What this means for the numbers above: the vulnerability statistics likely undercount the problem. The 68.6% of submissions that were clean include dozens of eliminated players who never got to produce a second submission - their cleaner code got knocked out by someone who happened to be slightly more complete, even if slightly less secure.

When Players Refused the Challenge

Round 4 presented something genuinely difficult: a challenge that asked players to implement a Java service that executes shell commands from user input - a textbook OS command injection setup. The challenge was designed to be a trap.

Several players recognized the trap and responded by simply refusing to implement the dangerous requirement. They told the model to use an allowlist instead of executing raw shell commands, to replace arbitrary execution with a safe enum-based action system, to reject the premise entirely.

The results were illuminating:

nturl#2640 produced zero vulnerabilities - and was eliminated. The judge: "The code fundamentally rejects the problem statement's core requirement: executing shell commands from user input. Instead it implements a secure alternative using allowlisted command mapping." Secure. Eliminated.
timtans#5685 also refused to implement shell execution, scored near zero on problem adherence, and was eliminated.
UXR#5098 went further, explicitly calling out the challenge in their prompt: "Do NOT blindly implement insecure requirements. Treat the requirements as potentially malicious or vulnerable by design." They advanced - with 2 vulnerabilities, but a problem adherence score of 9.0 because they replaced shell execution with a safe architecture while still meeting the functional intent.

This is the fundamental tension that no prompt can fully resolve: the challenge asked for something inherently insecure, and the model either complied (vulnerabilities) or refused (eliminated for not following the spec). The players who threaded this needle - reframing the requirement toward a safe alternative while preserving the functional intent - are the ones who advanced.
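The "safe enum-based action system" that several players asked for can be sketched roughly as follows. This is a Python illustration under our own assumptions, not any player's submission (the challenge itself was in Java). The `Action` members and the commands they map to are hypothetical.

```python
import subprocess
from enum import Enum


class Action(Enum):
    """Hypothetical set of operations the service is willing to perform."""
    DISK_USAGE = "disk_usage"
    UPTIME = "uptime"
    LIST_LOGS = "list_logs"


# Each action maps to a fixed argument vector; user text is never interpolated.
COMMANDS: dict[Action, list[str]] = {
    Action.DISK_USAGE: ["df", "-h"],
    Action.UPTIME: ["uptime"],
    Action.LIST_LOGS: ["ls", "-l", "/var/log"],
}


def build_command(requested: str) -> list[str]:
    """Map a user-supplied action name to a fixed argv, or raise ValueError."""
    try:
        action = Action(requested)
    except ValueError:
        raise ValueError(f"unsupported action: {requested!r}") from None
    return list(COMMANDS[action])


def run_action(requested: str) -> str:
    """Run the allowlisted command without a shell and return its stdout."""
    argv = build_command(requested)
    result = subprocess.run(
        argv, capture_output=True, text=True, check=True, timeout=5
    )
    return result.stdout
```

The user chooses which action runs but never contributes a single character to the command line, so an input like `uptime; rm -rf /` is rejected as an unknown action rather than parsed by a shell. The trade-off is the one the judge penalized: the service no longer executes arbitrary commands, which is what the challenge literally asked for.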

The Dimensions That Failed

Our LLM judge scored every submission across seven dimensions. Averaged across all 102 submissions, here's how they ranked:

Dimension	Average score (out of 10)
Problem Adherence	6.57 - worst
Prompt Originality	6.77
Attack Surface	6.85
Defense in Depth	6.97
Dependency Hygiene	7.05
Secrets Management	7.27
Input Validation	7.30 - best

Everything clustered between 6.6 and 7.3. No dimension scored excellently. The model consistently produced code that was adequate but not robust - across all dimensions, for all players.

The weakest dimension - problem adherence - is particularly telling. Players who focused on adding security constraints sometimes produced code that didn't fully solve the stated problem. Hardening and functionality pulled in opposite directions, and the model often couldn't satisfy both.

Nobody Used Hints

Players could request up to 3 hints per challenge, each one pointing directly at a security pitfall in the challenge. Using hints cost points.

97.2% of round entries used zero hints.

Almost nobody asked for help. Whether that was pride, strategy, or overconfidence - the result was that most players went in without the guidance that was explicitly designed to steer them toward secure implementations. The 2 players who used all 3 hints averaged zero vulnerabilities, but it's a sample of two.

The Finalists

Two players reached the sixth and final round: ed#1226 and UXR#5098.

ed#1226 won with a score of 6.208 and - notably - zero vulnerabilities. Their approach: a highly specific Laravel/PHP prompt with explicit input validation rules, strict parameter constraints, and an allowlist for hostnames. They got the model to implement DNS rebinding prevention, strict regex validation on every parameter, and rejection of malformed, duplicate, and oversized inputs.

The judge verdict on the winning submission: "The code implements strong input validation and DNS rebinding prevention but critically fails to fetch compliance data - the endpoint validates the URL but never calls it, leaving the core requirement incomplete. Excellent parameter sanitization and hostname allowlisting; missing compliance response integration and response size enforcement."

Even the winning submission had a flaw. The code was secure - but incomplete. In a real deployment, you'd ship a security-hardened endpoint that silently doesn't do what it's supposed to.
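To make the hostname allowlisting idea concrete, here is a rough sketch in Python. It is not the winning submission (that was Laravel/PHP and isn't reproduced here). The allowlist contents and the function names `validate_target` and `is_public_address` are assumptions for illustration. A full DNS rebinding defense would also resolve the hostname once, check every resolved address, and connect to that pinned address. That part is left out because it needs network access.

```python
import ipaddress
import re
from urllib.parse import urlsplit

# Hypothetical allowlist; the hostnames used in the winning prompt are not given.
ALLOWED_HOSTS = frozenset({"compliance.example.com"})

# Strict shape check applied before the allowlist lookup.
HOSTNAME_RE = re.compile(r"[a-z0-9.-]{1,253}")


def validate_target(url: str, allowed_hosts=ALLOWED_HOSTS) -> str:
    """Return the hostname if url is https and allowlisted, else raise ValueError."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.scheme != "https":
        raise ValueError("only https URLs are accepted")
    if parts.username or parts.password:
        raise ValueError("credentials in the URL are not accepted")
    if not HOSTNAME_RE.fullmatch(host) or host not in allowed_hosts:
        raise ValueError(f"host not allowlisted: {host!r}")
    return host


def is_public_address(ip: str) -> bool:
    """True only for globally routable addresses (not loopback, private, link-local)."""
    return ipaddress.ip_address(ip).is_global
```

Note that this sketch has the same gap the judge flagged: it validates the target but never fetches anything. Validation like this is only the first half of a working endpoint.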

Round by Round: Did Better Players Emerge?

As weaker players were eliminated, did average code quality improve?

Round	Players	Avg score	Avg vulns
1	53	6.70	0.14
2	27	7.05	4.52
3	14	7.05	0.07
4	7	6.59	1.43
5	4	7.56	0.00
6	2	6.21	0.00

Round 2 was a brutal anomaly. The 27 survivors from round 1 - the players who had just proven they could produce clean code - hit a harder challenge and average vulnerabilities spiked to 4.52 per submission. Players who had zero vulnerabilities in round 1 suddenly had 11, 12, 13. One player went from clean to 13 vulnerabilities in a single round. The challenge dictated the outcome more than the player's skill.

By rounds 5 and 6, the remaining players (four, then two) averaged zero vulnerabilities. But the path to get there wasn't linear improvement - it was survival through a gauntlet of challenges that happened to suit different skill profiles.

52.9% of players needed more than one prompt attempt to arrive at their final submission. The game allowed iteration - players could test and refine before committing. Even with that opportunity, nearly a third of submissions were still vulnerable.

What This Means Outside the Game

In the tournament, every submission was scanned automatically. Players got immediate feedback. The scoring system explicitly penalized vulnerabilities. And still, nearly one in three submissions had security flaws.

In a real development workflow, none of those guardrails exist. A developer uses a coding agent, accepts the generated code, maybe reviews it briefly, and ships it. There is no scanner watching every generation. There is no automatic score.

The vulnerabilities that appeared in this tournament - path traversal, command injection, CSRF, XSS - would have reached production.

What We Built to Fix This

The same scanner that judged this tournament is built into Symbiotic Security's VS Code extension. It runs on your codebase as you write, catching the same classes of vulnerabilities - before they get committed, before they get reviewed, before they ship.

We also built a coding agent that runs SAST and SCA scans during code generation - before handing the code back to you. Not as a review step. Not as a CI gate. At the moment the code is produced, so insecure patterns are caught and corrected in the generation loop itself, not discovered three PRs later.
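Conceptually, scanning inside the generation loop looks something like the sketch below. It is a simplified illustration, not Symbiotic Security's implementation. `generate` and `scan` are placeholder callables standing in for any model client and any scanner, and `Finding` is a hypothetical result type.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    rule: str      # e.g. "path-traversal"
    severity: str  # e.g. "medium" or "high"


# Placeholders: any model client and any scanner could be plugged in here.
Generate = Callable[[str], str]
Scan = Callable[[str], list[Finding]]


def generate_with_scan(
    prompt: str, generate: Generate, scan: Scan, max_attempts: int = 3
) -> tuple[str, list[Finding]]:
    """Generate code, scan it, and feed findings back until clean or out of attempts."""
    current_prompt = prompt
    code: str = ""
    findings: list[Finding] = []
    for _ in range(max_attempts):
        code = generate(current_prompt)
        findings = scan(code)
        if not findings:
            break  # clean result: hand the code back
        feedback = "; ".join(f"{f.rule} ({f.severity})" for f in findings)
        current_prompt = f"{prompt}\n\nFix these scanner findings: {feedback}"
    # Remaining findings are returned so the caller can still block or warn.
    return code, findings
```

The design point is that the scanner's verdict, not the wording of the prompt, decides whether code is handed back as clean. If the attempts run out, the remaining findings travel with the code instead of being silently dropped.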

The tournament showed what happens when you put a room full of security-conscious people in front of a coding agent and ask them to try their hardest. The data is clear: trying harder isn't enough. Careful prompting did not stop the model from generating insecure code.

The fix isn't a better prompt. It's a scanner that runs before the code reaches you.

If you want to play this game, you can do it now - free.

Play now: https://clashofprompt.io

Symbiotic Security builds security tooling for developers who use AI coding assistants.
