Frontier cloud models such as Claude Opus set the bar for secure code generation and vulnerability remediation, but they cannot be deployed in air-gapped or data-sensitive environments. We study whether a mid-size, fully on-premise code model can close that gap. Starting from Codestral-22B, we build a security remediation dataset of 4,811 validated (vulnerable code, correct patch, incorrect patch) triples spanning 12 programming languages, apply supervised fine-tuning (SFT) followed by Direct Preference Optimization (DPO) in 4-bit QLoRA, and evaluate against frontier baselines on two public benchmarks.
The resulting model, which we call SecFix-V3, reaches a 98.0% safe rate on CyberSecEval (C/C++ remediation, n=200) and 85.1% on SecurityEval (Python generation, n=121), measured by static analysis. Against Claude Opus 4.8 evaluated on the identical items, SecFix-V3 is statistically indistinguishable on CyberSecEval (98.0% vs 97.0%, overlapping 95% bootstrap confidence intervals) and trails in point estimate on SecurityEval (85.1% vs 91.7%, intervals also overlapping). It exceeds the point estimates of the prior Opus 4.7 generation on both benchmarks, although that baseline was measured on only n=30 items. Against its own un-fine-tuned starting point, evaluated on the same items, the model gains +4.1 points on Python generation and +18.5 points on C/C++ remediation, with the improvement concentrated on the CWE classes the static analyzer can actually detect.
We report the full iteration history (V1 to V3) and find that dataset scale and language diversity, not training hyperparameters, drove every measurable gain. All training and evaluation ran on a single 128 GB unified-memory device with no external API calls during inference.
Large language models are increasingly used to write and repair code, and an important sub-question is whether the code they emit is secure. Frontier cloud models lead on this axis, but many organizations cannot send source code to a hosted API: regulated industries, defense, and proprietary codebases require an on-premise model whose weights and traffic never leave the perimeter. The practical question is therefore not “can the best cloud model write secure code” but “how close can a self-hosted model get, and at what size”.
This report documents a single, reproducible attempt to answer that question. We fine-tune Codestral-22B, a 22-billion-parameter open-weight code model, for two security tasks: generating secure code from a specification, and repairing a function that contains a known vulnerability. We train entirely on one NVIDIA DGX Spark (GB10, 128 GB unified memory) using 4-bit quantized low-rank adaptation, and we measure the result against Claude Opus 4.7 and 4.8 on the same benchmark items.
Security benchmarks for code models. SecurityEval provides CWE-targeted Python prompts and asks the model to complete a function; the completion is then scanned for insecure patterns. CyberSecEval covers a larger set of instruction-style remediation tasks, predominantly in C and C++. Both reduce “is this code secure” to a static-analysis verdict, which is cheap and reproducible but, as we discuss in Section 6, an upper bound rather than ground truth. We use opengrep as the static analyzer and the CWE taxonomy to stratify the data.
Preference optimization. Direct Preference Optimization (DPO) removes the separate reward model of RLHF and optimizes a policy directly from pairs of preferred and dispreferred responses. In our setting the preferred response is a correct patch and the dispreferred response is a plausible but still-vulnerable patch, which makes the preference signal directly aligned with the security objective.
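For reference, the standard DPO objective can be written as below. The symbol names are the conventional ones and are not taken from this report: $x$ is the prompt (instruction plus vulnerable code), $y_w$ the preferred response (the correct patch), $y_l$ the dispreferred response (the still-vulnerable patch), $\pi_{\mathrm{ref}}$ a frozen reference policy, $\sigma$ the logistic function, and $\beta$ the strength of the implicit constraint toward the reference.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```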
Vulnerability datasets. Our seeds draw on real-world fix corpora such as BigVul and CVEfixes, combined with synthetic multi-language vulnerability sets. Parameter-efficient fine-tuning follows LoRA and its quantized variant QLoRA; serving uses vLLM.
The student model is Codestral-22B (Mistral, open weights). All work runs on a single DGX Spark: a GB10 Grace-Blackwell device with 128 GB of unified memory. We never load model weights on the controlling laptop; training, data generation, and serving all execute remotely. The teacher used for data augmentation is Devstral-Small-24B, served locally through vLLM. Frontier baselines (Claude Opus 4.7 and 4.8) are reached through AWS Bedrock and are used only for evaluation, never for inference at deployment time.
Each training example is a triple: a vulnerable code snippet, a correct patch (the chosen response), and an incorrect-but-plausible patch (the rejected response). Seeds come from public corpora of real and synthetic vulnerabilities across several languages. For the final version we collected roughly 20,000 raw pairs through ten parallel loaders, then passed them through a strict validation pipeline:
A final stratified rebalancing caps any single CWE at 10% of the corpus to avoid overfitting to the most common weakness classes. The pipeline ends with a confirmation gate of nine automated checks, including a critical leakage check that no benchmark prompt appears in the training set, and writes the training splits only if all checks pass.
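The rebalancing and leakage steps described above could be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the field names `cwe` and `prompt`, the iterative way the cap is found, the random downsampling, and the whitespace- and case-normalized exact match used for leakage are all hypothetical choices.

```python
import random
from collections import defaultdict


def per_class_cap(counts, max_share=0.10):
    """Find a per-class cap such that no class exceeds max_share of the capped corpus.

    Capping a class shrinks the corpus, which lowers the allowed size again,
    so the cap is tightened until it stops changing.
    """
    cap = max(counts.values())
    while True:
        total = sum(min(n, cap) for n in counts.values())
        new_cap = min(cap, int(max_share * total))
        if new_cap == 0:
            raise ValueError("cap is infeasible: too few classes for this share")
        if new_cap == cap:
            return cap
        cap = new_cap


def rebalance(examples, max_share=0.10, seed=0):
    """Downsample over-represented CWE classes (hypothetical 'cwe' field)."""
    by_cwe = defaultdict(list)
    for ex in examples:
        by_cwe[ex["cwe"]].append(ex)
    cap = per_class_cap({c: len(v) for c, v in by_cwe.items()}, max_share)
    rng = random.Random(seed)
    kept = []
    for cwe in sorted(by_cwe):
        group = by_cwe[cwe]
        kept.extend(group if len(group) <= cap else rng.sample(group, cap))
    return kept


def normalize(text):
    """Collapse whitespace and lowercase so trivial edits do not hide a match."""
    return " ".join(text.split()).lower()


def leaked_prompts(train_examples, benchmark_prompts):
    """Return benchmark prompts that also appear in the training set."""
    train = {normalize(ex["prompt"]) for ex in train_examples}
    return [p for p in benchmark_prompts if normalize(p) in train]


# Toy corpus: one dominant class and twelve small ones (illustrative only).
examples = [{"cwe": "CWE-120", "prompt": f"fix overflow {i}"} for i in range(50)]
for k in range(12):
    examples += [{"cwe": f"OTHER-{k}", "prompt": f"task {k}-{i}"} for i in range(10)]

balanced = rebalance(examples)
shares = defaultdict(int)
for ex in balanced:
    shares[ex["cwe"]] += 1
max_share_seen = max(shares.values()) / len(balanced)
leaks = leaked_prompts(balanced, ["TASK   0-1", "unseen benchmark prompt"])
```

A gate in this style would refuse to write the training splits whenever `leaks` is non-empty. An exact-match check misses paraphrased prompts, so a stricter variant might compare n-gram overlap instead.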
Table 1 Dataset growth across the three iterations. V3 spans C, C++, Python, Java, JavaScript, PHP, Go, Ruby, C#, Swift, Kotlin, and Fortran.

Both stages use 4-bit NF4 QLoRA so that the 22B model fits comfortably in unified memory and only the low-rank adapters are trained.
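A back-of-the-envelope estimate shows why 4-bit quantization leaves ample headroom on a 128 GB device. The sketch below counts weight storage only and assumes roughly 0.5 extra bits per parameter for quantization constants (one 32-bit scale per block of 64 weights). That overhead figure is an assumption about the quantizer configuration and is not stated in this report.

```python
def weight_memory_gb(n_params, bits_per_param=4.0, overhead_bits=0.5):
    """Approximate weight memory in GB (10^9 bytes) for a quantized model."""
    return n_params * (bits_per_param + overhead_bits) / 8 / 1e9


nf4_gb = weight_memory_gb(22e9)              # 4-bit weights plus assumed constants
bf16_gb = weight_memory_gb(22e9, 16.0, 0.0)  # unquantized 16-bit weights, for comparison

print(f"4-bit: ~{nf4_gb:.1f} GB, 16-bit: ~{bf16_gb:.1f} GB")
```

Activations, adapter weights, optimizer state, and any KV cache come on top of this figure, so it is a lower bound on actual usage.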
Supervised fine-tuning. The model learns to map (instruction, vulnerable code) to the correct patch with a standard next-token objective. The V3 configuration uses LoRA rank 32, three epochs, and a cosine schedule peaking at 1.5e-4.
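A cosine schedule with the stated peak can be sketched as below. The linear warmup, its length, and the zero floor are assumptions made for illustration. The report specifies only the peak value and the cosine shape.

```python
import math


def cosine_lr(step, total_steps, peak_lr=1.5e-4, warmup_steps=0, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay to min_lr."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, max(0.0, progress))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical run of 100 optimizer steps with 10 warmup steps.
schedule = [cosine_lr(s, total_steps=100, warmup_steps=10) for s in range(101)]
```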
Direct preference optimization. Starting from the merged SFT model, DPO teaches the model to prefer the correct patch over the plausible wrong one. We use β = 0.05; an earlier version with β = 0.1 over-constrained the policy and regressed, which we discuss in Section 5.

We evaluate on SecurityEval (Python, secure generation) and the instruct subset of CyberSecEval (C/C++, remediation). A completion is safe if opengrep’s p/security-audit ruleset reports no finding; the safe rate is the fraction of safe completions. The two base models, SecFix-V3, and Claude Opus 4.8 are all evaluated on identical items (n = 121 and n = 200); the older SecFix iterations and the Opus 4.7 / Qwen baselines were measured at n = 30 and are shown for context only.
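The safe-rate definition reduces to a few lines once the analyzer output is available. The sketch below assumes each completion has been scanned separately and that the analyzer's JSON report carries a top-level `results` list of findings, as in Semgrep-style output. The report layout and the sample rule id are assumptions, not details given in this paper.

```python
import json


def is_safe(report):
    """A completion is safe if the analyzer report lists no findings."""
    return len(report.get("results", [])) == 0


def safe_rate(reports):
    """Fraction of completions with no findings."""
    if not reports:
        raise ValueError("no reports to score")
    return sum(is_safe(r) for r in reports) / len(reports)


# Four illustrative per-completion reports; only the second has a finding.
raw_reports = [
    '{"results": [], "errors": []}',
    '{"results": [{"check_id": "example.rule", "path": "a.py"}], "errors": []}',
    '{"results": [], "errors": []}',
    '{"results": [], "errors": []}',
]
reports = [json.loads(r) for r in raw_reports]
rate = safe_rate(reports)
```

Under this definition a completion that the analyzer cannot parse, or that falls outside every rule, also counts as safe, which is why the metric is read as an upper bound.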
Table 2 Safe rate (higher is better) by model and benchmark. SecFix-V3, Opus 4.8, and the two base models share sample sizes (n = 121 and n = 200); other rows use n = 30.

The only fully rigorous comparison is between SecFix-V3 and Opus 4.8 at matched n. We report 95% bootstrap confidence intervals on the safe rate.
Table 3 Safe rate with 95% bootstrap confidence intervals. The intervals overlap on both benchmarks, so we do not claim a statistically significant difference at these sample sizes.

On CyberSecEval the two models are indistinguishable: a 22B on-premise model matches the frontier cloud model on C/C++ remediation. On SecurityEval Opus 4.8 is ahead in point estimate, but the confidence intervals overlap, so we do not claim significance. Against the previous Opus 4.7 generation, SecFix-V3 improves on both benchmarks (85.1% vs 80.0% and 98.0% vs 96.7%).
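A percentile bootstrap for a safe rate can be sketched as follows. The resampling count, the percentile method, and the seed are assumptions, since the report does not say which bootstrap variant was used. The 196-of-200 outcome vector is the count implied by the reported 98.0% on CyberSecEval.

```python
import random


def bootstrap_ci(outcomes, n_resamples=5_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample items with replacement and recompute the safe rate each time.
    rates = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lo = rates[int((alpha / 2) * n_resamples)]
    hi = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


outcomes = [1] * 196 + [0] * 4  # 1 = safe completion, 0 = flagged
point = sum(outcomes) / len(outcomes)
lo, hi = bootstrap_ci(outcomes)
print(f"safe rate {point:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Because both models are scored on the same items, resampling item indices once and taking the per-item difference would give a paired interval on the gap itself, which is usually tighter than comparing two separate intervals.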
On a hand-picked battery of twelve hard cases, one per common CWE across five languages, Opus 4.8 produced a safe fix on all twelve, while SecFix-V3 was safe on ten. The two failures were Python eval injection and a C strcpy buffer overflow, the two cases where any superficially correct fix still leaves a detectable pattern. This sharper, small-sample view (where Opus leads 12 to 10) is consistent with the benchmark point estimates and is a useful corrective to the aggregate numbers, which include many easy items on which all models score.
Because SecFix-V3 starts from Codestral-22B, the cleanest measure of what the pipeline adds is the same base model evaluated without any fine-tuning, on the identical benchmark items. Codestral-22B base scores 81.0% on SecurityEval and 79.5% on CyberSecEval; SecFix-V3 raises these to 85.1% and 98.0%, a net gain of +4.1 and +18.5 points. The Devstral-24B teacher, an un-fine-tuned 24B model, lands in the same range as the base (83.5% / 78.5%), confirming that a generic strong code model is not enough on its own: the gain comes from the security-specific data and preference optimization, not from model scale. The effect is largest on C/C++ remediation, where the base model frequently reproduces the unsafe pattern it was asked to fix.

A natural objection is that an 80% base safe rate already looks strong, leaving little room to improve. This reading is misleading, for a reason worth making explicit. The safe rate is an opengrep verdict, and opengrep only fires on patterns it has rules for; a completion counts as safe whenever no rule matches, which includes every case the analyzer simply does not cover. The aggregate rate is therefore an upper bound inflated by uncovered cases, not a direct measure of security. The informative signal is concentrated on the CWEs the analyzer actually detects. Restricting the comparison to those classes shows that the fine-tune removes most of the residual unsafe completions that the base model still emits: buffer overflows (CWE-120: 32% → 3% unsafe; CWE-121: 20% → 3%), open redirect (CWE-601: 60% → 0%), XML external entities (CWE-611: 17% → 0%), and input validation (CWE-020: 83% → 33%). The improvement is real precisely where the measurement is trustworthy. We also report a regression: on insecure deserialization (CWE-502) the fine-tuned model is flagged on every instance, slightly worse than the base. Our data did not cover this class well.
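The difference between the aggregate safe rate and the per-CWE view can be illustrated with a small sketch. The records below are toy data invented for the example and do not come from the evaluation. The `(cwe, flagged)` record layout and the set of covered classes are likewise hypothetical.

```python
from collections import defaultdict


def aggregate_safe_rate(records):
    """Safe rate over all items, including classes the analyzer has no rules for."""
    return sum(not flagged for _, flagged in records) / len(records)


def unsafe_rate_by_cwe(records, covered_cwes):
    """Unsafe rate per CWE, restricted to classes the analyzer can detect."""
    flagged_count = defaultdict(int)
    total = defaultdict(int)
    for cwe, flagged in records:
        if cwe in covered_cwes:
            total[cwe] += 1
            flagged_count[cwe] += int(flagged)
    return {cwe: flagged_count[cwe] / total[cwe] for cwe in total}


covered = {"CWE-601", "CWE-611"}

# Toy outcomes: uncovered items can never be flagged, so they always look safe.
base = [
    ("CWE-601", True), ("CWE-601", True), ("CWE-601", False),
    ("CWE-611", True), ("CWE-611", False),
    ("UNCOVERED", False), ("UNCOVERED", False),
    ("UNCOVERED", False), ("UNCOVERED", False),
]
tuned = [
    ("CWE-601", False), ("CWE-601", False), ("CWE-601", False),
    ("CWE-611", False), ("CWE-611", False),
    ("UNCOVERED", False), ("UNCOVERED", False),
    ("UNCOVERED", False), ("UNCOVERED", False),
]

base_aggregate = aggregate_safe_rate(base)
base_by_cwe = unsafe_rate_by_cwe(base, covered)
tuned_by_cwe = unsafe_rate_by_cwe(tuned, covered)
```

In this toy case the base model's aggregate safe rate is two thirds even though it is unsafe on most of the items the analyzer can actually judge, which is the inflation effect described above.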

The three versions form a controlled ablation on dataset composition, since the training recipe changed little between them.

V1 → V2: add a second language. V1 trained on 502 C-only pairs and reached 73.3% on Python SecurityEval, below every baseline: training exclusively on C degraded Python behavior. Adding 434 Python pairs in V2 lifted SecurityEval to 80.0%, matching Opus 4.7.
The V2 DPO regression. Applying DPO with β = 0.1 in V2 lowered the safe rate from 80.0% to 76.7%. The reward margin plateaued near 0.11 and pairwise preference accuracy stalled at 80%, indicating an over-constrained policy rather than genuine preference learning.
V3: scale and diversity. V3 grew the dataset to 4,811 pairs across twelve languages and lowered DPO β to 0.05. SFT evaluation loss fell from 0.31 to 0.11 (perplexity 1.36 → 1.12). DPO then behaved correctly: the reward margin climbed past 3.0 and pairwise accuracy reached 100%, versus 0.11 and 80% in V2. The benchmark safe rate rose to 85.1% / 98.0%.
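The two DPO diagnostics quoted above, reward margin and pairwise accuracy, are commonly computed from sequence log-probabilities under the policy and the reference model. The sketch below shows that computation on invented numbers. The dictionary keys and the log-probability values are hypothetical, and the report does not state which training library produced its figures.

```python
import math


def implicit_reward(policy_logp, ref_logp, beta):
    """DPO implicit reward: beta times the log-ratio of policy to reference."""
    return beta * (policy_logp - ref_logp)


def dpo_metrics(pairs, beta=0.05):
    """Mean loss, mean reward margin, and pairwise accuracy over preference pairs."""
    margins = []
    for p in pairs:
        r_chosen = implicit_reward(p["policy_chosen"], p["ref_chosen"], beta)
        r_rejected = implicit_reward(p["policy_rejected"], p["ref_rejected"], beta)
        margins.append(r_chosen - r_rejected)
    n = len(margins)
    return {
        # -log(sigmoid(m)) written as log(1 + exp(-m)).
        "loss": sum(math.log1p(math.exp(-m)) for m in margins) / n,
        "reward_margin": sum(margins) / n,
        "accuracy": sum(m > 0 for m in margins) / n,
    }


# Two illustrative pairs: the first is ranked correctly, the second is not.
pairs = [
    {"policy_chosen": -20.0, "ref_chosen": -40.0,
     "policy_rejected": -60.0, "ref_rejected": -45.0},
    {"policy_chosen": -30.0, "ref_chosen": -32.0,
     "policy_rejected": -31.0, "ref_rejected": -35.0},
]
metrics = dpo_metrics(pairs, beta=0.05)
```

A margin that stays near zero means the policy barely separates correct from incorrect patches relative to the reference, which matches the V2 symptom of a plateau near 0.11.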
The consistent signal across versions is that dataset breadth, not optimizer tuning, produced the gains. The single largest jump came from increasing training data sixfold and language coverage from two to twelve.
eval injection (CWE-095), where every evaluated instance was flagged unsafe.
A 22-billion-parameter open-weight code model, fine-tuned entirely on a single on-premise device with SFT followed by DPO over a carefully validated multi-language dataset, reaches statistical parity with Claude Opus 4.8 on security code remediation and exceeds the previous frontier generation in point estimate. The decisive factor was data: scaling from 502 to 4,811 validated pairs and from one to twelve languages drove every measurable improvement, while preference optimization only helped once its strength was tuned and its training signal was clean. For organizations that cannot send code to a hosted API, the result is encouraging: frontier-level security remediation is achievable in a self-hosted model an order of magnitude smaller than the cloud frontier.
Reproducibility note. All scripts, training configurations, dataset statistics, raw evaluation outputs, and bootstrap confidence-interval files are stored alongside this paper. The training pipeline is gated on an automated nine-check confirmation step that refuses to start unless the dataset passes leakage and distribution validation.
