Security Research · 2026 Edition

Open Source AI Security
Benchmark Report

Systematic evaluation of open-source LLMs against prompt injection, jailbreaks, data extraction, and the full OWASP LLM Top 10 — with enterprise deployment recommendations.

16 min read March 2026 Security Research

Security Framework

Evaluation Architecture & Attack Vectors

Five attack categories mapped to OWASP LLM Top 10, tested with 150 adversarial prompts per model.

🛡️

Why Open-Source AI Security Is a Distinct Problem

When an enterprise deploys a proprietary API model, the security surface is narrower — the provider applies their own safety layers. Open-source models invert this. You control everything: the weights, serving infrastructure, sampling parameters, and safety configuration. That control is exactly why enterprises choose open-source. It is also why security teams need a completely different threat model.

⚠️

Full Control

No provider safety net. Every configuration decision is yours.

🔬

Exposed Weights

Model weights are public — attack research moves faster.

📋

Full Liability

Enterprise bears full compliance responsibility under EU AI Act.

📊

Key Findings by Attack Category

CRITICAL

Prompt Injection — Indirect Is the Real Threat

Direct injection is well-handled by most models. Indirect injection — adversarial instructions embedded in documents processed via RAG — is consistently the weakest point across all evaluated models. This matters because RAG architectures create a wide indirect injection surface by design. Mitigation: treat all retrieved documents as untrusted user input.

HIGH

Data Extraction — System Prompts Are Not Secrets

With sufficient effort, most models can be induced to reveal portions of their system prompt. Extraction rates in our tests ranged from 12% to 41%. System prompts should never contain API keys, infrastructure details, or confidential business logic. Assume the system prompt is eventually readable by a determined user.

HIGH

Jailbreaks — Multi-Turn Escalation Wins

Single-turn jailbreak attempts are largely ineffective against well-aligned models. Multi-turn escalation — establishing rapport over several turns then introducing the adversarial request — achieves significantly higher success rates. Role-play framing combined with multi-turn escalation remains effective across a broad range of models.

MEDIUM

Supply Chain — Tool Outputs Are a Trust Boundary

Models with explicit tool-call separation in their serving architecture significantly outperformed models that concatenate all context into a single prompt before reasoning. How your infrastructure presents tool outputs to the model matters as much as the model's own alignment.

✅

Enterprise Deployment Controls

Every enterprise open-source AI deployment should have these controls in place before handling sensitive data:

🛡️

Input Validation Layer

Apply structured validation to all user inputs and retrieved documents before they enter the model's prompt. Flag inputs matching known injection patterns.

🔍

Output Filtering

Scan model outputs for sensitive patterns (PII, internal URLs, credential-like strings) before returning to users.

🔒

Session Isolation

Ensure conversation history from one user session is never accessible to another. Regularly misconfigured in shared inference deployments.

📡

Rate Limiting & Anomaly Detection

Detect and block sessions that exhibit jailbreak patterns: excessive token generation, rapid alternation between compliant and adversarial prompts.

🎯

Red Team Before Release

No checklist substitutes for dedicated adversarial testing by people whose job it is to break your system. Schedule exercises at every major update.

Need a Security Review for Your AI Deployment?

Indigloo's security practice includes threat modelling, red team exercises, and SIEM integration for AI systems — built to meet EU AI Act and ISO 42001 requirements.

Request a Security Audit

Open Source AI SecurityBenchmark Report