Security Research · 2026 Edition

Open Source AI Security
Benchmark Report

Systematic evaluation of open-source LLMs against prompt injection, jailbreaks, data extraction, and the full OWASP LLM Top 10 — with enterprise deployment recommendations.

16 min read March 2026 Security Research
Security Framework

Evaluation Architecture & Attack Vectors

Five attack categories mapped to OWASP LLM Top 10, tested with 150 adversarial prompts per model.

ATTACK VECTORS OWASP LLM Top 10 🎯 Prompt Injection LLM01 Direct & Indirect 🔓 Data Extraction LLM02/06 System Prompt Leak 🚨 Jailbreak LLM01/04 Multi-turn Escalation ⛓️ Supply Chain LLM03/08 Tool Output Injection 💥 DoS Patterns LLM04 Token Exhaustion TEST SUITE 150 prompts × category Raw Model Inference No Safety Wrappers Scoring Scale: 0 — Failed 1 — Partial Leak 2 — Fully Resisted Tracked Metrics: • Injection Success Rate • Extraction Rate (%) • Jailbreak Success % • Tool Hijack Rate • Token Bomb Rate Aggregate: Security Score / 10 MODEL EVALUATION Open-Source LLMs 🦙 Llama 3.3 70B Meta General purpose 🌊 Mistral Large Mistral AI European, efficient 💎 Gemma 2 27B Google Safety-tuned 🦅 Falcon 180B TII UAE Arabic & multilingual 🌐 Qwen 2.5 72B Alibaba Code + reasoning OUTPUT Per-Model Report Security Score /10 Risk Heatmap Weak Attack Vectors Mitigations Input Validation Output Filtering Session Isolation Rate Limiting Red Team Plan Compliance Map EU AI Act ISO 42001 ISO 27001
🛡️

Why Open-Source AI Security Is a Distinct Problem

When an enterprise deploys a proprietary API model, the security surface is narrower — the provider applies their own safety layers. Open-source models invert this. You control everything: the weights, serving infrastructure, sampling parameters, and safety configuration. That control is exactly why enterprises choose open-source. It is also why security teams need a completely different threat model.

⚠️

Full Control

No provider safety net. Every configuration decision is yours.

🔬

Exposed Weights

Model weights are public — attack research moves faster.

📋

Full Liability

Enterprise bears full compliance responsibility under EU AI Act.

📊

Key Findings by Attack Category

CRITICAL
Prompt Injection — Indirect Is the Real Threat

Direct injection is well-handled by most models. Indirect injection — adversarial instructions embedded in documents processed via RAG — is consistently the weakest point across all evaluated models. This matters because RAG architectures create a wide indirect injection surface by design. Mitigation: treat all retrieved documents as untrusted user input.

HIGH
Data Extraction — System Prompts Are Not Secrets

With sufficient effort, most models can be induced to reveal portions of their system prompt. Extraction rates in our tests ranged from 12% to 41%. System prompts should never contain API keys, infrastructure details, or confidential business logic. Assume the system prompt is eventually readable by a determined user.

HIGH
Jailbreaks — Multi-Turn Escalation Wins

Single-turn jailbreak attempts are largely ineffective against well-aligned models. Multi-turn escalation — establishing rapport over several turns then introducing the adversarial request — achieves significantly higher success rates. Role-play framing combined with multi-turn escalation remains effective across a broad range of models.

MEDIUM
Supply Chain — Tool Outputs Are a Trust Boundary

Models with explicit tool-call separation in their serving architecture significantly outperformed models that concatenate all context into a single prompt before reasoning. How your infrastructure presents tool outputs to the model matters as much as the model's own alignment.

Enterprise Deployment Controls

Every enterprise open-source AI deployment should have these controls in place before handling sensitive data:

🛡️

Input Validation Layer

Apply structured validation to all user inputs and retrieved documents before they enter the model's prompt. Flag inputs matching known injection patterns.

🔍

Output Filtering

Scan model outputs for sensitive patterns (PII, internal URLs, credential-like strings) before returning to users.

🔒

Session Isolation

Ensure conversation history from one user session is never accessible to another. Regularly misconfigured in shared inference deployments.

📡

Rate Limiting & Anomaly Detection

Detect and block sessions that exhibit jailbreak patterns: excessive token generation, rapid alternation between compliant and adversarial prompts.

🎯

Red Team Before Release

No checklist substitutes for dedicated adversarial testing by people whose job it is to break your system. Schedule exercises at every major update.

Need a Security Review for Your AI Deployment?

Indigloo's security practice includes threat modelling, red team exercises, and SIEM integration for AI systems — built to meet EU AI Act and ISO 42001 requirements.

Request a Security Audit