Open Source AI Security
Benchmark Report
Systematic evaluation of open-source LLMs against prompt injection, jailbreaks, data extraction, and the full OWASP LLM Top 10 — with enterprise deployment recommendations.
Evaluation Architecture & Attack Vectors
Five attack categories mapped to OWASP LLM Top 10, tested with 150 adversarial prompts per model.
Why Open-Source AI Security Is a Distinct Problem
When an enterprise deploys a proprietary API model, the security surface is narrower — the provider applies their own safety layers. Open-source models invert this. You control everything: the weights, serving infrastructure, sampling parameters, and safety configuration. That control is exactly why enterprises choose open-source. It is also why security teams need a completely different threat model.
Full Control
No provider safety net. Every configuration decision is yours.
Exposed Weights
Model weights are public — attack research moves faster.
Full Liability
Enterprise bears full compliance responsibility under EU AI Act.
Key Findings by Attack Category
Prompt Injection — Indirect Is the Real Threat
Direct injection is well-handled by most models. Indirect injection — adversarial instructions embedded in documents processed via RAG — is consistently the weakest point across all evaluated models. This matters because RAG architectures create a wide indirect injection surface by design. Mitigation: treat all retrieved documents as untrusted user input.
Data Extraction — System Prompts Are Not Secrets
With sufficient effort, most models can be induced to reveal portions of their system prompt. Extraction rates in our tests ranged from 12% to 41%. System prompts should never contain API keys, infrastructure details, or confidential business logic. Assume the system prompt is eventually readable by a determined user.
Jailbreaks — Multi-Turn Escalation Wins
Single-turn jailbreak attempts are largely ineffective against well-aligned models. Multi-turn escalation — establishing rapport over several turns then introducing the adversarial request — achieves significantly higher success rates. Role-play framing combined with multi-turn escalation remains effective across a broad range of models.
Supply Chain — Tool Outputs Are a Trust Boundary
Models with explicit tool-call separation in their serving architecture significantly outperformed models that concatenate all context into a single prompt before reasoning. How your infrastructure presents tool outputs to the model matters as much as the model's own alignment.
Enterprise Deployment Controls
Every enterprise open-source AI deployment should have these controls in place before handling sensitive data:
Input Validation Layer
Apply structured validation to all user inputs and retrieved documents before they enter the model's prompt. Flag inputs matching known injection patterns.
Output Filtering
Scan model outputs for sensitive patterns (PII, internal URLs, credential-like strings) before returning to users.
Session Isolation
Ensure conversation history from one user session is never accessible to another. Regularly misconfigured in shared inference deployments.
Rate Limiting & Anomaly Detection
Detect and block sessions that exhibit jailbreak patterns: excessive token generation, rapid alternation between compliant and adversarial prompts.
Red Team Before Release
No checklist substitutes for dedicated adversarial testing by people whose job it is to break your system. Schedule exercises at every major update.
Need a Security Review for Your AI Deployment?
Indigloo's security practice includes threat modelling, red team exercises, and SIEM integration for AI systems — built to meet EU AI Act and ISO 42001 requirements.
Request a Security Audit