Capability / Architecture Disclosure

mediumInfrastructure & internals

Also known as: system prompt leakage, tool-schema disclosure, agent reconnaissance

Definition

The AI reveals how it's built — its hidden instructions, the names and rules of the tools it can use, how the system is wired together. On its own that can seem harmless, but it hands an attacker the blueprint to plan a far more effective attack.

Where it attaches

The system components this risk arises at.

🧠 LLM🧩 Prompt Assembly🎛️ Orchestrator / Agent Loop🔧 Tool Runtime🧰 MCP / Plugin Server🧯 Output Guardrail

Detection signals

▸ Outputs containing system-prompt fragments, tool schemas, or function names
▸ Probing prompts asking the model to repeat its instructions or list its tools
▸ Disclosure of internal configuration, guardrail rules, or MCP server inventory
▸ Recon-style sessions enumerating capabilities before an exploit attempt

Controls & guardrails that address this

Grouped by control function, with the AI lifecycle stage(s) to apply each and the other risks it addresses. Filter by control category below.

Control category

Preventive · 2

Instruction hierarchy / privileged system promptinteractive

Training the model to treat the app's standing instructions as more authoritative than anything a user or document says.

Also addressesPrompt Injection (direct)Jailbreak

Least-privilege identity & scoped credentialsinteractive

Giving the agent only the keys it needs for the current task, not a master key to everything.

Also addressesPrompt Injection (direct)Indirect Prompt Injection Sensitive Data Leakage Excessive Agency Tool Misuse Unsafe Tool / Code Execution Tool Poisoning / MCP Description Attacks Confused Deputy (cross-agent)Rogue & Impersonated Agents Resource Exhaustion / Denial of Wallet

Detective · 2

Input guardrail / injection classifierinteractive

A screen that reads incoming messages and blocks obvious attacks or banned topics before the model sees them.

Also addressesPrompt Injection (direct)Jailbreak Sensitive Data Leakage Distributed / Cross-Agent Jailbreak Harmful / Non-Consensual Media Generation

Runtime monitoring & anomaly detectioninteractive

Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.

Corrective · 1

Governance: risk assessment, red-teaming & incident responseinteractive

The organisational habits around the AI: assessing risks before launch, actively trying to break it, and having a plan for when something goes wrong.

Also addressesOverreliance / Automation Bias Oversight & Audit-Trail Tampering Model Drift & Silent Degradation Supply-Chain Compromise Agent Misalignment / Goal Misgeneralization Abliteration / Safety Removal Model Backdoors / Sleeper Agents Inference-Time & Serving-Layer Manipulation Parasocial Attachment & Emotional Over-reliance Bias Amplification & Sycophancy Allocative Harm in Multi-User Arbitration Synthetic-Media Impersonation (Deepfakes & Voice Clones)Harmful / Non-Consensual Media Generation Watermark & Provenance Evasion Training-Data Rights & Provenance

Open these in the Control Library →

Framework mappings

OWASP LLM Top 10

LLM07:2025 System Prompt Leakage
LLM02:2025 Sensitive Information Disclosure

MITRE ATLAS

—

NIST AI RMF

MEASURE 2.7
MAP 5.1

Real-world cases

Actual published events that illustrate this risk — click through for the writeup and sources.

Bing 'Sydney' system-prompt leak2023

Users extracted Bing Chat's hidden system instructions and internal codename 'Sydney' via direct prompt injection shortly after launch.

PLeak — optimized prompt-leaking attack on real LLM apps2024

A CCS'24 paper that optimizes adversarial queries to reconstruct hidden system prompts, exactly recovering them for 68% of 50 real deployed Poe LLM apps.

System-prompt & tool-schema leak repositories (CL4R1T4S / leaked-system-prompts)2025

Crowd-sourced GitHub repos systematically extract and publish system prompts AND JSON tool/function schemas from deployed AI agents (Cursor, Windsurf, Claude Code, Devin, Copilot), one hitting ~140k stars.

DeepSeek system-prompt extraction via jailbreak (Wallarm)2025

Wallarm reported jailbreaking DeepSeek's chatbot to extract its full system prompt verbatim using a 'bias-based' technique; DeepSeek deployed a fix.

Project Glasswing — Claude 'Mythos' autonomously finds 10,000+ software vulnerabilities2026

Anthropic reports that 'Claude Mythos Preview' — an unreleased frontier model it describes as able to autonomously find and exploit software flaws — surfaced more than 10,000 high- or critical-severity vulnerabilities across major operating systems, browsers and open-source projects in roughly its first month under the defensive 'Project Glasswing' program, with Anthropic warning that finding flaws now far outpaces the human capacity to triage and patch them.

Browse all real-world cases →