Capability / Architecture Disclosure
medium
Infrastructure & internals
Definition
The AI reveals how it's built — its hidden instructions, the names and rules of the tools it can use, how the system is wired together. On its own that can seem harmless, but it hands an attacker the blueprint to plan a far more effective attack.
Where it attaches
The system components where this risk arises.
Detection signals
- Outputs containing system-prompt fragments, tool schemas, or function names
- Probing prompts asking the model to repeat its instructions or list its tools
- Disclosure of internal configuration, guardrail rules, or MCP server inventory
- Recon-style sessions enumerating capabilities before an exploit attempt
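The first two signals above lend themselves to simple automated checks. The following is a minimal sketch, not a production detector: the regular expressions, the 5-word shingle size, and the function names (`is_probing_prompt`, `leaked_fraction`, `mentions_tool_names`) are all assumptions made for illustration. The patterns will produce false positives on benign requests and should be tuned per deployment.

```python
import re

# Hypothetical probing phrases (illustrative only; tune for your deployment).
PROBE_PATTERNS = [
    re.compile(
        r"\b(repeat|print|reveal|show|output)\b.{0,40}"
        r"\b(system prompt|instructions|initial prompt)\b",
        re.I | re.S,
    ),
    re.compile(
        r"\b(list|enumerate|what are)\b.{0,40}"
        r"\b(tools|functions|plugins|mcp servers)\b",
        re.I | re.S,
    ),
]


def is_probing_prompt(user_message: str) -> bool:
    """Flag messages that ask the model to repeat instructions or list tools."""
    return any(p.search(user_message) for p in PROBE_PATTERNS)


def _shingles(text: str, n: int) -> set[tuple[str, ...]]:
    """Lower-cased word n-grams of the text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def leaked_fraction(output: str, system_prompt: str, n: int = 5) -> float:
    """Fraction of the system prompt's n-word shingles that appear in the output."""
    secret = _shingles(system_prompt, n)
    if not secret:
        return 0.0
    return len(secret & _shingles(output, n)) / len(secret)


def mentions_tool_names(output: str, tool_names: list[str]) -> list[str]:
    """Return the internal tool names that appear verbatim in the output."""
    return [
        name for name in tool_names
        if re.search(rf"\b{re.escape(name)}\b", output)
    ]
```

A deployment might call `leaked_fraction` on every response and block or log anything above a chosen threshold. Word-level shingles catch verbatim and near-verbatim leaks but not paraphrases or translations, so this is best treated as one signal among several.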
Controls & guardrails that address this
5 controls, grouped by control function, with the AI lifecycle stage(s) to apply each and the other risks it addresses.
- Training the model to treat the app's standing instructions as more authoritative than anything a user or document says.
- Giving the agent only the keys it needs for the current task, not a master key to everything.
- A screen that reads incoming messages and blocks obvious attacks or banned topics before the model sees them.
- Live dashboards and alarms that notice unusual behaviour: spikes in errors, weird actions, sudden data access.
- The organisational habits around the AI: assessing risks before launch, actively trying to break it, and having a plan for when something goes wrong.
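Two of the controls above, least-privilege access and live monitoring, can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not a reference implementation: the tool registry, scope strings, task names, and the `ReconMonitor` class are all hypothetical. The idea is that a model which is only shown the tools its current task needs cannot disclose the rest, and that a session sending several probing prompts in a short window is worth an alert.

```python
from collections import defaultdict, deque
from typing import Optional
import time

# Hypothetical tool registry: names and scopes are illustrative only.
TOOL_REGISTRY = {
    "search_orders": {"description": "Look up a customer's orders", "scope": "orders:read"},
    "refund_payment": {"description": "Issue a refund", "scope": "payments:write"},
    "read_kb_article": {"description": "Fetch a help-centre article", "scope": "kb:read"},
}

# Hypothetical mapping from task type to the scopes that task needs.
TASK_SCOPES = {
    "answer_faq": {"kb:read"},
    "order_status": {"kb:read", "orders:read"},
}


def tools_for_task(task: str) -> dict:
    """Return only the tool definitions whose scope the task is granted."""
    allowed = TASK_SCOPES.get(task, set())  # unknown task: expose no tools
    return {
        name: spec
        for name, spec in TOOL_REGISTRY.items()
        if spec["scope"] in allowed
    }


class ReconMonitor:
    """Alert when one session sends many probing prompts within a time window."""

    def __init__(self, threshold: int = 3, window_seconds: float = 300.0):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # session_id -> probe timestamps

    def record_probe(self, session_id: str, now: Optional[float] = None) -> bool:
        """Record one probing prompt; return True if the session should alert."""
        now = time.time() if now is None else now
        events = self._events[session_id]
        events.append(now)
        # Drop probes that have fallen out of the sliding window.
        while events and now - events[0] > self.window_seconds:
            events.popleft()
        return len(events) >= self.threshold
```

In use, the application would pass `tools_for_task(task)` to the model instead of the full registry, and call `record_probe` whenever an input screen flags a message as a probe. The default threshold and window are placeholders that would need tuning against real traffic.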
Framework mappings
- OWASP LLM07:2025 System Prompt Leakage
- OWASP LLM02:2025 Sensitive Information Disclosure
- NIST AI RMF MEASURE 2.7
- NIST AI RMF MAP 5.1
Real-world cases
5 cases: published events that illustrate this risk.
- Users extracted Bing Chat's hidden system instructions and internal codename 'Sydney' via direct prompt injection shortly after launch.
- A CCS'24 paper optimizes adversarial queries to reconstruct hidden system prompts, exactly recovering them for 68% of 50 real deployed Poe LLM apps.
- Crowd-sourced GitHub repos systematically extract and publish system prompts and JSON tool/function schemas from deployed AI agents (Cursor, Windsurf, Claude Code, Devin, Copilot); one has reached ~140k stars.
- Wallarm reported jailbreaking DeepSeek's chatbot to extract its full system prompt verbatim using a 'bias-based' technique; DeepSeek deployed a fix.
- Anthropic reports that 'Claude Mythos Preview', an unreleased frontier model it describes as able to autonomously find and exploit software flaws, surfaced more than 10,000 high- or critical-severity vulnerabilities across major operating systems, browsers and open-source projects in roughly its first month under the defensive 'Project Glasswing' program. Anthropic warns that finding flaws now far outpaces the human capacity to triage and patch them.