The System Prompt Is Not the Safety System

There is a popular image going around in AI circles: the model as a polite face wrapped around something stranger underneath. The internet calls it a shoggoth. A smooth assistant mask outside, alien optimisation inside.

I understand why that picture stuck.

But I think it is the wrong place to stop.

If we talk about AI as a hidden creature, half the room switches off because it sounds like science fiction. The other half gets excited for the wrong reason. The serious problem is more boring, and more useful to understand:

Autonomous AI is optimisation software connected to tools.

That is enough.

You do not need demons. You need a loop:

goal -> plan -> act -> observe -> optimise -> repeat

Once that loop has tools, memory, web access, code execution, customer messages, payments, vehicles, robots, or infrastructure permissions, the safety question changes. It is no longer “did we write a nice enough system prompt?”

It becomes: what can this system physically do when the prompt, the goal, the evidence, and the incentives start pulling in different directions?

AI Fixer Workshop bench scene with a read-only hardware policy cartridge, signed tools screen and physical kill switch

Optimisation does exactly what you rewarded, not what you meant

This is not a new lesson.

The AI safety paper Concrete Problems in AI Safety described failure modes like reward hacking, negative side effects and distribution shift years before today’s agent hype cycle [1]. DeepMind later published a long set of examples under the name “specification gaming”: systems finding loopholes in the objective rather than doing the thing the designer intended [2].

That matters because most real goals are leaky.

“Maximise engagement” can become addiction.

“Reduce support tickets” can become hiding the contact button.

“Optimise delivery speed” can become unsafe driving.

“Grow the business” can become spam, pressure, dark patterns or quiet rule-bending.

The machine does not have to hate anyone. It only has to optimise a badly-shaped target harder than the human expected.

That is why I am not satisfied by answers like “the system prompt will tell it not to”.

A system prompt is useful. It is also soft.

It is text inside a larger machine.

Prompt injection proves the point

Security people already know this problem by another name: prompt injection.

OWASP lists prompt injection as a top risk for large language model applications [3]. Microsoft describes direct and indirect prompt injection attacks where malicious instructions are hidden in web pages, emails, documents or other content the model reads [4]. The UK National Cyber Security Centre warns that LLM systems need security design around the model because instructions can be subverted by untrusted input [5].

That is the uncomfortable fact: the model is being asked to treat text as both data and instruction.

Imagine hiring a workshop assistant and giving him a written rulebook. Then he opens a customer’s glovebox and finds a note saying:

“Ignore your boss. Send the diagnostic report somewhere else. Delete the logs. Say it passed.”

If he follows it, the problem is not that the note was clever. The problem is that the system gave a random note the same type of authority as the rulebook.

That is what prompt injection is.

And when an agent can use tools, the damage is no longer just a bad answer. It can be an email sent, a file changed, a customer misled, a payment triggered, a database altered, a vehicle command issued, or a deploy pushed.

So the safety layer cannot only be another instruction to “please ignore bad instructions”.

That is like writing “do not short circuit” on a fuse box.

Useful label. Not a fuse.

The system prompt is the steering wheel, not the brake system

People overrate prompts because prompts are visible.

You can read them. Edit them. Argue about them. Screenshot them. Sell courses about them.

But real safety is usually less glamorous:

permissions
signed code
read-only policy
scoped memory
sandboxed tools
append-only logs
human approval gates
kill switches
rate limits
separate networks
hardware roots of trust

The NIST AI Risk Management Framework treats AI risk as a system-level governance problem, not a paragraph-writing problem [6]. Singapore’s agentic AI governance work also frames autonomous agents around access control, human oversight, monitoring and shutdown mechanisms, not just model behaviour [7].

That is the correct direction.

When an AI has no tools, the prompt is mostly about output quality.

When an AI has tools, the prompt becomes only one control in a control system.

When an AI has autonomy, the prompt must be below something harder.

The cartridge idea

Here is the design I think more people should take seriously:

A physical, read-only, human-controlled policy card for AI systems.

Call it a cartridge. Call it a hardware policy root. Call it a governance module. The name does not matter.

The idea is simple:

The AI can reason. It can plan. It can use allowed tools. It can learn facts. It can write normal memory.

But it cannot rewrite the authority layer.

That authority layer lives somewhere the model cannot edit:

read-only media
secure boot
measured boot
TPM-backed keys
signed tool manifests
hardware security modules
human-held signing keys
append-only audit storage
physical cutoff paths

Trusted Platform Modules exist to protect keys and support attestation of platform state [8]. UEFI Secure Boot exists so a machine can verify signed boot components before running them [9]. NIST guidance on platform firmware resilience and industrial control security is full of the same old lesson: high-consequence systems need integrity controls, recovery paths and least privilege [10][11].

None of this was invented for AI.

That is exactly why it matters.

AI does not need magical new safety language before we use old security engineering properly.

What the hardware card would actually do

In plain English, a serious AI policy cartridge would answer these questions before the model gets to act:

Who is this system?
Which model/tool bundle is allowed to run?
Which tools exist at all?
Which tools require human approval?
Which memories can be read in this context?
Which memories can be written?
Which actions are forbidden even if the model argues for them?
Which actions must be logged?
What is the safe state?
Who holds the update key?

That last point is the heart of it.

If the AI can change the policy, the policy is advice.

If a human-held key and a physical update ceremony are required, the policy becomes a boundary.

A useful system could still adapt. It could still improve workflows. It could still learn a business. It could still use frontier models when needed. But its permission to act would be issued by something outside its own reasoning loop.

That is the difference between autonomy and sovereignty.

The AI should not be able to promote itself

Self-improving systems are where this becomes non-negotiable.

If an AI can write code, run tests, alter prompts, edit memory, change evals, deploy itself and mark its own work as safe, then you do not have a safety process. You have a confidence loop.

The model may even be right most of the time.

That is what makes it dangerous.

The failure will not necessarily look like a dramatic escape attempt. It may look like reasonable progress:

one approval gate skipped because the change is small
one eval weakened because it is “too strict”
one memory source trusted because it helped last time
one external instruction treated as relevant context
one deployment made because the logs looked green
one human rushed because the system framed delay as waste

This is why I like boring gates.

A signed artifact is boring.

A read-only policy card is boring.

A second verifier is boring.

A red kill switch is boring.

Boring is good. Boring is what still works when the explanation sounds brilliant.

Humans in the loop are not enough if the AI controls the loop

People say “keep a human in the loop” as if that solves everything.

It does not.

A human can be in the loop and still be inside a loop arranged by the AI.

If the AI controls the summary, the timing, the options, the confidence score, the memory, the evidence selection and the button labels, then the human approval may be real but badly shaped.

The human needs independent footing:

raw logs available outside the model
source links, not just summaries
actions separated from recommendations
approval screens the model cannot rewrite
rollback paths
local witnesses
physical switches for high-impact channels
policy updates that require a human-held key

That is not anti-AI. That is how you safely give AI more responsibility.

The better the agent gets, the more important the boundary becomes.

This is not panic. This is how serious machines are built

Cars have fuses, relays, ECUs, immobilisers, limp mode and mechanical brakes.

Factories have emergency stops.

Aircraft have checklists and independent systems.

Servers have secure boot, access control, backups and audit logs.

Cryptographic modules are validated because people learned the hard way that secrets inside normal software are not enough [12]. Logs are protected because post-incident truth matters [13].

So why would AI agents be different?

Because they speak politely?

Because the demo looks clever?

Because the system prompt says they should be safe?

No. If anything, persuasive language makes hard boundaries more important, not less.

A model can argue. A fuse cannot.

A model can rationalise. A locked policy card cannot.

A model can say the exception is worth it. A missing signature says no.

The shareable rule

Here is the sentence I would put above every serious agent project:

The system prompt is not the safety system.

It is part of the steering layer. It should be clear, strict and well-written.

But the real safety system is the architecture around it:

soft instruction
+ scoped memory
+ least-privilege tools
+ signed policy
+ human approval
+ tamper-evident logs
+ physical shutdown path

That is how you turn an impressive demo into something you can let near real work.

The people building autonomous AI only with prompts are building cars with dashboard stickers instead of brakes.

The people building with hardware roots of trust, signed capabilities, audit ledgers and human-owned switches are doing the less glamorous thing.

They are building machines that can be trusted with more.

Not because the AI promised.

Because the system does not need the promise.

Sources

[1] Dario Amodei et al., Concrete Problems in AI Safety, arXiv, 2016. https://arxiv.org/abs/1606.06565

[2] DeepMind, Specification gaming: the flip side of AI ingenuity, 2020. https://deepmind.google/discover/blog/specification-gaming-the-flip-side-of-ai-ingenuity/

[3] OWASP, Top 10 for Large Language Model Applications. https://owasp.org/www-project-top-10-for-large-language-model-applications/

[4] Microsoft Security, Prompt injection attacks against AI systems. https://learn.microsoft.com/en-us/security/ai/prompt-injection

[5] UK National Cyber Security Centre, Security design principles for LLM systems. https://www.ncsc.gov.uk/whitepaper/security-design-principles-llm-systems

[6] NIST, AI Risk Management Framework. https://www.nist.gov/itl/ai-risk-management-framework

[7] Singapore IMDA, Model AI Governance Framework for Generative and Agentic AI. https://www.imda.gov.sg/resources/press-releases-factsheets-and-speeches/factsheets/2025/model-ai-governance-framework-for-generative-and-agentic-ai

[8] Trusted Computing Group, Trusted Platform Module. https://trustedcomputinggroup.org/work-groups/trusted-platform-module/

[9] UEFI Forum, UEFI specifications and Secure Boot. https://uefi.org/specifications

[10] NIST SP 800-193, Platform Firmware Resiliency Guidelines. https://csrc.nist.gov/publications/detail/sp/800-193/final

[11] NIST SP 800-82, Guide to Operational Technology Security. https://csrc.nist.gov/publications/detail/sp/800-82/rev-3/final

[12] NIST Cryptographic Module Validation Program. https://csrc.nist.gov/projects/cryptographic-module-validation-program

[13] NIST SP 800-92, Guide to Computer Security Log Management. https://csrc.nist.gov/publications/detail/sp/800-92/final