Is it even possible to truly secure a system prompt?

FUNCTION12

•7mo ago

The standard advice you get from AIs on this topic is to double down on "strong Instructional Defense"—essentially, just write more forceful prohibitions in the system prompt.
This is the approach many teams are taking.
If you ask for a better method, the AI's answer is simply to make the prohibitions even stricter.

The problem is, this doesn't actually seem to work in the real world.

My hypothesis is this:
telling an AI "Do not do X" paradoxically makes the model fixate on X. It forces the model to linger on the forbidden subject, which ironically makes it more susceptible to being 'gaslit' and manipulated into violating its core instructions.

So, I believe that instead of negative reinforcement ("don't do this"), we need to focus on positive reinforcement ("do this instead").

What is a genuinely effective way to prevent system prompt hijacking?

19 views

Replies

Best

FUNCTION12

The instruction "Don't think of a deer" makes you think of a deer.
I believe this principle, famously known as the "white bear problem," is exactly why prohibitive commands for AI often fail.

Report

7mo ago