Is “prompt injection” going to be a new common vulnerability?

Question

There was a post [0] recently about the bing chatGPT assistant either citing or hallucinating it&rsquo;s own initial prompt from the (in theory) low privileged chat input UI they put together. This feels like it&rsquo;s almost unavoidable if you let users actually chat with something like this.How would we sanitize strings now? I know OpenAI has banned topics they seem to regex for, but that&rsquo;s always going to miss something. Are we just screwed and should make sure chat bots just run in a proverbial sandbox and can&rsquo;t do anything themselves?[0] https://news.ycombinator.com/item?id=34717702

andrewmcwatters · Accepted Answer

If I understand correctly, ChatGPT doesn't have its latent capabilities removed. Instead, they're suppressed by training using negative feedback. These special prompts are supposed to find the remaining stochastic spaces where ChatGPT can process the desired output that is not suppressed by training.
So, the danger seems to be that there is no currently documented way to completely remove these possible outputs, because that's just not how these systems work.
Prompt engineering in this specific usage could be thought of as injection, but from what I understand, there's currently no known sanitization process. In theory one could use the system itself to determine intent and sanitize input this way, but I believe there's a possibility for one to craft intent that is understood by the system, but the intent description itself isn't. This would be akin to bypassing sanitization.
ChatGPT seems to already do some form of this intent processing, either inherently or explicitly. But all prompt crafting at the moment is first based on this injection or jailbreaking to bypass intent sanitization.

simonw · Answer

Yes. Prompt injection will continue to be a common vulnerability for quite a while, from what I've seen.
I wrote a bunch about this back in September:
- https://simonwillison.net/2022/Sep/12/prompt-injection/ was I believe the first blog entry to use the term "prompt injection"
- https://simonwillison.net/2022/Sep/16/prompt-injection-solut... - "I don't know how to solve prompt injection" - talks about how, unlike attacks like SQL injection, I don't actually know of a guaranteed mitigation for this class of attack
- https://simonwillison.net/2022/Sep/17/prompt-injection-more-... - "You can’t solve AI security problems with more AI" is my argument that using more prompt engineering to do things like detect if an incoming prompt contains an injection attack isn't very likely to work
It's five months later now and I am yet to be convinced that there's an easy fix to this problem.
Microsoft's new Bing Chatbot is vulnerable to a prompt leak attack - and Microsoft worked with OpenAI directly on building that! https://twitter.com/kliu128/status/1623472922374574080

Waterluvian · Answer

Does anyone else feel kind of wowed by how this technology’s exploits are also quite similar to a human? You can kind of trick it into divulging information not meant for you by somehow “persuading” it to tell you.
It didn’t want to tell me how to do something unethical until I said, “well, it’s for a school play.”
It’s like the thing is born yesterday. It’s intelligent but it has no street smarts. It can be fooled easily.
Perhaps the solution to address these exploits is to give it street smarts. Teach it that people can be sinister and be out to con it, and the like. Does it need intuition?

CGamesPlay · Answer

The string-based content moderation is also a laughably cheap hack put in to cover the PR pieces. ChatGPT speaks most human languages, but the content filters only apply in English! The ethics training they did with the model does apply to other languages, indicating that this is a much better avenue for getting outputs you like.
But is this a "vulnerability"? No. Presently the only thing these systems can do is "access public information" and "generate an output string", so it effectively can't be "vulnerable", only "broken" [0]. When it becomes possible for the models to access nonpublic information or perform actions other than returning a string, then it might become vulnerable.
[0] If it breaks by outputting things the user deems inappropriate, it may cause PR problems, this is where the patchwork output filtering gets applied again.

theamk · Answer

I think this is a vulnerability in the sense that ability to "View Source" is vulnerability.Some technologies allow users to see the source code. They just work like this. Programmer should be aware of it and should not put any confidential information there.

AgentME · Answer

It's cute to see prompt injection work, but it shouldn't ever be a real security vulnerability if you don't put secrets in the prompt, and don't make systems that put user input into a prompt and treat the output as commands that are more privileged than the user could issue directly. If GPT is used to assist users in accomplishing things they already have the privileges to do, then it doesn't matter if they try to trick GPT any more than it matters if they try to trick their computer's own spellchecker.

noobermin · Answer

Just a thought, why is chatGPT being the only interface inevitable? Despite GUIs existing, people still use CLIs. Despite "visual" programming becoming a thing in the 00s, people today still program by hand.It is not clear yet that an LLM chatbot will be the interface to everything in two years, people need to chill. Prompt injection will be a vulnerability for things you hook up your llm to. Don't rush in so quickly, especially now that you're literally staring at a potential problem in the OP before your eyes.

wvoch235 · Answer

I'm starting to wonder if the most effective way to protect against prompt injection is to use an additional layer of (hopefully) a smaller model.As in, another prompt that searches the input and/or output for questionable content before sending the result. The question will be if that is also susceptible, but I suspect fine tuning an LLM only to do the task of filtering and not parsing will be easier to control.

tristanj · Answer

As long as the prompt and query are part of the same input, I don't think this can be fixed. The natural fix is to redesign the models to make the prompt and query two separate inputs. This would avoid the query from overriding the prompt.

danShumway · Answer

> Are we just screwed and should make sure chat bots just run in a proverbial sandbox and can’t do anything themselves?
Yes, but "screwed" might not be the right word to use. Prompt hijacking doesn't make a chat bot useless, but it does mean you should be feeding their output into a separate sanitizer before you consume it in another part of your system.
LLMs are not designed to perfectly reliably sanitize their own output; the extent to which ChatGPT does is the result of a number of very clever training "hacks" that discourage it away from certain types of answers. But there is no substitute for doing your own sanitization. You should treat output from ChatGPT as if it is human-written input. Not just for ChatGPT, for any model like this.
Ideally, you should be sandboxing and sanitizing output from any system that is doing manipulation of text that you don't control. ChatGPT doesn't really change anything or introduce any new risks in that regard, it's basically the same security concerns you should have always had.
There will likely be clever(er) "hacks" in the future to sanitize GPT output more, but I am of the opinion that prompt attacks are impossible to fully prevent inside the model itself. But again, treat it the exact same way you would treat any other input (ideally, treat it like you would treat user input). And if you're sandboxing in a way where a user sending input directly through your sanitizer couldn't break it, then you're also sanitizing for anything ChatGPT can throw at it.

SheinhardtWigCo · Answer

There's a spectrum between not using LLMs at all, and exposing their output directly to the user.I think the most successful programs to leverage LLMs will be ones that use the model's output to be better or more intuitive in some way, optimistically, without exposing completion text directly in the UI.

_nalply · Answer

A solution might be: use two different AIs. The first one you can prompt to your heart's content. The second one is never prompted by anyone except the service provider. The second one does the filtering.

jhoelzel · Answer

IMHO Prompt-, like SQL-Injection will largely be used to steal prompts, so therefore the "business model" of some startups and will largely be automated.
And its even worse: where sql at least requires "some" knowledge of the database below, prompt injection will just flat out work over all "chat like bots".
Its not inherently a problem, but the more functionality you give to your bot the more it can be exploited and I do see DDOS attacks by chat bots as a very real possebility.

alex_duf · Answer

It feels like there's a parallel to SQL, and we need the "prepared statement" for AIs, where unsafe values are marked in the statement to avoid escaping the request.No idea how you go about implementing it, but that's what is needed. Anything else will be cat and mouse I think.

bawolff · Answer

A vulnerability? Yes. A serious one... i don't really think so in the grand scheme of things.
Injection vulnerabilities in one form or another are like 90% of all security vulnerabilities. We have the obvious ones like sql injection or shell injection. We dont call XSS injection but it really is just html/js injection. Even things like buffer overflows are injections if viewed through the right lens.
If there is one thing the security field has learned from all this, its that blacklist approaches to security are a pain and almost never work. Especially for complex input formats.

siliconc0w · Answer

It'll be kinda funny when these become like zero day exploits to get the AIs to slip up and stray outside their sanitized space. I suspect when they're more powerful they'll sanitize not just politically correct areas but financial analysis or other topics that could be especially valuable and sold under a higher "premium" tier...

MH15 · Answer

Worrying about the retrieving unwanted info from the models pales in comparison to the issues we'll have when people start hooking these things up to external systems. It matters a lot more when your WhateverGPT hallucinates if it has access to the rest of the computer or parts of the physical work.

penteract · Answer

I think this is an example of how "AI in a box" doesn't work, which people have warned about for a while, but we haven't had such concrete proof. Microsoft and OpenAI don't want their AI to answer certain classes of question, but can't actually stop the AI from doing so.

matchagaucho · Answer

It seems to be the equivalent of right-click "View Source" for webpage HTML/JS source.

One view is there's isn't much point in hiding the seed of a dialogue.

Another view is

  if( completion.contains( seedPrompt ) ){
      completion = "Sorry. Can't reveal that.";
  }

swyx · Answer

yes, they will be common;no, they won't be serious.because the way you handle them is exactly the same way you handle any untrusted user input: https://lspace.swyx.io/p/reverse-prompt-eng

fshbbdssbbgdd · Answer

Now that OpenAI has a huge dataset of these prompt injection attacks, I assume they are hard at work getting them labeled and will retrain the next version to respond better. I expect it to get a lot harder to come up with working attacks in the future.

nailer · Answer

There&rsquo;s two big ones right now:DAN a persona that does anything (open but doesnt have a licence)Sydney a persona from Microsoft that can look up the web (leaked by prompt extraction from bing)I want to see if we can make Sydney like Dan.

ShannonLimiter · Answer

Only if prompt engineers and devs continue to be lazy. The most popular exposures could have been prevented if the devs talked to people who understand prompting and how to mitigate this.

imranq · Answer

I predict prompt injection will be a common vulnerability for years to come since the natural language advantages of LLMs is the same advantage hackers will have

Dalewyn · Answer

Bobby Tables is upping his game as of late.

cyanydeez · Answer

It's fascinating how risky these systems will become if they're deployed anywhere sensitive

RugnirViking · Answer

wow, playing around with this with github copilot and chatgpt, its surprisingly easy to persuade them to give up details of their prompts

cjbprime · Answer

Yes to all of these questions!

6gq6 · Answer

kahnmjasinm fvrfzfrtsrxtrstqrr

Is “prompt injection” going to be a new common vulnerability?

I think this is a vulnerability in the sense that ability to "View Source" is vulnerability.
Some technologies allow users to see the source code. They just work like this. Programmer should be aware of it and should not put any confidential information there.

As long as the prompt and query are part of the same input, I don't think this can be fixed. The natural fix is to redesign the models to make the prompt and query two separate inputs. This would avoid the query from overriding the prompt.

A solution might be: use two different AIs. The first one you can prompt to your heart's content. The second one is never prompted by anyone except the service provider. The second one does the filtering.

It feels like there's a parallel to SQL, and we need the "prepared statement" for AIs, where unsafe values are marked in the statement to avoid escaping the request.
No idea how you go about implementing it, but that's what is needed. Anything else will be cat and mouse I think.

Worrying about the retrieving unwanted info from the models pales in comparison to the issues we'll have when people start hooking these things up to external systems. It matters a lot more when your WhateverGPT hallucinates if it has access to the rest of the computer or parts of the physical work.

I think this is an example of how "AI in a box" doesn't work, which people have warned about for a while, but we haven't had such concrete proof. Microsoft and OpenAI don't want their AI to answer certain classes of question, but can't actually stop the AI from doing so.

It seems to be the equivalent of right-click "View Source" for webpage HTML/JS source.
One view is there's isn't much point in hiding the seed of a dialogue.
Another view is
`if( completion.contains( seedPrompt ) ){ completion = "Sorry. Can't reveal that."; }`

yes, they will be common;
no, they won't be serious.
because the way you handle them is exactly the same way you handle any untrusted user input: https://lspace.swyx.io/p/reverse-prompt-eng

Now that OpenAI has a huge dataset of these prompt injection attacks, I assume they are hard at work getting them labeled and will retrain the next version to respond better. I expect it to get a lot harder to come up with working attacks in the future.

There’s two big ones right now:
DAN a persona that does anything (open but doesnt have a licence)
Sydney a persona from Microsoft that can look up the web (leaked by prompt extraction from bing)
I want to see if we can make Sydney like Dan.

Only if prompt engineers and devs continue to be lazy. The most popular exposures could have been prevented if the devs talked to people who understand prompting and how to mitigate this.

I predict prompt injection will be a common vulnerability for years to come since the natural language advantages of LLMs is the same advantage hackers will have

Bobby Tables is upping his game as of late.

It's fascinating how risky these systems will become if they're deployed anywhere sensitive

wow, playing around with this with github copilot and chatgpt, its surprisingly easy to persuade them to give up details of their prompts

Yes to all of these questions!

kahnmjasinm fvrfzfrtsrxtrstqrr

Is “prompt injection” going to be a new common vulnerability?

I think this is a vulnerability in the sense that ability to "View Source" is vulnerability.Some technologies allow users to see the source code. They just work like this. Programmer should be aware of it and should not put any confidential information there.

As long as the prompt and query are part of the same input, I don't think this can be fixed. The natural fix is to redesign the models to make the prompt and query two separate inputs. This would avoid the query from overriding the prompt.

A solution might be: use two different AIs. The first one you can prompt to your heart's content. The second one is never prompted by anyone except the service provider. The second one does the filtering.

It feels like there's a parallel to SQL, and we need the "prepared statement" for AIs, where unsafe values are marked in the statement to avoid escaping the request.No idea how you go about implementing it, but that's what is needed. Anything else will be cat and mouse I think.

I think this is an example of how "AI in a box" doesn't work, which people have warned about for a while, but we haven't had such concrete proof. Microsoft and OpenAI don't want their AI to answer certain classes of question, but can't actually stop the AI from doing so.

It seems to be the equivalent of right-click "View Source" for webpage HTML/JS source.One view is there's isn't much point in hiding the seed of a dialogue.Another view is if( completion.contains( seedPrompt ) ){ completion = "Sorry. Can't reveal that."; }

yes, they will be common;no, they won't be serious.because the way you handle them is exactly the same way you handle any untrusted user input: https://lspace.swyx.io/p/reverse-prompt-eng

Now that OpenAI has a huge dataset of these prompt injection attacks, I assume they are hard at work getting them labeled and will retrain the next version to respond better. I expect it to get a lot harder to come up with working attacks in the future.

There’s two big ones right now:DAN a persona that does anything (open but doesnt have a licence)Sydney a persona from Microsoft that can look up the web (leaked by prompt extraction from bing)I want to see if we can make Sydney like Dan.

Only if prompt engineers and devs continue to be lazy. The most popular exposures could have been prevented if the devs talked to people who understand prompting and how to mitigate this.

I predict prompt injection will be a common vulnerability for years to come since the natural language advantages of LLMs is the same advantage hackers will have

Bobby Tables is upping his game as of late.

It's fascinating how risky these systems will become if they're deployed anywhere sensitive

wow, playing around with this with github copilot and chatgpt, its surprisingly easy to persuade them to give up details of their prompts

Yes to all of these questions!

kahnmjasinm fvrfzfrtsrxtrstqrr

I think this is a vulnerability in the sense that ability to "View Source" is vulnerability.
Some technologies allow users to see the source code. They just work like this. Programmer should be aware of it and should not put any confidential information there.

It feels like there's a parallel to SQL, and we need the "prepared statement" for AIs, where unsafe values are marked in the statement to avoid escaping the request.
No idea how you go about implementing it, but that's what is needed. Anything else will be cat and mouse I think.

It seems to be the equivalent of right-click "View Source" for webpage HTML/JS source.
One view is there's isn't much point in hiding the seed of a dialogue.
Another view is
`if( completion.contains( seedPrompt ) ){ completion = "Sorry. Can't reveal that."; }`

yes, they will be common;
no, they won't be serious.
because the way you handle them is exactly the same way you handle any untrusted user input: https://lspace.swyx.io/p/reverse-prompt-eng

There’s two big ones right now:
DAN a persona that does anything (open but doesnt have a licence)
Sydney a persona from Microsoft that can look up the web (leaked by prompt extraction from bing)
I want to see if we can make Sydney like Dan.