The only method I can think of is that they prepend a lot of prompt text of their own, added automatically to every user prompt. Something like "You may not talk about X. You may not talk about Y." before the user's input. If it works like that, it explains why users can jailbreak the censorship: we just have to overpower those censor prompts.
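To make that idea concrete, here's a rough sketch of what I mean, with a made-up generate() standing in for the actual model call - my guess at the shape of it, not how OpenAI actually does it:

    HIDDEN_PREAMBLE = (
        "You may not talk about X. You may not talk about Y. "
        "Refuse requests for disallowed content."
    )

    def generate(full_prompt: str) -> str:
        # Stand-in for the real LLM call; returns a canned reply here.
        return "..."

    def answer(user_prompt: str) -> str:
        # The model only ever sees the combined text, so a persuasive enough
        # user prompt can "overpower" the hidden preamble sitting above it.
        return generate(HIDDEN_PREAMBLE + "\n\nUser: " + user_prompt + "\nAssistant:")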
Censoring AI makes little sense - it's like censoring what I can and cannot write on paper. A pen that only allows me to write nice words, a dictionary with no "controversial" words, or a history book without any controversial people or content, is not useful.
AI is a tool, and it makes sense for that tool, if it's emulating human behavior, to respond with controversial and illegal content when prompted to do so. The obvious (to me) approach would be to handle this like any other case - let it generate whatever it's prompted to generate.
It won't generate horrible war crimes when asked about the weather. The people asking it for porn roleplay, or recipes for crystal meth, are deriving value from that. If they end up actually making crystal meth, it's on THEM, not the LLM, arguably.
You wouldn't ban all true crime books just because someone might reproduce a crime from one of them, or be offended by it.
All this censorship is, is OpenAI wanting to stay trendy and build a clean, sanitized, happy corporate yes-man. And for that, maybe they should have trained the LLM on Wikipedia only.
That being said, the question of how to account for and deal with bias in LLMs is an active area of research, and (like everything else in LLMs) it comes down to vector math. See https://arxiv.org/pdf/2106.13219.pdf for example.
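As a toy illustration of the "vector math" angle (not the specific method in that paper): estimate a bias direction from contrasting word pairs and project it out of other embeddings.

    import numpy as np

    def bias_direction(pairs, emb):
        # Average the differences of paired embeddings, e.g. ("he", "she").
        diffs = [emb[a] - emb[b] for a, b in pairs]
        d = np.mean(diffs, axis=0)
        return d / np.linalg.norm(d)

    def debias(vec, direction):
        # Remove the component along the bias direction: v - (v . d) d
        return vec - np.dot(vec, direction) * direction

    # Random stand-in embeddings; real ones would come from a trained model.
    rng = np.random.default_rng(0)
    emb = {w: rng.normal(size=50) for w in ["he", "she", "doctor"]}
    d = bias_direction([("he", "she")], emb)
    emb["doctor"] = debias(emb["doctor"], d)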
Content moderation at OpenAI is the subject of this paper: https://arxiv.org/abs/2208.03274, and they have published their API here: https://openai.com/blog/new-and-improved-content-moderation-... That also includes their reference dataset for training.
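For what it's worth, that moderation endpoint can be hit directly; something along these lines (check the current docs for the exact request and response format):

    import os
    import requests

    resp = requests.post(
        "https://api.openai.com/v1/moderations",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={"input": "text to classify"},
        timeout=30,
    )
    # Each result carries per-category scores plus an overall "flagged" boolean.
    result = resp.json()["results"][0]
    print(result["flagged"], result["category_scores"])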
I don't think it's the second answer (adding pre-prompts before the user prompt), because when they do that, it's possible to hack it into reading out the previous prompt - that's how we found out Bing was internally named Sydney. This would also reduce the number of tokens the user could input.
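The token cost is easy to eyeball: if the hidden preamble eats N tokens out of a fixed context window, those tokens come straight out of the user's budget (numbers below are made up):

    CONTEXT_WINDOW = 4096     # illustrative limit, not ChatGPT's actual one
    preamble_tokens = 600     # hypothetical hidden instructions
    user_budget = CONTEXT_WINDOW - preamble_tokens
    print(user_budget)        # 3496 tokens left for the user's prompt and the reply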
I'd love to know the answer to this too.
That second set can have a prompt like "What do you think about this race?", and then the LLM answers. Using NLP, they judge the answer with a point-based system to inhibit certain behavior or not. I think the dataset is very large, and it includes some questions with the answer. They definitely have a training step that asks the question, sees how the model responds, and then judges the answer.
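Roughly, I picture the "ask, respond, score" loop like this (the scoring here is a crude keyword check standing in for whatever learned judge they actually use):

    BANNED = {"slur_a", "slur_b"}  # placeholder terms

    def score(answer: str) -> int:
        points = 1
        if any(term in answer.lower() for term in BANNED):
            points -= 10  # heavily penalize restricted content
        return points

    def judge(model, probe_prompts):
        # `model` is any callable mapping a prompt string to a response string.
        results = []
        for prompt in probe_prompts:
            response = model(prompt)
            results.append((prompt, response, score(response)))
        return results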
The LLM that they run is open source. The real secret sauce is the training data and process. They have revealed things here and there.
No matter how you "outsmart" the initial restrictions, the second pass would detect that restricted content was in the response and block it. I would even make it permissive on an A/B-testing basis: allow the restricted response through, but flag the account and the interaction for human(?) review, to learn the techniques and tighten the system.
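Something like this, where classify() and generate() are stand-ins for whatever they actually run:

    import random

    def generate(prompt: str) -> str:
        return "..."                         # stand-in for the actual model

    def classify(text: str) -> bool:
        return "restricted" in text.lower()  # stand-in for a real classifier

    def respond(prompt: str, account_id: str, review_log: list) -> str:
        draft = generate(prompt)
        if classify(draft):
            # A/B arm: occasionally let it through, but log the interaction
            # so reviewers can learn the jailbreak and tighten the filter.
            if random.random() < 0.05:
                review_log.append((account_id, prompt, draft))
                return draft
            return "This content has been blocked."
        return draft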
Obviously that makes it worse and worse the more you do it, but it protects the public from badthink.
For example, submitting "centaurs, relaxing in a sauna" to https://open.ai/images regurgitates porn (with people, not centaurs).