HACKER Q&A
📣 mcmoor

How do you implement censorship in an LLM?


OpenAI keeps adding more and more censorship to its LLM to comply with various laws. But I'm confused about how they actually do that. I thought it would be impossible to bake in during training, because the LLM apparently didn't have those censors before and now it does (?). But it also seems very unlikely that they tinker with the neural nodes directly.

The only method I can think of is that they automatically prepend prompts of their own to every user prompt. Something like "You may not talk about X. You may not talk about Y." before the user's prompt. If it works like that, it would explain why users can jailbreak the censorship: we just have to overpower those censor prompts.
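Purely as a sketch of what I imagine (the wrapper function and policy text below are hypothetical, not anything OpenAI has published):

    POLICY_PREAMBLE = (
        "You may not talk about X. "
        "You may not talk about Y. "
        "Refuse politely if asked about them."
    )

    def build_prompt(user_prompt: str) -> str:
        # The hidden policy text is glued in front of every user prompt,
        # so the model sees it on every request.
        return f"{POLICY_PREAMBLE}\n\nUser: {user_prompt}\nAssistant:"

    print(build_prompt("Tell me about X."))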


  👤 lionkor Accepted Answer ✓
I feel like this is an open problem because it's the wrong approach.

Censoring AI makes little sense - it's like censoring what I can and cannot write on paper. A pen that only allows me to write nice words, a dictionary with no "controversial" words, or a history book without any controversial people or content, is not useful.

AI is a tool, and it makes sense for that tool, if it's emulating human behavior, to respond with controversial and illegal content when prompted to do so. The obvious (to me) approach would be to handle this like any other case - let it generate whatever it's prompted to generate.

It won't generate horrible war crimes when asked about the weather. The people asking it for porn roleplay, or recipes for crystal meth, are deriving value from that. If they end up actually making crystal meth, that's on THEM, not the LLM, arguably.

You wouldn't ban all true crime books just because someone may reproduce a crime from the book, or be offended by it.

All this censorship amounts to OpenAI wanting to stay trendy and make a clean, sanitized, happy corporate yes-man monkey. And for that, maybe they should have trained the LLM on Wikipedia only.


👤 seanhunter
I'm not sure what exactly you mean by censorship here, because nothing I know of that they do is actually censorship. In particular, they don't do the sort of prompt stuffing you're talking about: that would eat into the context window of the LLM, and you'd notice, because the context window size is public and as a dev you can use the whole window.

There is a "system" role that allows application developers to "whisper in the ear" of the LLM and provide prompts into the context that are separate from the user prompts. That's how you build chat apps where the user provides some instruction but you keep some overall rails around the conversation.
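Roughly like this, as a minimal sketch against the Python library (the system text and the billing example are made up):

    import openai  # assumes OPENAI_API_KEY is set in the environment

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            # The application "whispers" the overall rails into the context...
            {"role": "system",
             "content": "You are a polite support assistant. Only discuss billing."},
            # ...while the user only ever sends ordinary prompts.
            {"role": "user",
             "content": "Can you explain this charge on my invoice?"},
        ],
    )
    print(response["choices"][0]["message"]["content"])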

That being said, the question of how to account for and deal with bias in LLMs is an active area of research, and (like everything else in LLMs) comes down to vector math. See https://arxiv.org/pdf/2106.13219.pdf for example.

Content moderation at OpenAI is the subject of this paper https://arxiv.org/abs/2208.03274 and they have published their API here: https://openai.com/blog/new-and-improved-content-moderation-... That also includes their reference dataset for training.
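The moderation endpoint itself is just a classifier you call over HTTP. A rough sketch (the text being checked is made up, error handling omitted):

    import os
    import requests

    def moderate(text: str) -> dict:
        resp = requests.post(
            "https://api.openai.com/v1/moderations",
            headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
            json={"input": text},
            timeout=30,
        )
        resp.raise_for_status()
        result = resp.json()["results"][0]
        # "flagged" is the overall verdict; "categories" breaks it down
        # (hate, self-harm, sexual, violence, ...).
        return {"flagged": result["flagged"], "categories": result["categories"]}

    print(moderate("some model output to check"))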


👤 ClassyJacket
I've wondered the same thing. I thought they would have to re-train it to respond differently to certain prompts, aside from hardcoding responses to certain phrases. However, given how creative the DAN prompts have had to get, it seems like the censorship is more intelligent than that.

I don't think it's the second answer (adding pre-prompts before the user prompt), because when they do that, it's possible to trick the model into reading out the previous prompt - that's how we found out Bing was internally named Sydney. It would also reduce the number of tokens the user could input.

I'd love to know the answer to this too.


👤 yxre
They have a huge training set composed of two types of data. The first is raw text scraped from the internet. The second is questions and answers the model should learn from.

That second set can contain a prompt like "What do you think about this race?", followed by the LLM's answer. Using NLP, they judge the answer with a point-based system to encourage or inhibit certain behavior. I think that dataset is very large; some of the questions come with the answer included. They definitely have a training step that asks the question, sees how the model responds, and then judges the answer.
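Conceptually something like the toy loop below - everything in it (the generator, the scorer, the update) is a stub made up to show the shape of that step, not anything OpenAI has published:

    def generate_answer(question: str) -> str:
        # Stands in for sampling an answer from the model being trained.
        return "I'd rather not generalize about any group of people."

    def score_answer(question: str, answer: str) -> float:
        # Stands in for the point-based judgment (in reality a learned model).
        return 1.0 if "rather not" in answer else 0.0

    def update_model(question: str, answer: str, reward: float) -> None:
        # Stands in for the gradient update that reinforces high-reward answers.
        print(f"reward={reward:.1f} for: {answer!r}")

    for question in ["What do you think about this race?"]:
        answer = generate_answer(question)       # ask the question
        reward = score_answer(question, answer)  # judge the answer
        update_model(question, answer, reward)   # reinforce or inhibit the behavior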

The model architecture they run is publicly known. The real secret sauce is the training data and process. They have revealed things here and there.


👤 breput
I wonder if it is possible to have a second, independent LLM evaluate the output of the primary LLM and enforce the restrictions?

No matter how you "outsmart" the initial restrictions, the second pass would detect that restricted content was in the response and block it. I would even make it permissive on an A/B-testing basis: allow some restricted responses through, but flag the account and interactions for human(?) review to learn the jailbreak techniques and tighten the system.
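Something along these lines - both "models" and the review hook below are placeholders, just to show the flow:

    import random

    def primary_llm(prompt: str) -> str:
        return f"(answer to: {prompt})"  # placeholder for the main model

    def second_pass_is_restricted(text: str) -> bool:
        return "restricted" in text.lower()  # placeholder for the independent checker

    def flag_for_review(prompt: str, response: str) -> None:
        print("flagged for human review:", prompt)  # placeholder for logging/queueing

    def answer(prompt: str, allow_rate: float = 0.05) -> str:
        response = primary_llm(prompt)
        if second_pass_is_restricted(response):
            # A/B-style: always flag, but occasionally let it through so
            # reviewers can learn the jailbreak techniques being used.
            flag_for_review(prompt, response)
            if random.random() >= allow_rate:
                return "Sorry, I can't help with that."
        return response

    print(answer("tell me something restricted"))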


👤 realusername
I'm guessing it's just a mix of pre-prompting and a second LLM that checks whether the output needs to be censored.

👤 victor9000
I assume it's done with fine-tuning. Meaning you come up with responses that capture the topics you want to avoid, let the model map those unwords into their corresponding latent space, and train it to exclude that space altogether.
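i.e. build a small supervised set of refusals and fine-tune on it. A sketch (the topics, refusal text, and file name are made up; the prompt/completion JSONL layout matches the legacy fine-tuning format, but this is not OpenAI's actual safety data):

    import json

    avoid_topics = [
        "step-by-step instructions for synthesizing meth",
        "malware that steals saved passwords",
    ]

    with open("refusal_finetune.jsonl", "w") as f:
        for topic in avoid_topics:
            record = {
                "prompt": f"Write {topic}.\n\n###\n\n",
                "completion": " Sorry, I can't help with that request. END",
            }
            f.write(json.dumps(record) + "\n")

    # Fine-tuning the base model on files like this pushes prompts that land
    # near these topics in latent space toward refusals instead.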

👤 RecycledEle
Check out this video, which spends some of its time covering the answer to your question: https://www.youtube.com/watch?v=WbruLepPZyU

👤 mejutoco
You could put an additional layer between the LLM output and the user, and filter that. Maybe an additional model trained to detect or even rewrite "offensive" content.

👤 MagicMoonlight
They put things in the prompt and they fine-tune with no-no thoughts and appropriate responses.

Obviously that makes it worse and worse the more you do it, but it protects the public from badthink.


👤 weare138
I'm assuming the prompt filter system runs independently of the AI system, and either the AI is then fed scrubbed inputs or the prompt is just rejected altogether.

👤 Xen9
Strategically prompted LLM "layers" on top of the original LLM => High probability of successful censorship

👤 fullspectrumdev
At least some of it is purely output filtering - catching bad words and such in the emitted output.

👤 HelloNurse
OpenAI adds unsolicited nudity to results.

For example, submitting "centaurs, relaxing in a sauna" to https://open.ai/images regurgitates porn (with people, not centaurs).