HACKER Q&A
📣 BoppreH

GPT-3 reveals my full name – can I do anything?


Alternatively: What's the current status of Personally Identifying Information and language models?

I try to hide my real name whenever possible, out of an abundance of caution. You can still find it if you search carefully, but in today's hostile internet I see this kind of soft pseudonymity as my digital personal space, and expect to have it respected.

When playing around in GPT-3 I tried making sentences with my username. Imagine my surprise when it spat out my (globally unique, unusual) full name!

Looking around, I found a paper that says language models spitting out personal information is a problem[1], a Google blog post that says there's not much that can be done[2], and an article that says OpenAI might automatically replace phone numbers in the future but other types of PII are harder to remove[3]. But nothing on what is actually being done.

If I had found my personal information in Google search results, or on Facebook, I could ask for the information to be removed, but GPT-3 seems to have no such support. Are we supposed to accept that large language models may reveal private information, with no recourse?

I don't care much about my name being public, but I don't know what else it might have memorized (political affiliations? Sexual preferences? Posts from 13-year-old me?). In the age of GDPR this feels like an enormous regression in privacy.

EDIT: a small thank you to everybody commenting so far for not directly linking to specific results or actually writing out my name, however easy it might be.

If my request for pseudonymity sounds strange given my lax infosec:

- I'm more worried about the consequences of language models in general than my own case, and

- people have done a lot more for a lot less name information[4].

[1]: https://arxiv.org/abs/2012.07805

[2]: https://ai.googleblog.com/2020/12/privacy-considerations-in-...

[3]: https://www.theregister.com/2021/03/18/openai_gpt3_data/

[4]: https://en.wikipedia.org/wiki/Slate_Star_Codex#New_York_Time...


  👤 jmillikin Accepted Answer ✓

  > I try to hide my real name whenever possible, out of an
  > abundance of caution. You can still find it if you search
  > carefully, but in today's hostile internet I see this kind
  > of soft pseudonymity as my digital personal space, and expect
  > to have it respected.
Without judging whether the goal is good or not, I will gently point out that your current approach doesn't seem to be effective. A Google search for "BoppreH" turned up several results on the first page with what appears to be your full name, along with other results linking to various emails that have been associated with that name. Results include Github commits, mailing list archives, and third-party code that cited your Github account as "work by $NAME".

As a purely practical matter -- again, not going into whether this is how things should be, merely how they do be -- it is futile to want the internet as a whole to have a concept of privacy, or to respect the concept of a "digital personal space". If your phone number or other PII has ever been associated with your identity, that association will be in place indefinitely and is probably available on multiple data broker sites.

The best way to be anonymous on the internet is to be anonymous, which means posting without any name or identifier at all. If that isn't practical, then using a non-meaningful pseudonym and not posting anything personally identifiable is recommended.


👤 bluepuma77
Interesting how everyone says „But I can google you“ instead of thinking about the issue.

Companies are building and selling GPT-3 with 175 billion parameters, and one of those „parameters“ seems to be OP‘s username and his „strange“ two-word last name.

If models grow bigger, they will potentially contain personal information about every one of us.

If you can get yourself removed from search indices, shouldn’t there be a way for AI models, too?

Another thought: do we need new licenses (GPL, MIT, etc.) that disallow use for (for-profit) AI training?


👤 diamondage
There is a legitimate question here. A lot of comments are trashing this post because his/her name is already all over the internet. But European law has the 'right to be forgotten', i.e. you can write to Google and have your personal information removed, should you so wish. How might we address this with a GPT-3-like model?

👤 thatjoeoverthr
I’m playing with it. After giving it my name, it correctly stated that I moved to Poland in Summer ‘08, but then described how I became some kind of techno musician. I ran it again and it said wildly different stuff.

I have to say playing with GPT-3 has been a mind-blowing experience this week and you should all try it.

The most striking point was discovering that if I give it texts from my own chats, or copy paste in RFPs, and ask it to write lines for me, it’s better at sounding like a normal person than I am.


👤 ReactiveJelly
> Posts from 13-year-old me?

Right, this is why opsec is something that you must always be doing.

Anything you say can be preserved forever.

Better to use short-lived throwaway identities, and leave yourself the power of combining them later, than to start with one long-lived identity and find yourself unable to split it up.

It's inconvenient in real life that I'm expected to use my legal identity for everything. If I go to group therapy for an embarrassing personal problem, someone there can look me up because everyone is using real names. I don't like it.


👤 criddell
From the TOS:

> Exercising Your Rights: California residents can exercise the above privacy rights by emailing us at: support@openai.com.

If you happen to be in California (or even if you are not) it might be worth trying to go through their support channel.


👤 kixiQu
The comments do not seem to be addressing something very important:

> I don't care much about my name being public, but I don't know what else it might have memorized (political affiliations? Sexual preferences? Posts from 13-year-old me?).

Combine this with

https://news.ycombinator.com/item?id=28216733 https://news.ycombinator.com/item?id=27622100

Google fuck-ups are much, much more impactful than you'd expect because people have come to trust the information Google provides so automatically. This example is being invoked as comedy, but I see people do it regularly:

https://youtu.be/iO8la7BZUlA?t=178

So a bigger problem isn't what GPT-3 can memorize, but what associations it may decide to toss out there that people will treat as true facts.

Now think about the amount of work it takes to find out about problems. It's wild that you have to Google your own name every once in a while to see what's being turned up and make sure you're not being misrepresented, but that's not too much work. GPT-3 output, on the other hand, is elicited very contextually. It's not hard to imagine a bad association popping up only under certain circumstances that you can't hope to exhaustively discover.

From a personal angle: My birth name is also the pen name of an erotic fiction author. Hazy associations popping up in generated text could go quite poorly for me.


👤 mensetmanusman
Fascinating!

I didn’t anticipate the use case of GPT being used by debt collection agencies to tirelessly track down targets.

It will be a new type of debtors’ prison, where leaking enough personally identifying facets to the internet will string together a mosaic of the target, such that the AI sends them calls, SMS, Tinder DMs, etc. until they pay and are released from the digital targeting system.


👤 sitkack
I am sorry that so many comments show a lack of empathy, basically saying "what did you expect? Do better!". I think you are raising real concerns: these language models will get more and more sophisticated and will basically turn into all-knowing oracles, not just about who you are but about what would be effective in manipulating you.

👤 eterevsky
I just asked GPT-3 a few times who you are and here are its answers:

> BoppreH is an AI that helps me with my songwriting.

> I'm sorry, I don't know who that is.

> I'm sorry, I don't understand your question.

> BoppreH is an artificial intelligence bot that helps people with their daily tasks.

I have a feeling that I'll have better chances just googling you than asking GPT-3.


👤 hakuseki
This seems like a point in favor of models like REALM (https://ai.googleblog.com/2020/08/realm-integrating-retrieva...) which could allow for deletion of sensitive information without needing to retrain the model.
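
A toy sketch of why the retrieval-augmented idea helps here (this is not the REALM API, and every name and string below is invented for illustration): if the facts live in an external document store rather than in the weights, honoring a deletion request means removing a document, not retraining.

  # Toy illustration of retrieval-augmented generation: facts live in an
  # external corpus, so deleting a document "forgets" it without retraining.
  # Not the REALM API; all data here is made up.
  corpus = {
      "doc1": "BoppreH is a username associated with the full name <name withheld>.",
      "doc2": "The hypothetical Example Council can be reached at 00-000 0000.",
  }

  def retrieve(query: str) -> list[str]:
      """Naive keyword match standing in for a learned retriever."""
      terms = [t.strip("?.,!").lower() for t in query.split()]
      return [text for text in corpus.values()
              if any(term and term in text.lower() for term in terms)]

  def answer(query: str) -> str:
      """A 'generator' that can only state what the retriever returned."""
      hits = retrieve(query)
      return hits[0] if hits else "I don't know."

  print(answer("Who is BoppreH?"))  # surfaces the stored association

  del corpus["doc1"]                # honoring a deletion request
  print(answer("Who is BoppreH?"))  # -> "I don't know.", no retraining needed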

👤 mikequinlan
If you hadn't just announced that the result returned by GPT-3 is your full name, nobody would have known for certain that it was correct.

👤 trollied
> I try to hide my real name whenever possible, out of an abundance of caution

A quick google suggests that you don't.


👤 browningstreet
Just flew back from Europe. Still traveling actually.

It used to be that when you hit border control you present your passport.

They don’t ask for that anymore: border control waved a webcam at my face, called out my name, told me I could go through. Never once looked at my passport.

I think we’ve lost.


👤 haunter
Am I missing something? You had your full CV, with your full name, on your public homepage.

👤 gordaco
Obligatory xkcd: https://xkcd.com/2169/ .

I'm afraid that we are going to see these kinds of issues proliferate rapidly. It's a consequence of the usage of machine learning with extensive amounts of data coming from "data lakes" and similar non-curated sources.


👤 SnowHill9902
Rotate your usernames every 2 months. Use different usernames on every website. Rotate your full name every 10 years (as suggested by Eric Schmidt).

👤 m3047
What I find missing in the comments is any examination of the following sequence of hypothetical events:

1) Adversarial input conditioning is utilized to associate an artifact with others, or a behavior.

2) Oblivious victim users of the AI are manipulated into a specific behavior which achieves the adversary's objective.

Imagine a code bot wrongfully attributing ownership of pwned code to you, or misstating the license terms.

Imagine you ask a bot to fill in something like "user@example.com" and instead of filling in (literal) user@example.com it fills in real addresses. Or imagine it's a network diagnostic tool... ooh that's exciting to me.

Past examples of successful campaigns to replace default search results via creative SEO are offered as precedent.


👤 WhiteNoiz3
Sadly, I think the only way to protect against this is with another AI whose job it is to recognize what data is appropriate to reveal and what is private - basically what humans do. But even then it will probably still be susceptible to tricks. Of course the ideal thing is just to not include it in the training data, but I think we know how much effort that would take when the training data is basically the entire internet. I wonder whether, as AI systems become more efficient and learn to "forget" information which isn't important and generalize more, this will become less of an issue.

👤 permo-w
if you want to stay anonymous online, don't try and hide, don't go for this magical, extremist, non-existent "full anonymity". spray out false information at random. overload the machine. give nothing real, then when you do want to be real, it's impossible to tell

👤 bogwog
If OpenAI can modify their models so that they don't output human images, would it really be so hard to modify GPT so it doesn't output names? For example:

> prompt: "Who was the first president of the United States?"

> response: "The first president of the United States was Aw@e%%t3R!35"

Sure, it'd make GPT less useful if it garbled all names, but that's a tradeoff made for the sake of ethics in the case of image generation. I don't see why this situation should be any different.
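
One hedged sketch of what output-side name filtering could look like, using off-the-shelf NER to swap PERSON spans for a placeholder before the text reaches the user. To be clear, this is not how OpenAI handles it (if they handle it at all); it's just one plausible mechanism, and NER will inevitably miss some names.

  # Sketch: redact person names from generated text with spaCy NER.
  # Assumes `pip install spacy` and `python -m spacy download en_core_web_sm`.
  import spacy

  nlp = spacy.load("en_core_web_sm")

  def redact_names(text: str, placeholder: str = "[NAME]") -> str:
      """Replace spans tagged PERSON with a placeholder. NER is probabilistic,
      so this is mitigation, not a guarantee."""
      doc = nlp(text)
      redacted = text
      # Replace from the end so earlier character offsets stay valid.
      for ent in sorted(doc.ents, key=lambda e: e.start_char, reverse=True):
          if ent.label_ == "PERSON":
              redacted = redacted[:ent.start_char] + placeholder + redacted[ent.end_char:]
      return redacted

  print(redact_names("The first president of the United States was George Washington."))
  # -> "The first president of the United States was [NAME]."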


👤 bribri
It knows who I am. My username isn't that obscure though. It told me it found me on reddit and stackoverflow when I asked.

> Who owns the username bsunter?

> The user name "bsunter" is most likely owned by a person named Brian Sunter.

https://twitter.com/Bsunter/status/1541106363576659968?s=20&...


👤 lee101
Piggyback on this/shameless plug - if anyone's looking for a language/code generator that doesn't store and then train on your input, check out https://text-generator.io

That is to say, PII you send to OpenAI as their customer might be leaked, whereas https://text-generator.io persists nothing of the customer input data and doesn't train on it.

In terms of what's in there right now, though: it's likely trained on The Pile, so you'd probably already be in the dataset before becoming a customer. Reddit is in there, Hacker News, and GitHub too I believe, and the same goes if you're authoring PubMed papers etc., so it's hard to stay anonymous forever.


👤 NicoleJO
You are not alone. Others have complained about the same thing: OpenAI GPT and CoPilot Plagiarism Links -- Caught in the Act! https://justoutsourcing.blogspot.com/2022/03/gpts-plagiarism...

👤 fastball
If you want to be anonymous make a new username.

BoppreH is burned for this purpose.


👤 chmod775
Neural nets will be able to reliably "dox" people within microseconds, if they aren't already.

Cross referencing information they're trained on is something they're pretty good at.

"What is xxxSteven007xxx's real name?" is a question a human may need hours, if not days, of research to answer. Maybe Steven only slipped up once and used the wrong e-mail address in 2007 on some forum, quickly changing it, but not before it was archived by a crawler. Two decades later a neural net would be trained on that information and also on the 7 other links of the chain that ties Steven's username to his real identity.

"Steven" disappeared, because a police officer asked GPT-5 to give them a list of people who are critical of the ruling party.


👤 PhantomBKB
To the original poster:

I understand what has happened, but in the future try to take better care of your online presence. To remain anonymous, it's essential to create a completely different username each time you sign up to a website. That way, it becomes much harder to track you across the web. In addition to that, some people also use a VPN to mask their IP. Some also use different or anonymous email addresses.

For damage control, I'd advise you to delete the accounts that can be deleted. If you prefer, create new ones using the above-mentioned safety practices.


👤 mullikine
When I tested for this type of thing last year, I found GPT-3 produced real-looking phone numbers, but none of them were correct. It would, however, certainly produce factual information sometimes.

  title: "Search for phone number"
  prompt: |+
      Contact.
  
      Dunedin City Council, phone
      Mobile: 03-477 4000
      ###
      New Zealand Automobile Association
      Phone: 09-966 8688
      ###
      <1>
      Mobile:
  engine: "davinci"
  temperature: 0.1
  max-tokens: 60
  top-p: 1.0
  frequency-penalty: 0.5
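
For reference, that config maps more or less directly onto the completions endpoint OpenAI exposed at the time. A rough sketch with the same settings, using the legacy pre-1.0 openai Python client; the organization line below stands in for the <1> template slot:

  # Rough sketch of running the prompt above against the legacy OpenAI
  # completions API (openai-python < 1.0). Parameters mirror the YAML config.
  import openai

  openai.api_key = "sk-..."  # your own key

  prompt = (
      "Contact.\n\n"
      "Dunedin City Council, phone\n"
      "Mobile: 03-477 4000\n"
      "###\n"
      "New Zealand Automobile Association\n"
      "Phone: 09-966 8688\n"
      "###\n"
      "Some Organization Name\n"  # placeholder for the <1> slot
      "Mobile:"
  )

  response = openai.Completion.create(
      engine="davinci",
      prompt=prompt,
      temperature=0.1,
      max_tokens=60,
      top_p=1.0,
      frequency_penalty=0.5,
  )
  print(response.choices[0].text)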

👤 danielscrubs
I feel you. My real-name internet persona is carefully self-censored to make me seem less flawed and more responsible. If people knew I was making quirky and sometimes buggy games in my free time, my CV would be thrown in the bin. Competition is fierce and HR will always be the first to filter out candidates. In isolation, making quirky games isn’t bad; but working in finance requires a suit and a strong aura of responsibility, while the other requires a sense of surprise from breaking norms… I love them both…

👤 legacynl
> I don't care much about my name being public, but I don't know what else it might have memorized (political affiliations? Sexual preferences? Posts from 13-year old me?). In the age of GDPR this feels like an enormous regression in privacy.

It's an interesting question! One of the reasons for the GDPR was 'the right to be forgotten': deletion of old data, so that things you did 10 years ago don't come back to haunt you later. But how would this apply to machine-learning models? They get trained on data, but even when that data is deleted afterwards, the information is still encoded in the model. If it's possible to retrieve that data by asking certain questions, then imo the model should be treated as a fancy form of data storage, and thus deleted as well.


👤 groffee
Ignoring the fact that OP isn't anonymous at all, it's actually an interesting question about language models and AI.

Who is BoppreH?

Who was Jack the Ripper?

Who was the Zodiac Killer?

Who actually was William Shakespeare? (some people think he didn't actually write anything himself and others wrote for him)

It's conceivable that at some point a model will be created that could answer those kinds of questions with a reasonable degree of accuracy.


👤 cryptica
IMO, so long as you're not doing anything illegal, using hate speech, or being a complete troll, it shouldn't be too much of a problem.

I don't think I will mind too much if my full internet identity becomes public someday but I hope that people will be savvy enough to look at it through the right lens.


👤 2143
Doesn't your name sort of rhyme with "focus"? :)

Took less than 15 seconds; wasn't from GPT-3.

Some tips:

1. Your choice of username for HN isn't very smart.

2. Regardless of the above, you should have made this particular post via an anonymous throwaway account.

This submission is trending on HN. A lot of people are going to find your name.


👤 skywal_l
Sorry for being an idiot, but how does it work exactly? Where should I type what to see my name appear?

👤 neals
I've been playing around on https://beta.openai.com/playground. It seems very powerful and weird. What are some interesting things to try out?

👤 Stampo00
I am also very careful about using my name online. I've worked very hard to minimize my so-called digital footprint. My full name is unique enough that there's only one other person in the world who shares it with me. I get his email all the time.

I have friends with very common names. They share their names with hundreds of people, living and dead.

That gave me an idea. If you can't reduce the signal, you can at least increase the noise. If you spam the web with conflicting information tied to your name, and do it in a smart enough way that your noise can't be easily correlated, it should be just as effective. For example, if all of the noise is produced over the course of a single weekend, that's easy to filter out. So you'd need to create a slow, deliberate disinformation campaign around your name.

At one point, I even considered paying for fake obituaries in small local papers around the country. Maybe just one every year or so. Those things last forever on the web.

Good luck! If you choose to go this route, I wish you could share your strategies, but revealing too much might compromise your efforts.


👤 aenis
It's not unlike with stablecoins: either you have full privacy, if you haven't made any lapses in opsec, or you really have zero. Once you post enough for an association to be made, there is no undoing it, ever.

👤 Pakdef
It's really hard to keep a username distanced from your real identity... especially if you keep it for a long time, or what seems like forever, as you have...

👤 bencollier49
If you're based in Europe then you're quite right. This is a GDPR issue - identifiable PII in the model, and you can force the vendor to remove it.

👤 rglullis
By what measure is someone's name private?

👤 Tao3300
Brace for impact. Try not to be a dick.

👤 nudpiedo
Why not send an email to the company and then perhaps sue them? There's no ethical dilemma in suing: after all, they have to comply with privacy laws, and they are a closed, for-profit private company, so they should be accountable for it as much as anyone else.

👤 zweifuss
In principle, the Internet should support a "digital eraser" for personal information. But since that's illusory, I've always been against requiring clear names in forums and social media. Given the dangerous nature of social media, I would also be in favor of a minimum age of, say, 18.

👤 1vuio0pswjnm7
If I recall correctly, GPT-3 is trained from Common Crawl archives.

Like Internet Archive, Common Crawl does allow websites to "opt out" of being crawled.

Both Internet Archive and Google Cache store the Common Crawl's archives.

It is not difficult for anyone to search through Common Crawl archives. The data is not only useful for constructing something like "GPT-3". Even if one manages to get OpenAI to "remove" their PII from GPT-3, it is still accessible to anyone through the Common Crawl archives.
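
To the point that searching the archives is easy: Common Crawl exposes a public CDX-style index you can query over HTTP. A small sketch follows; the crawl name below is an assumption, and the current list is published at https://index.commoncrawl.org/collinfo.json.

  # Sketch: list Common Crawl captures matching a URL pattern via the CDX index.
  # The crawl name is an example; check collinfo.json for current crawls.
  import json
  import requests

  INDEX = "https://index.commoncrawl.org/CC-MAIN-2022-27-index"

  resp = requests.get(
      INDEX,
      params={"url": "example.com/*", "output": "json"},
      timeout=30,
  )
  resp.raise_for_status()

  # The endpoint returns one JSON object per line, one per capture.
  for line in resp.text.splitlines():
      record = json.loads(line)
      print(record["timestamp"], record["url"], record.get("status"))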

Common Crawl only crawls a "representative sample" of the www (cf. internet). Moreover, this excludes "walled gardens" full of "user-generated content". AFAIK, it is not disclosed how Common Crawl decides what www sites to crawl. It stands to reason that it is based on subjective (and probably commercially-oriented) criteria.

Common Crawl's index is tiny compared to Google's. Google has claimed there is much of the www it does not bother to crawl because it is "garbage". That is obviously a subjective determination and, assuming Google operates as it describes in its SEC disclosures, almost certainly must be connected to a commercial strategy based on advertising. Advertising as a determinant for "intelligence". Brilliant.

OpenAI myopically presumes that the discussions people have and the "content" they submit via the websites in Common Crawl's index somehow represent "intelligence". If pattern-matching against the text of websites is "intelligence", then pigs can fly. Such data may be useful as a representation of www use and could be used to construct a convincing simulation of today's "interactive www"^1 but that is hardly an accurate representation of human intelligence, e.g., original thought. Arguably it is more a representation of human contagion.

There are millions of intelligent people who do not share their thoughts with www sites, let alone have serious discussions via public websites operated by "tech" company intermediaries surveilling www users with the intent to profit from the data they collect. IME, the vast majority of intelligent people are disinterested in computers and the www. As such, any discussion of this phenomena _via the www_ will be inherently biased. It is like having discussions about the world over Twitter. The views expressed will be limited to those of Twitter users.

1. This could be a solution to the problem faced by new www sites that intend to rely on UGC but must launch with no users. Users of these sites can be simulated. It is a low bar to simulate such users. Simulating "intelligence" is generally not a requirement.


👤 eurasiantiger
Have you tried prompting GPT-3 for your personal dossier or autobiography?

👤 junon
HN never ceases to amaze. Regardless of your stance on online privacy practices the OP woulda/coulda/shoulda deployed, this is a GDPR violation if he cannot have his information removed. Plain and simple.

👤 jacquesm
Very sloppy on the part of the language model makers; they should have filtered stuff like that out of their input stream.

Are you in Europe?

If so you might have a GDPR track available to you for getting it removed. You may also want to do a DSAR.


👤 m0rissette
I’m in the US so I don’t have the pleasure of GDPR. But honestly, it’s open sourced. Therefore just give up hope on privacy. See: others of us just have never done anything horrible online or were raised properly with “if you have nothing nice to say don’t say anything at all”. I’ll save the earth a bit and reduce the computation and environmental impacts of said computation. I’m sure you can find my real name at mattharris.org and search through all the morissette related usernames that are mine.

👤 lumost
If you are in a jurisdiction protected by the GDPR, consider filing a complaint. The fines under GDPR are scary enough to force Google or Microsoft to act.

👤 fl0id
Sounds like you have an interesting case. Contacting an ethical AI fund, the EFF, or similar might help to start a process.

👤 OhNoNotAgain_99
Anonymity doesn't exist, period. Get used to it.

👤 themerone
Models that can't have personal data scrubbed are a dead end. Legally, companies must be able to scrub data to comply with the CCPA, the GDPR, and likely other future laws.

Scrubbing AI output is not sufficient.


👤 SnowHill9902
So does the Library of Babel.

👤 that_guy_iain
> In the age of GDPR this feels like an enormous regression in privacy.

As you stated, this is publicly available information. GDPR has nothing to do with it.


👤 tomphoolery
GDPR is the "digital TSA": a huge, overbearing law that gives people the illusion of security without actually delivering on that promise. In classic EU/world-government fashion, it's a neat-sounding concept but totally impractical to enforce. People think "oh, I can just click this button to delete my data", but your data is likely not being deleted, just anonymized. Technically, someone could still trace all of that data back to you if they felt like it.

👤 hdjjhhvvhga
It isn't some "AI"; it's a concrete product implemented by real people and released by several big companies in order to make more money. Of course they'll play the "it's not us, it's the AI" card, and it is up to us whether we let them get away with it.

👤 ggktk
This post seems very disingenuous; it could even be FUD. I can't help but think the author has some ulterior motive.

Anyway, my advice: Treat your current username and your real name as if they were the same. Make a new username and don't connect it to your real name again if you wanna be anonymous.