With that, we know GPT-4 is better but... not perfect? I highly doubt GPT-4 "hallucinates" 0% of the time.
Worst case scenario/devil's advocate: given that the output can't be trusted due to its relatively low accuracy, we can't start replacing tedious/monotonous human tasks with AI yet if we value the output being correct (which... we do). If a human has to validate the LLM output, that might take as much time (or more) as doing the task in the first place.
Don't get me wrong, I love talking to ChatGPT for all sorts of reasons. Plus, I see a ton of open-source, non-commercial options popping up. Is it safe to assume those are almost always going to be lower quality than GPT-4 (which is sort of the expensive gold standard, right)?
Curious about everybody's thoughts. Not looking to start some massive "bash LLMs/underplay the achievement" thread. Just looking for... why are so many people obsessed with creating so many GitHub projects if at the end of the day the output isn't worth much? They're partially noise/garbage output generators, no?
And even if you can't do that now, there's a chance it might work in the future, so you can legit tell your investors your plan is to drastically lower costs starting in around two to four years. And it's still considered plausible, so when we're both sued by our investors for fraud in four years, you can legit claim you acted in good faith and on advice from people considered "experts."
But to keep up the charade, we must blatantly ignore any bad result and brand any academic with the temerity to get in the way of my startup earning me a bazillion dollars as a crank.
So... are you a crank or do you want to join "team bazillionaire" and ride this cash cow to the bank!?!?
Remember... it's not a sin to be lied to.
Thing with humans is that when they don't know, they know they don't know, and if they're honest and upright, they say they don't know, and then you know that they don't know. Not so with LLMs. When they don't know and say "I'm an AI model, I don't know XYZ," it has been triggered by a chain of if statements meant as guardrails.
Otherwise it'll continue to "hallucinate", the kind of behavior that, if a human exhibited it, we'd call them a habitual liar.
LLMs and their cousins are, however, a productivity booster for experts. They can produce something that an expert can use as a foundation and then tweak further on their own.
I mean, I get a lot of the moral arguments against it - particularly where image generation is concerned - but we're well past the point of wondering whether it's even actually useful, it just isn't useful for everything at the moment. But yes, it is useful. It doesn't have to be perfect - humans aren't perfect. Humans "hallucinate" working code and truthful data all the time. But something that's adequate for a lot of tasks and which will get you most of the way there for a lot of other tasks is still useful.
Edit: I asked it to generate an SVG image of a smiling cat face against a white background and... it wasn't so great. I asked for one of the Nike logo and it seems to have hallucinated an external link to a WordPress site. I told it to only use valid URLs and it gave me this:
So yeah, not bad.
It's not that it's always correct in everything it says, and you definitely need to call its bullshit sometimes when it's contradicting itself or whatever; that's fine. I've had it invent libraries and conflate data structures a few times, so it's far from perfect.
I find it works best when you're familiar with the domain, or when you're asking basic questions about something you're not familiar with. It's not making me a 10x more productive developer or anything like that, but it's absolutely been helpful, significantly more so than Google recently.
(I like it so much I'll shill for it even though I technically operate a competing search service)
In particular, it has been established that MTurk-like workers (unless you hire bona fide "elite", expert-level ones, which in general doesn't scale) make lots and lots of mistakes. It's just a question of what your acceptable noise level is to turn a dollar on whatever task needs to be done. So for the "dumb" 80 percent of use cases where a GPT-like system can outperform an average MTurk worker on flat, accuracy-per-dollar-spent terms -- that's the route most businesses will ultimately take.
Am just guessing on the "80 percent" of course, but from all that's been published in recent months (and putting my finger to the wind) -- that's where these systems seem to be heading.
This surfaces a lot of useful feedback, especially if I constrain the model to list 3 specific places where the text reads as jargon-y or 3 specific points I could address to make the text more accessible.
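In case it's useful, here's roughly what that kind of constrained review prompt looks like when scripted. Everything here (the file path, model name, exact wording) is just a placeholder, and the call assumes the pre-1.0 openai Python SDK:

```python
# Hypothetical sketch: ask an LLM for exactly three jargon call-outs on a draft.
# Assumes the pre-1.0 openai Python package and OPENAI_API_KEY set in the environment.
import openai

draft = open("draft.md").read()  # placeholder path for the text being reviewed

prompt = (
    "Review the text below. List exactly 3 specific places where it reads as "
    "jargon-y, and exactly 3 specific changes that would make it more accessible "
    "to a general reader. Quote each passage you refer to.\n\n" + draft
)

response = openai.ChatCompletion.create(
    model="gpt-4",  # assumed model name
    messages=[{"role": "user", "content": prompt}],
)
print(response["choices"][0]["message"]["content"])
```

Constraining the output to a fixed number of concrete items seems to be what keeps the feedback specific instead of generic praise.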
It’s a good example of where specialized/task specific data prep and application/pipeline development with a good UX actually does add value. My pov as someone currently developing an applied ml product and having worked on ml for the last decade or so, is to think of these new LLMs as very good general purpose natural language processors for some use cases or natural language generators for others. That means your product needs to be one that can handle errors well (hallucinations as well as misclassifications or bad surface form formulations) and still be useful for customers. Many times this takes thinking through a novel approach to the UX, not just plugging in another chatbot interface. To me, it’s like the chatbot craze from a few years ago where there are definitely use cases where this works but many that simply don’t even if the model was near perfect.
GPT-3.5 is actually stable enough with some training. You can use it on the typical grunt stuff: formatting text, sorting a pile of data into tables, writing boilerplate code.
It converts an API into code that accesses that API perfectly fine, with a lower rate of mistakes than I make. But you need to give it examples. I use it as a form of metaprogramming, where I give it instructions and it spits out the code, with caching and all, better than what Copilot can do.
As a rule of thumb, treat GPT-3.5 as a worker with a thousand years of experience and an IQ of 80. GPT-4 is more like one with an IQ of 130; you don't have to prompt engineer and all, but it's not omniscient.
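For anyone curious, the "give it examples" metaprogramming workflow is basically few-shot prompting: show one worked endpoint-to-code example, then ask for the same pattern on a new endpoint. A rough sketch below, with made-up endpoints and model names, assuming the pre-1.0 openai Python SDK:

```python
# Hypothetical sketch of few-shot "metaprogramming" for API client code.
# The example endpoint, helper names, and model are illustrative only.
import openai

EXAMPLE = '''
Endpoint: GET /users/{id}
Generated code:
def get_user(session, user_id):
    """Fetch a single user, with simple in-memory caching."""
    if user_id in _cache:
        return _cache[user_id]
    resp = session.get(f"{BASE_URL}/users/{user_id}")
    resp.raise_for_status()
    _cache[user_id] = resp.json()
    return _cache[user_id]
'''

TASK = "Endpoint: GET /orders/{id}/items\nGenerated code:"

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # assumed model name
    messages=[
        {"role": "system",
         "content": "You generate Python API client functions following the exact style of the example."},
        {"role": "user", "content": EXAMPLE + "\n" + TASK},
    ],
)
print(response["choices"][0]["message"]["content"])
```

The worked example does most of the heavy lifting; the model mostly pattern-matches the caching and error-handling style you showed it.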
My favorite thing about working with LLMs is that I can go for hours with wild experiments in a way that is very close to my early childhood basement lab experiments with biology, chemistry, and later, computers, and music. It's fun!
To me, anything that benefits from pair work, which is everything from the art of computer programming to literate programming to research to sensemaking, and things that give rise to reflection or the need for a feedback loop, like design.
There's learning in working with a partner who's never quite right the first time, just like me. And so we riff back and forth and I get to decide when we're done. Sometimes that's a couple riffs, sometimes it goes on for hours. It's absorbing in a good way - a way that I thought was lost from computing when everything became commercial and social. If I didn't know better, I'd think the ghost in the machine was Alan Kay or Bret Victor.
To have found a partner in my aimless pairing exercises has been the best discovery I've made in a decade.
Here's a satirical example of one of my recent art experiments. I was surprised since I had no expectations. YMMV.
https://davidwatson.org/write/walt_whitman_agile_poet.html
I've got dozens of others lying around, but not enough time to publish them all. Obviously, an LLM is not going to make me into an artist. I'm just enjoying spending time with emergent art grown by human-computer partners.
For someone like me whose default mode is definitely divergent, it's fun to have a responsive computer program that has the same kind of responsive cadence or volley I expect from one of my musician friends trading fours.
I use a language learning app that has short passages written at various levels of difficulty, but they don't have any discussion questions, so I just threw the text into ChatGPT and had it write some, and they came out great.
You might also peruse the chatgpt subreddit, although there are a lot of people just trying to troll or hack it, there are also some interesting use cases popping up.
Honestly this reminds me of the conversation around home computers from several decades ago. Human creativity is wild but takes a beat.
The same applies to discussion forums like this one.
Or, OP can proofread the output of these tools and decide what to use and what not to use.
Last night GPT 4 gave me a hallucinated SCOTUS precedent. Because I checked it, I did not embarrass myself.
Here is the Recycled Electrons Rule: Assume everything coming from GPT 4 is coming from a Reddit user with a stoner handle. Just because StonerSmurf73 says it, does not mean I believe it without checking. (I made up that stoner user name.)
> If a human has to validate the LLM output, that might take as much time (or more) as having done the task in the first place.
I also disagree that validation is always the same amount of work as creation. In certain scenarios, I can verify the answers with less work. I usually ask ChatGPT to get an answer first and use Google for validation.
Edward de Bono, the person who coined the term lateral thinking, lists random juxtaposition as one of the tools for spurring creativity. [1], [2]
In his Six Thinking Hats [2], he writes about 6 different modes of thinking coded by colored hats.
- Blue hat: Organizing
- White hat: Information, facts, stats
- Green hat: Creative ideas
- Yellow hat: Benefits and positives
- Black hat: Caution, risks, negatives
- Red hat: Emotions
He asks us (a team) to look at a problem (he calls it the Focus; e.g., "Focus: AGI in military use") wearing one hat at a time. So, when we all wear the White hat, we bring data, citations, previous relevant work, etc. We don't expend energy evaluating this data at the moment (that comes later, when we wear a different hat, i.e., the black hat).
His theory is that we can think better with the Six Thinking Hats method.
So, applying this analogy to LLMs, hallucinations of LLMs can be thought of as the LLM wearing a green hat.
A theorem prover or fact checker can be added to act as a black hat. (LLMs themselves are capable of doing this critical review -- e.g., list 5 points for and against fossil fuels.)
Extending this analogy further, we have tools like LangChain [3] that are focused on the organizing bit (blue hat), ChatGPT plugins that provide up-to-date information, run computations or use 3rd party services (white hat).
Green and Yellow hats are out-of-the-box supported by LLMs already.
Red hat is a sentiment analyzer (which is a classic machine learning task) that LLMs already subsume.
So, it is just a matter of time before this gets refined and made useful enough that we don't have to worry about hallucinations getting in the way.
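To make that concrete, here's a rough sketch of one pass generating freely (green hat) and a second pass critiquing the output (black hat). The prompts and model name are just illustrations, the calls assume the pre-1.0 openai Python SDK, and this isn't a LangChain recipe:

```python
# Hypothetical "green hat then black hat" pipeline: generate without judging,
# then run a separate critical pass over the generated ideas.
import openai

def ask(prompt):
    resp = openai.ChatCompletion.create(
        model="gpt-4",  # assumed model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp["choices"][0]["message"]["content"]

focus = "AGI in military use"  # the de Bono-style Focus, made up for illustration

# Green hat: generate ideas freely, explicitly without evaluation.
ideas = ask(f"Focus: {focus}\nList 5 creative ideas. Do not evaluate them.")

# Black hat: critique the green-hat output and flag claims needing fact-checking.
critique = ask(
    f"Focus: {focus}\nHere are some proposed ideas:\n{ideas}\n\n"
    "For each idea, list the risks, weaknesses, and any factual claims that "
    "should be checked before taking it seriously."
)

print(ideas)
print(critique)
```

The same split would work with an external fact checker or theorem prover standing in for the black-hat pass instead of a second LLM call.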
[1]: https://www.amazon.com/Serious-Creativity-Thoughts-Reinventi...
[2]: https://www.amazon.com/Six-Thinking-Hats-Edward-Bono/dp/0316...
I think there are several reasons the projects seem overly ambitious right now. First, a lot of people think the tech will improve exponentially in the next 12 months (I have no idea or opinion on whether that's true, but the sentiment is definitely out there). They think that even if the output of the project is crappy today, improvements in ChatGPT will make it worthwhile soon.
Second, a lot of people are working with ChatGPT just to get familiar with it and learn about LLMs or ML in general. The point isn't to make a project with a useful application, but to learn from making the project that they did or put a ML project in their portfolio.
Third, for a lot of people getting a starting point is worthwhile. If you find it easier to go in and change a paper that someone else drafted or change boilerplate code to fit your needs, then having some ChatGPT output as a starting point has value (even if it's a crappy starting point).
Fourth, for a lot of people the expected quality of the output is surprisingly low. I've talked to several people who work in marketing who are using ChatGPT to write blurbs for products. That text almost doesn't matter because people aren't really reading the paragraph that promotes a spatula on Amazon or a B movie they've never heard of on Netflix. It doesn't seem like a big difference, but if you sit down to write 10 of those paragraphs, you'll find it's less effort to ask ChatGPT to do it and then just proofread the output.
And I think you'd be surprised at where quality matters very little. It's not hard to imagine a company using ChatGPT as the first line of customer support via email or web chat. It can certainly dispense canned troubleshooting steps well. Just have the AI report the tone of the customer's responses and escalate them to a human if they get too upset or argumentative. If the AI said something that was nonsense, the human can apologize and say that they'll "speak with the employee" the customer was working with before. You already can't trust the output of humans in jobs where you're paying them too little to actually care, so why not use an LLM instead? If the cost of the LLM plus the freebies you give out to appease the customers it makes mad is lower than the cost of hiring humans...
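Sketching that out, it's basically a classify-then-escalate loop. Everything below (prompts, model name, the escalation rule) is made up for illustration and assumes the pre-1.0 openai Python SDK:

```python
# Hypothetical first-line-support sketch: draft a canned reply, classify the
# customer's tone, and hand off to a human when the tone crosses a line.
import openai

def ask(prompt):
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # assumed model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp["choices"][0]["message"]["content"]

def handle_ticket(customer_message):
    # Tone check first; escalate angry customers to a human agent.
    tone = ask(
        "Classify the tone of this customer message as one word: "
        f"calm, frustrated, or angry.\n\n{customer_message}"
    ).strip().lower()

    if "angry" in tone:
        return None  # signal that a human should take over

    # Otherwise, let the model dispense canned troubleshooting steps.
    return ask(
        "You are first-line customer support. Reply politely with standard "
        f"troubleshooting steps for this issue:\n\n{customer_message}"
    )

reply = handle_ticket("My router keeps dropping the connection every hour!")
print(reply or "Escalated to a human agent.")
```

The economics in the comment above come down to whether that escalation path plus the occasional apology costs less than staffing the whole queue with people.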