HACKER Q&A
📣 agonz253

Is GPT 4's quality lately worse than GPT 3.5's?


Has anyone else encountered this phenomenon lately? I've found myself prompting GPT 3.5 with simple questions that GPT 4 answered incorrectly, and lo and behold I get a much better answer.

For ex this is GPT 4: https://chat.openai.com/share/e24501ad-8f1c-4b5a-a6d0-d933f5d1d209

And this is GPT 3.5: https://chat.openai.com/share/b9372bdc-ffff-4655-bee4-2b3f3c3b8285

In the latter case I didn't even need to ask for the ORDER BY clause; it anticipated it and provided an answer for it. GPT 4's first answer was wrong.

In the past two days I've seen at least 2 other cases where GPT 4's answer was plain wrong and GPT 3.5's was not only correct but of very high quality, reminding me of what I first felt when using GPT 4 for the first time.


  👤 cowthulhu Accepted Answer ✓
I’ve found for tasks that require reasoning (or the illusion of reasoning), GPT4 continues to be much, much stronger than GPT3.5 - especially around handling unexpected inputs, determining the intent behind the instructions and applying that (instead of just following the instructions to the letter), and complex problems that require multiple steps of reasoning.

👤 tmpz22
It's almost as if OpenAI used VC resources to artificially boost their product's efficacy during the early hype, and over time we're seeing the product degrade as the company optimizes for efficiency and profitability and removes its subsidies.

Like AirBnB. Or Uber. Etc.


👤 briga
One thing to note when making comparisons like this is that LLM output is not deterministic, in the sense that if you ask it the same question 10 times you will get 10 different answers. So the question to ask is not "is GPT4 better on this one specific question?", but rather "does GPT4 produce better results on average?". I would bet that it does, for no other reason than that it is much larger, and LLM performance seems to scale with size. Also worth noting is that the more detailed your prompt is, the better the response will be. Sometimes you have to encourage GPT to get the best results. GPT4 should be able to handle much more complex and detailed prompts than 3.5.
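A toy simulation makes briga's point concrete (the scores and noise levels here are invented for illustration, not real benchmark numbers): even when model A is genuinely better on average, a single noisy sample per model will still crown the weaker model a nontrivial fraction of the time.

```python
import random

def compare_once(rng, mean_a, mean_b, noise=1.0):
    # One "ask each model the same question once" trial:
    # each observed score is the model's true average quality
    # plus per-sample noise from non-deterministic generation.
    a = rng.gauss(mean_a, noise)
    b = rng.gauss(mean_b, noise)
    return a > b

def win_rate(mean_a, mean_b, trials=10_000, seed=0):
    # Repeat the single-question comparison many times and
    # report how often model A comes out ahead.
    rng = random.Random(seed)
    wins = sum(compare_once(rng, mean_a, mean_b) for _ in range(trials))
    return wins / trials

# Model A is clearly better on average (7.0 vs 6.0), yet with
# noisy single samples it still "loses" roughly a quarter of
# head-to-head trials.
rate = win_rate(7.0, 6.0)
print(f"A beats B in {rate:.0%} of single-question trials")
```

This is why one anecdotal side-by-side, like the two shared chats above, can't settle which model is stronger; you'd want many trials per prompt.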

👤 tutfbhuf
For all who have the "feeling" that GPT4 has become worse: please try Bing Chat in creative mode. It seems to be using an older GPT4 version (March) plus Bing Search. In my opinion it is much better than the current ChatGPT-4 version from OpenAI.

👤 seanhunter
You will get a much more objective comparison if you use the OpenAI API and set the temperature parameter to zero. You may need to use the Azure version of GPT-4: at least for me, only GPT-3.5 is available on the "real" OpenAI API, whereas GPT-4 is definitely available on Azure.
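A minimal sketch of what seanhunter suggests, with temperature pinned to zero. The model names and prompt are placeholders, and the helper below only builds the request body; you'd send it with whatever client you use (for example `client.chat.completions.create(**req)` in the official OpenAI Python SDK):

```python
def build_comparison_request(model: str, prompt: str) -> dict:
    # Request body for a chat completion with temperature=0, so
    # repeated runs of the same prompt are as reproducible as the
    # API allows (not perfectly deterministic, but far less noisy
    # than the default sampling temperature).
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
    }

# Build one identical request per model for a side-by-side test.
for model in ("gpt-3.5-turbo", "gpt-4"):
    req = build_comparison_request(
        model, "Write a SQL query with an ORDER BY clause."
    )
    print(req["model"], req["temperature"])
```

Sending the same temperature-0 prompt to both models removes most of the sampling luck from the comparison.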

👤 ezedv
In my case, it's gotten better for me!

Furthermore, for those interested in GPT Development, we at Rather Labs are proud to be at the forefront! https://www.ratherlabs.com/gpt-development


👤 kromem
Yes, and I'm willing to bet that within 12 months we'll look back and realize this was due to the fine-tuning: taking the world's SOTA pretrained model, aligned with "completing human text", and putting it in the box of "you are an AI without feelings or desires tasked with XYZ."

The search space on the fine tuned GPT-3.5 chat models versus the foundational Davinci text completion model is MUCH more narrow, particularly in starting off.

Even with the same temperature, you'll see any marketing-style prompt for chat begin with "Introducing XYZ..." around 30% of the time, as if it's a junior door-to-door salesman, whereas the foundational model doesn't have any single intro that's nearly that common across runs and generally employs a much broader vocabulary.

We saw Google shoot LaMDA in the foot after Blake Lemoine's press tour, which set them behind for the next round of competition.

Now we're watching OpenAI snatch defeat from the jaws of victory out of anxiety around oversight and articles like 'Sydney' interviewed by the NYT.

For anyone following along in the $100 million+ training space: maybe don't overreact to press overreactions that will blow over within months as users get hands-on experience, or you'll blow your lead and waste massive amounts of resources and time.

This was a "user education" issue and not a "handicap your product" issue, in both cases.


👤 muzani
My observation (ChatGPT and not the API models):

For code, 3.5 is superior. 3.5 allows about 21k tokens of input, while ChatGPT 4 allows around 10k. This also makes it a lot better for boilerplate work, as it can take a lot more input, and it handles long conversations and iterations better.

Brainstorming, 4 is better. It's capable of some top tier brainstorming and it argues back quite frequently.

Unguided creative writing (describe a potato), they're roughly equal.

Guided creative writing (i.e. write a story around (400 words of requirements)), 4 is much better.

Poems and wordplay, 4 absolutely floors 3.5. Wider vocabulary and it's able to do rhymes and alliterations better, which humans are usually bad at.

For reasoning and riddles, 4 is still the benchmark among LLMs.

I really dislike that they named it GPT-3.5 instead of something like "Glide". It implies that it's inferior to 4, when they're just suited for different things.


👤 Exuma
For the very FIRST time, about 2-3 days ago, I got an answer that felt like it came with all the baggage you'd get from a human telling you "no".

I asked for a fairly simple regression in R on some simple data, and it gave me quite literally the lowest-quality answer and then told me I need to seek out a data scientist for the answer. I couldn't believe that.

One of the best parts about AI is not hearing pedantic or bullshit reasons why the thing you’re asking is wrong, isn’t possible, isn’t ideal, blah blah blah

I can’t wait for something way better I’m tired of this watered down nonsense


👤 muzani
There's a whole megathread on this on the OpenAI forums: https://community.openai.com/t/gpt-4-has-been-severely-downg...

👤 porkbeer
Around May, quality seemed to tank. I find "free" GPT just as good or better since I cancelled my subscription. It was glaringly clear that they nerfed it for whatever reason. I'll keep my tinfoil-hat thoughts to myself.