For example, this is GPT-4: https://chat.openai.com/share/e24501ad-8f1c-4b5a-a6d0-d933f5d1d209
And this is GPT-3.5: https://chat.openai.com/share/b9372bdc-ffff-4655-bee4-2b3f3c3b8285
In the latter case I didn't even need to ask for the ORDER BY clause; it anticipated it and provided an answer for it. GPT-4's first answer was wrong.
In the past two days I've seen at least two other cases where GPT-4's answer was plainly wrong and GPT-3.5's was not only correct but of very high quality, reminding me of what I felt when using GPT-4 for the first time.
Like Airbnb. Or Uber. Etc.
Furthermore, for those interested in GPT development, we at Rather Labs are proud to be at the forefront! https://www.ratherlabs.com/gpt-development
The search space on the fine-tuned GPT-3.5 chat models versus the foundational Davinci text-completion model is MUCH narrower, particularly at the start of a completion.
Even at the same temperature, you'll see any marketing-style prompt to the chat model begin with "Introducing XYZ..." around 30% of the time, as if it were a junior door-to-door salesman, whereas the foundational model has no single intro that common across runs and generally employs a much broader vocabulary.
We saw Google shoot LaMDA in the foot after Blake Lemoine's press tour, which set them behind for the next round of competition.
Now we're watching OpenAI snatch defeat from the jaws of victory out of anxiety around oversight and articles like the NYT's "Sydney" interview.
For anyone following along in the 100 million+ training space: maybe don't overreact to press overreactions that will blow over in months as users get hands-on experience, or you'll blow your lead and waste massive amounts of resources and time.
This was a "user education" issue and not a "handicap your product" issue, in both cases.
For code, 3.5 is superior. 3.5 allows about 21k tokens of input, while GPT-4 in ChatGPT allows around 10k. This also makes it a lot better for boilerplate work, since it can take a lot more input and handles long conversations and iterations better.
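As a rough way to check whether an input fits within limits like those, a crude character-based estimate is often good enough. This is only a sketch: the ~4-characters-per-token ratio is a common rule of thumb for English text, not the model's real tokenizer, and the 21k limit is just the figure claimed above.

```python
def rough_token_estimate(text: str) -> int:
    """Crude token estimate: roughly 4 characters per token for English text.

    A rule-of-thumb approximation only; for exact counts you'd use a
    tokenizer library matched to the specific model.
    """
    return max(1, len(text) // 4)


def fits_in_context(text: str, limit: int = 21_000) -> bool:
    # 21k is the input limit claimed above for 3.5; adjust per model.
    return rough_token_estimate(text) <= limit
```

For example, a 200,000-character file estimates to ~50k tokens and would be flagged as too large for either model.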
For brainstorming, 4 is better. It's capable of some top-tier brainstorming, and it argues back quite frequently.
Unguided creative writing (e.g. "describe a potato"), they're roughly equal.
Guided creative writing (i.e. "write a story around these ~400 words of requirements"), 4 is much better.
Poems and wordplay, 4 absolutely floors 3.5. It has a wider vocabulary and handles rhymes and alliteration better, which humans are usually bad at.
For reasoning and riddles, 4 is still the benchmark among LLMs.
I really dislike that they named it GPT-3.5 instead of something like "Glide". The name implies it's inferior to 4, when they're just suited to different things.
I asked for a fairly simple regression in R on some simple data, and it gave me quite literally the lowest-quality answer, then told me I needed to seek out a data scientist for the answer. I couldn't believe that.
One of the best parts about AI is not hearing pedantic or bullshit reasons why the thing you're asking for is wrong, isn't possible, isn't ideal, blah blah blah.
I can't wait for something way better; I'm tired of this watered-down nonsense.