HACKER Q&A
📣 behnamoh

Is it just me, or has GPT-4's quality significantly deteriorated lately?


It is much faster than before, but the quality of its responses is more like a GPT-3.5++. It generates more buggy code, its answers have less depth and analysis to them, and overall it feels much worse than before.

For a while, the GPT-4 on phind.com gave even better results than GPT-4-powered ChatGPT. I could tell the two GPT-4s apart by their speed: Phind's was slower and more accurate. I say "was" because apparently Phind is now trying to use GPT-3.5 and their own Phind model more frequently. So much for a GPT-4-powered search engine...

I wonder if I use Poe's GPT-4, maybe I'll get the good old GPT-4 back?


  👤 bbotond Accepted Answer ✓
Yes. Before the update, when its avatar was still black, it solved pretty complex coding problems effortlessly and gave very nuanced, thoughtful answers to non-programming questions. Now it struggles with just changing two lines in a 10-line block of CSS and printing this modified 10-line block again. Some lines are missing, others are completely different for no reason. I'm sure scaling the model is hard, but they lobotomized it in the process.

The original GPT-4 felt like magic to me, I had this sense of awe while interacting with it. Now it is just a dumb stochastic parrot.


👤 i_dont_know_
To me, it feels like it's started giving superficial responses and encouraging follow-up elsewhere -- I wouldn't be surprised if its prompt has been changed to something to that effect.

Before, if I had an issue with a library or debugging issue, it would try to be helpful and walk me through potential issues, and ask me to 'let it know' if it worked or not. Now it will try to superficially diagnose the problem and then ask me to check the online community for help or continuously refer me to the maintainers rather than trying to figure it out.

Similarly, I had been using it to help me think through problems and issues from different perspectives (both business and personal) and it would take me in-depth through these. Now, again, it gives superficial answers and encourages going to external sources.

I think if you keep pressing in the right ways it'll eventually give in and help you as it did before, but I guess this will take quite a bit of prompting.


👤 YeGoblynQueenne
My guess is: probably not. It's more likely you had a streak of good luck in your earlier interactions and are now observing regression to the mean.

That can easily happen, and it's why, for example, single medical studies are not taken as definitive proof of an effect.

To further clarify, regression to the mean is the inevitable consequence of statistical error. Suppose (classic example) we want to test a hypertension drug. We start by taking the blood pressure (BP) of test subjects. Then we give them the drug (in a double-blind, randomised fashion). Then we take their blood pressure again. Finally, we compare the BP readings before and after taking the drug.

The result is usually that some of the subjects' BP has decreased after taking the drug, some subjects' BP has increased and some has stayed the same. At this point we don't really know for sure what's going on. BP can vary a lot in the same person, depending on all sorts of factors typically not recorded in studies. There is always the chance that the single measurement of BP that we took off a person before giving the drug was an outlier for that patient, and that the second measurement, that we took after giving the drug, is not showing the effect of the drug but simply measuring the average BP of the person, which has remained unaffected by the drug. Or, of course, the second measurement might be the outlier.
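The blood-pressure example can be reproduced in a few lines of Python (an illustrative simulation I've added, not something from the original comment; all the numbers are made up): give every subject the same true BP, take two independent noisy readings, select the subjects whose first reading looked alarming, and watch their second reading fall back toward the mean with no drug involved at all.

```python
import random

random.seed(0)
TRUE_BP = 140.0  # every subject's true mean blood pressure
NOISE = 15.0     # reading-to-reading measurement variation

# Two independent noisy readings per subject; there is no drug effect.
subjects = [(random.gauss(TRUE_BP, NOISE), random.gauss(TRUE_BP, NOISE))
            for _ in range(100_000)]

# Keep only subjects whose FIRST reading was alarmingly high.
high_first = [(a, b) for a, b in subjects if a > 160]
avg_first = sum(a for a, _ in high_first) / len(high_first)
avg_second = sum(b for _, b in high_first) / len(high_first)

# The selected group's second reading regresses toward 140 even
# though nothing about the subjects changed between readings.
print(avg_first, avg_second)
```

The first average sits well above 160 by construction; the second lands near 140, which is exactly the "the drug appears to work" illusion described above.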

This is a bitch of a problem and not easily resolved. The usual way out is to wait for confirmation of experimental results from more studies. Which is what you're doing here basically, I guess (so, good instinct!). Unfortunately, most studies have more or less varying methodologies and that introduces even more possibility for confusion.

Anyway, I really think you're noticing regression to the mean.


👤 tmikaeld
There's no doubt that it's gotten a lot worse at coding. I've been running this benchmark against each new version of GPT-4: "Write a tiptap extension that toggles classes." Until now it got it right every time, but not anymore; now it hallucinates a simplified solution that doesn't even use the tiptap API. It's also 200% more verbose in explaining its reasoning, even when that reasoning makes no sense whatsoever -- it's like it's gotten more apologetic and generic.

The answer is the same on ChatGPT Plus and the API with GPT-4, even with the "developer" role.


👤 35997279
It’s been mostly fine for me, but overall I am tired of every answer having a paragraph long disclaimer about how the world is complex. Yes, I know. Stop treating me like a child.

👤 kypro
The researchers who worked on the "sparks of AGI" paper noted that the more OpenAI worked on aligning GPT-4 the less competent it became.

I'm guessing that trend is continuing...


👤 chaxor
The reason it's worse is basically that it's more 'safe' (not racist, etc.). That of course sounds insane, and it doesn't mean safety shouldn't be strived for -- but there's an explanation for how this occurs.

It occurs because the system essentially does a latent classification of problems into 'acceptable' or 'not acceptable' to respond to. When this is done, a decent amount of information is lost about how to represent latent spaces that may be completely unrelated (making nefarious materials and spouting hate speech now land in the same 'bucket' for the decoder).

This degradation was observed quite early on with the TikZ unicorn benchmark, which improved with training and then degraded when fine-tuning for safety was applied.


👤 Jack000
I tried to replicate a few of my chats (the displayed date is incorrect, it seems to be the publish date instead of the original chat date):

svg editor:

early april: https://chat.openai.com/share/c235b48e-5a0e-4a89-af1c-0a3e7c...

now: https://chat.openai.com/share/e4362a56-4bc7-45dc-8d1b-5e3842...

Originally it correctly inferred that I wanted a framework for SVG editors; the latest version assumes I want a JS framework (I tried several times) until I clarify. It also insists that the framework cannot do editable text until I nudge it in the right direction.

Overall slightly worse but the code generated is still fine.

word embeddings:

early april: https://chat.openai.com/share/f6bde43a-2fce-47dc-b23c-cc5af3...

now: https://chat.openai.com/share/25c2703e-d89d-465c-9808-4df1b3...

In the latest version it imported "from sklearn.preprocessing import normalize" without using it later. It also erroneously uses pytorch_cos_sim, which expects a PyTorch tensor, whereas we're passing in a NumPy array.
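For reference, the pytorch_cos_sim mix-up is easy to sidestep by computing cosine similarity directly on the NumPy arrays (a minimal sketch of the fix, not the code from the chat):

```python
import numpy as np

def cos_sim(a: np.ndarray, b: np.ndarray) -> float:
    # Plain-NumPy cosine similarity: no need for pytorch_cos_sim,
    # which expects torch tensors rather than numpy arrays.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = np.array([1.0, 0.0, 1.0])
v2 = np.array([1.0, 0.0, 1.0])
print(cos_sim(v1, v2))  # identical vectors give similarity ~1.0
```

Alternatively, the arrays could be converted with torch.from_numpy before calling the PyTorch helper; either avoids the type mismatch.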

overall I think the quality has degraded slightly, but not by enough that I would stop using it. Still miles ahead of Bard imo.


👤 opportune
Is it consistently worse or just sometimes/often worse than before? Any extreme power users or GPT-whisperers here? If it’s only noticeably worse X% of the time my bet would be experimentation.

One of my least favorite patterns that tech companies follow is using "experimentation" overzealously or prematurely. Mainly, my problem is that they're not transparent about it, and it creates an inconsistent product experience that just confuses you -- why did this one Zillow listing have this UI order but the similar one I clicked seconds later have a different one? Why did this page load on Reddit get some weirdass font? Because it's an experiment, the bar to launch is low, and you're not gonna find any official blog posts about the changes until it's official. And when it causes serious problems, there's no form to submit and nobody to tell you why; only very rarely would support, other users, or documentation even realize some change comes from an experiment. Over the past few years I've started noticing this everywhere online.

Non-sticky UI experiments are especially bad because at eg 1% of pageloads the signal is going to be measuring users asking themselves wtf is up and temporarily spending more time on page trying to figure out where the data moved. Sticky and/or less noticeable experiments like what this could be have stronger signals but are even more annoying as a user, because there’s no notice that you’re essentially running some jank beta version, and no way to opt back into the default - for you it’s just broken. Especially not cool if you’re a paying customer.

I’m not saying it’s necessarily an experiment, it could be just a regular release or nothing at all. I’d hope if OpenAI was actually reducing the parameter size of their models they’d publicly announce that, but I could totally see them running an experiment measuring how a cheaper, smaller model affects usage and retention without publishing anything, because it’s exactly the kind of “right hand doesn’t know what the left is doing” thing that happens at fancy schmancy tech companies.


👤 sirsinsalot
Given the incoming compute capability from nvidia and the speed of advancement, we have to stop and think ... does it make sense to give access, paid or otherwise, to these models once they reach a certain sophistication?

Or does it make even more sense to hoard the capability to out compete any competitor, of any kind, commercially or politically and hide the true extent of your capability to avoid scrutiny and legislation?

I'm going with the latter. Perhaps now, perhaps in the very near future, the power of these capabilities is novel. Like an information nuclear weapon.

I'd be dialing back the public expectations and deploying the capability in a novel way to exploit it as the largest lever I could.

The more unseen the lever, the longer it keeps working.

I think any other strategy is myopic from a competition perspective. The power of these models isn't direct utility, it is compounded by secrecy because their useful work isn't directly observable as coming from the model.


👤 kleene_op
This is the normal workflow for drug dealers too.

The first fix is free.

The second one will cost you money.

The third one will be laced with fillers and have degraded quality.


👤 okdood64
Yes!

It didn't even try on my question about Jarvis standing desks, a fairly old product that hasn't changed much. Their typical "My knowledge cutoff..." response doesn't even make sense. It screwed up another question I asked it about server uptime and four-9s; Bard got it right.

I've moved back to Bard for the time being. It's way faster as well, and GPT-4's knowledge-cutoff thing is getting old fast.

Example prompt: how much does a 48 x 30 inch jarvis bamboo standing desk setup weigh?

I still have the spec sheet from my order back in 2020. So I can verify the weights.

### GPT-4 ###

As of my knowledge cutoff in September 2021, I don't have the exact weight of a 48 x 30 inch Jarvis Bamboo standing desk setup. However, on average, standing desks can range anywhere from around 60 to 100 pounds or more, depending on the specific configuration (like the choice of frame, desktop material, and any additional features).

Jarvis Bamboo standing desks are made of a bamboo top and an electric lift frame. The weight of the bamboo top for that size should be around 30-40 pounds. The frame itself can weigh anywhere from 50-70 pounds.

For a more accurate weight, I would recommend checking the manufacturer's specifications on their website or contacting their customer support directly for this information.

### Bard ###

The fully assembled Jarvis Bamboo Standing Desk weighs 92 pounds. The desktop itself weighs 38 pounds, and the frame weighs 54 pounds. The desk can hold a maximum weight of 350 pounds.

Here are some of the features of the Jarvis Bamboo Standing Desk:

- Electric height adjustment from 25.5 to 51 inches
- Memory presets for up to 4 different heights
- 3-stage gas spring mechanism
- Bamboo desktop with a scratch-resistant finish
- Black powder-coated steel frame
- 10-year warranty on the frame and 5-year warranty on the desktop


👤 phillipcarter
I don't think it's any worse at all. I think what most people are expressing here is reaching the limits of the technology and realizing that it's not magic.

👤 londons_explore
You can still run the original gpt-4-0314 model (March 14th) on the API playground:

https://platform.openai.com/playground?mode=chat&model=gpt-4...

Costs $0.12 per thousand tokens (~words), and I find even fairly heavy use rarely exceeds a dollar a day.


👤 layer8
Unrelated to AI, this is a general issue with SaaS: you don’t have any guarantee of a stable functionality and feature set, and the software is predisposed to change under your feet in inconvenient ways.

👤 root_axis
Seems about the same to me and I have been using it daily for several months now for code, Spanish to English translations, and random stuff like film recommendations. The quality remains consistent.

I'm personally of the opinion that the observable jump in quality between 3.5 and 4 inflated people's initial assessment of its capabilities and with continued use they are noticing it's not actually the omniscient machine god that many are so inclined to believe.

Either way, these kinds of posts are meaningless without some kind of objective standard to judge the model by, everyone just sees what they want to see. Despite claims of GPT4 being nerfed, I've yet to see anyone actually show evidence of it. There have been dozens of studies done on its capabilities so this is something that can actually be demonstrated empirically if it's true.


👤 allisdust
Yes, it seems to have definitely gone down. Not sure what they've done, but it now struggles even with things it used to have no trouble with. Most likely they are experimenting with reducing the compute per request.

👤 danjc
It also seems to have amped up the qualifier paragraph that's appended for anything deemed contentious. My favourite so far is when I asked it about the name of a video driver from the 80s that would downscale CGA to monochrome:

'As a final note, remember that real color rendering on a monochrome screen is physically impossible, as the monitor itself is not capable of producing colors. The best one can do is different shades of monochrome, possibly with differing intensity.'


👤 bob1029
OpenAI's models feel 100% nerfed to me at this point. I had it solving incredibly complex problems a few months ago (i.e. write a minimal PDF parser example), but today you will get scolded for asking such a complicated task of it.

I think they programmed a classifier layer to detect certain coding tasks and shut it down with canned BS. I like to imagine certain billion/trillion-dollar mega corps had a back-room say regarding things that they would really prefer OpenAI's models not be able to emit. Microsoft is a big stakeholder and they might not want to get sued... Liability could explain a lot of it.

Conspiracy shenanigans aside, I've decided to cancel my "premium" membership and am exploring open/DIY models. It feels like a big dopamine hangover having access to such a potent model and then having it chipped away over a period of months. I am not going through that again.


👤 xrd
Reading all the comments here, seems like being able to run your own models is vital. If not, you are subject to a service where the capabilities are changing underneath you constantly and without notice.

👤 Terretta
It's not just you. Here's a bit of research you can cite:

> GPT-4 from its website and Bubeck et al Mar 2023. Note that the version that Bubeck uses is GPT-4 Early which is supposedly to be more powerful than GPT-4 Launch (OpenAI paid a lot of alignment tax to make GPT-4 safer).

https://github.com/FranxYao/chain-of-thought-hub

Anecdotally, there seemed to be a golden set of weeks in late April to early May that felt like "peak GPT" (GPT-4), followed by heavy topic and knowledge mitigation since, then -- just this week -- adding back some "chain of thought" or "show your work" ("let's go step by step" style) for math. I say anecdotally because I could just be prompting it wrong.


👤 ajani
Yes it is definitely worse. I submitted feedback a few days ago saying exactly what is being said here, that the model responses look like 3.5.

There are also very telling patterns of response that indicate a pre gpt-4 model.

1: All previous models suffered terribly if your chat got too long. After 20 or so responses it would suddenly start to feel less attentive to what is being said and output superficial or incorrect responses.

2: If you stop a chat midway and come back later to continue (after a refresh or a different chat interaction), it would often respond with code or suggestions that have nothing to do whatsoever with your prompt.

Both these patterns are sometimes evident in the current model. Likely then, there is some clamping down on its capabilities.

My suspicion is that this probably relates to computing resources. The 25-message cap must mean it's difficult to scale its performance, and the only way to do so is to simplify the model activations with heuristics -- perhaps analyzing and preprocessing the input to see how much of the model needs to be used (partial model use can be architected).

This seems to be the simplest explanation of observed state and behaviour.


👤 FredPret
Maybe you annoyed it? I'm super nice to it and it performs better than ever.

I ask it everything from lay-of-the-land questions about technical problems that are new to me, to detailed coding problems that I understand well but don't want to figure out.

The best though, is helping me navigate complicated UI’s. I tell it how I want some complicated software / website to behave, and it’ll tell me the arcane menu path to follow.

It’s funny how computing might soon include elements of psychology and magic incantations nobody understands.


👤 jlmorton
Definitely nerfed. Concomitantly, the performance increased substantially, and it now feels much, much quicker (maybe 10x even?), but the quality has decreased quite a bit.

👤 Kalabint
I noticed that it tries to forward the user to external sources more (answering the query, and then "For further info, just ask an expert"), or tries to get the user to do the work ("Here is a nice overview of the program, now you do the rest of the coding").

If I don't want the RLHF to get in my way, I switch over to the API (sadly not the 4.0 one).

I also noticed a decline in following instructions. I have a primer I preseed my chats with.

The primer ends with "Do you understand? [Y|N]". ChatGPT 3.5 usually answered with a summary; ChatGPT 4.0 in the beginning just wrote "Y".

Now it behaves like 3.5, answering with a summary instead of a "Y". I adjusted the prompt to "Confirm instructions with a short and precise 'Ok'.", which seems to work.

Used Primer: https://github.com/Kalabint/ChatGPT-Primer


👤 lwansbrough
Yeah 100%. It's much faster now, and I am almost certain they haven't made that much of an improvement in efficiency nor have they scaled it up to be that fast, if that's even how it works.

I think I read when it was released it was 32K tokens, then quickly scaled back to 8K tokens. I'm guessing now that they've further reduced it. Maybe it's 6000 tokens vs. GPT-3.5's 4K? I don't know. But it's certainly noticeably worse at every task I give it.


👤 robinduckett
Yep. Whereas before it would generate a cohesive whole class in TypeScript from a set of instructions, now it gives me the skeleton of the class with "// fill out the rest of the class here". Worse than GPT-3.5. They're going to lose subscriptions.

👤 jacquesm
You'd think that GPT-X would guarantee some kind of continuity for that particular version so that you can rely on what it does once you've tested it. Having this kind of moving target won't help OpenAI to instill confidence in its product.

👤 krajzeg
Do we have a good, objective benchmark set of prompts in existence somewhere? If not, I think having one would really help with tracking changes like that.

I'm always skeptical of subjective feelings of tough-to-quantify things getting worse or better, especially where there is as much hype as for the various AI models.

One explanation for the feelings is the model really getting significantly worse over time. Another is the hype wearing off as you get more used to the shiny new thing and become more critical of its shortcomings.


👤 cubefox
If it really got faster and worse at the same time, the most likely reason is obvious: They used methods to shrink the model down, in order to lower inference cost. Side effect is higher speed and some loss in quality.

👤 atleastoptimal
It's untenable to simply trust LLM API providers that the model they are serving through an API endpoint is the model they claim it is. They could easily switch the model for a cheaper one whenever they wanted, and since LLM outputs are non-deterministic (absent a fixed random seed), it would be impossible to prove.

LLMs integrated into any real product require a model hash; otherwise the provider of the model has full control over any sort of deception.
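A weight-level fingerprint along these lines would be trivial for a provider to publish (a hypothetical sketch I'm adding for illustration -- no current API exposes anything like `model_fingerprint`, and the file-based scheme is my own assumption):

```python
import hashlib

def model_fingerprint(weights_path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 over a raw weights file, read in chunks.

    If providers published this digest for the checkpoint behind an
    endpoint, clients could at least verify that the deployed weights
    haven't been silently swapped for a cheaper model.
    """
    h = hashlib.sha256()
    with open(weights_path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

It wouldn't catch changes to sampling parameters or system prompts, only to the weights themselves, but that's exactly the "quiet model swap" scenario the comment worries about.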


👤 thyrox
Yes I've noticed the downgrade too.

If it's a scaling problem, they should just rename it to GPT-4+ or something and raise prices rather than degrade the experience for everyone. I'm sure a lot of people would happily pay more to get the original quality back instead of this watered-down version.


👤 mckirk
On Saturday it produced grammatically incorrect German text (when prompted in German), which it had certainly never done before. It was quite concerning to see.

👤 arthurofbabylon
Reading the comments in this thread, with the rightful distrust of OpenAI and criticism of the model, it occurs to me that the underlying problem we’re facing here comes down to stakeholders and incentive structures.

AI will not be a technical problem (nor a solution!); rather, our civilization will continue to be bottlenecked by problems of culture. OpenAI will succeed or fail for cultural reasons, not technical ones. Humanity will benefit from or be harmed by AI for cultural reasons, not technical ones.

I don’t have any answers here. I do however get the continuous impression that ai has not shifted the ground under our feet. The bottlenecks and underlying paradigms remain basically the same.


👤 skilled
It’s possible they are trying out a shaved/turbo version so that they can start removing the limits. I mean as it is - 25 messages every 3 hours is useless, particularly for browsing and plugins.

👤 airbreather
It is far less willing to provide code; I get better and faster results out of 3.5.

I am at the point where 4.0 is basically not worth using as a single entity, but it seems that using the API and generating some combative/consultative agents yields interesting results, though not super fast.

Check this out if you have not seen it already : "AutoGPT Test and My AI Agents Effortless Programming - INSANE Progress!"

https://www.youtube.com/watch?v=L6tU0bnMsh8


👤 hugg
Yes, I noticed this too. I fed it some HTML with Tailwind classes and told it to just list all the Tailwind classes we use and then the CSS behind those classes. It just hallucinated all(!) the items in the list (and gave me a list of 10 seemingly random classes). And when I then asked something else about the code, it had forgotten I had ever pasted anything in the conversation. Very weird.

👤 captainbland
It's interesting to me that previously the consensus was "the only way is up" with respect to generative models. Now we're seeing performance degradation with GPT, possibly indicating a peak, or at least a local maximum.

I hope you're all ready with your pumpkin spiced lattes and turkeys because it seems like we're in AI autumn and we all know what comes after that.


👤 ukuina
I would recommend access over the API, using an interface like TypingMind that gives you control over the system prompt for consistency.

👤 throwawayadvsec
I think it's happening because of the extreme content filtering in place; at some point it was even refusing to generate some code because it thought writing it went against its guidelines.

👤 windex
This is why having open-sourced models is important. This is also why a lot of the lobbying for regulation is happening. Imagine this: the plebs get the neutered AI, and the people at the top get raw OpenAI GPT-4+.

👤 gtm1260
I find that the GPT-4 in the playground is as good as before, but I have noticed the issues when using the ChatGPT client.

👤 napier
No, not just you. It’s frequently abysmal nowadays. Tragic. And no version change-log or any other info from OAI about what they’ve done to it (probably GPU sparing optimized distillation + overly aggressive PR safety satisficing RLHF) and an advertised weeks long support query lag (I’ve yet to receive a response).

👤 schappim
+1. I was doing active dev against it and saw really stupid responses. For example, "transform this unstructured text to valid JSON with no keys with empty strings" would return the keys w/ empty strings for some results.

👤 gwoolhurme
I see these posts pop up every now and then. I admittedly don't use GPT-4 or ChatGPT that often, but I don't notice that much of a difference. Is it possible you're giving it harder and harder tasks and it's failing at those rather than the easier tasks it solved when you used it before? Is it possible it's just scaled back due to overuse? That could be a dumb question. In my experience, even a few weeks ago, for Swift and Kotlin I found the outputs of ChatGPT and GPT-4 comparable and sometimes useless without a good amount of human intervention.

👤 junon
I posted this exact question here a while back about 3.5. OpenAI keeps neutering ChatGPT to find the lowest common denominator of what they think is "acceptable" results, which makes the entire model suffer.

👤 throwaway1249
It's not just you. Probably they added some sort of classifier at the beginning to decide whether to send a query to 3.5 or 4. In my (very opinionated, undocumented, and mostly unscientific) opinion, more complex queries generally hit the old model, with the slow chugging of tokens. For example, I just asked it to refactor a very horrible POC in Python that was creeping into 200 LoC and it did the job wonderfully. The prompt was:

`Can you refactor this function to make it: * More readable * Split on different parts * Easy to test

Consider the use of generators and other strategies. (code here)`


👤 boringuser2
I have been using GPT-4 very extensively for the duration of its release.

I am concerned that I can't determine a natural process from a manufactured one.

To clarify, I have become increasingly less impressed with GPT-4.

Is this a natural process? Is it getting worse?

I personally lean towards the hypothesis that it is getting worse as they scale back the resource burn, but I can't know for certain.

As a developer, it still has value to me. GPT 3.5 has value. But, whereas initially it actually made me afraid for my job, now it really is a bit tedious to extract that value.


👤 CyrsBel
I believe they introduced a sort of rate limiting where the expectation will be that the user is doing more thinking and due diligence and asking more precise questions, so that they're not just attempting to get a lot of hand holding with very broad prompts when they can otherwise think about what code they're being given back from more specific and structured questioning. This is useful because it will preserve the value proposition of AI while avoiding a machine just being milked for code by people who want to just throw code at the wall and see what sticks. Milking a model for code is fun in some ways but it won't scale from a cost-to-run point of view, and it also will incentivize no longer thinking deeply and critically about what is being built.

If you are incorporating machine models into your development process very heavily, make sure you're still building deep knowledge and deep context of your codebase. Otherwise you'll atrophy in that department because you falsely believe the machine maintains the context for you.

That isn't feasible right now because the cost of maintaining huge context per user to enable milking rather than enhanced-thinking is too high.

Also consider that we don't want to enable milking. People who have no idea what they're doing will just throw a lot of code at the wall and then when there's a huge mess they'll be asking engineers to fix it. We need to be careful with who has this kind of access on our teams in general. Someone who is non-technical should not be given a firecracker before they've ever even just turned on a microwave or a stove.


👤 ammar_x
Like many people here have noticed, the quality is definitely lower now than before. To be honest, it's annoying that they reduced the quality significantly, without notice, while we are paying the same amount. I'm willing to pay $40 for the original GPT-4, though.

GPT-4 was remarkable 2 months ago. It could handle complex coding tasks and the reasoning was great. Now, it feels like GPT-3. It's sad. I had many things in mind that could have been done with the original GPT-4.

I hope we see real competitors to GPT-4 in terms of coding abilities and reasoning. The absence of real competitors made it a reasonable option for "Open"AI to lobotomize GPT-4 without notice.


👤 dormento
I think it is a combination of issues (from least likely to most likely):

- They might've not been prepared for the growth;

- They were prepared, but decided to focus on the 95% of "regular people with easy questions" who treat it as a curiosity, instead of the fewer people with difficult questions. Most regular people have no idea who OpenAI is, what an LLM is, or how a GPT works, but they know "ChatGPT, the brand." Since it became a household name so quickly, it's far better for them that the AI is just a little underwhelming sometimes than for it to be unable to serve so many people.

- The model was trained on a staggering amount of content, including fringe and unmoderated content. Imagine you asked a question about WW2 and, being trained, let's say, on 4chan, the model responds with a very charitable bias about the solid reasoning behind the Reich's actions at the time... It does not look good for investors or the company, and it attracts scrutiny. Even more innocuous themes are enough to invite all kinds of bad-faith debate, radical criticism and whatnot... and the darling of the "AI revolution" certainly does not need (or want) coverage outside of their roadmap wishes.


👤 imchillyb
You and I are not the 'end user' of this software. You and I are customers of OpenAI's customers.

The companies implementing ChatGPT are going to restrict -our- access to the APIs and the most relevant conversations.

It is in this manner that both OpenAI and the companies it's selling its technologies to will benefit. In this manner OpenAI profits from both sides. You and I are left out, though, unless we want to pay much more than we're paying now.


👤 fwungy
I recently asked Bing chat to write a Rust function for me. It used to do that well.

This time it wrote the function. I could see it for a second, and then it changed to a message something like "I can't talk about this right now." I don't remember exactly, but the interesting thing was that it flashed an answer and then withdrew it.

Its Rust abilities seem to have deteriorated over the past couple of weeks.


👤 nailer
dang: a heads up that phind.com is running an Astroturf campaign on Hacker News. They had a previous article with a huge amount of suspicious behaviour last week. https://news.ycombinator.com/item?id=36027302

This is why the headline for this article says GPT4 but the body is focused on mentioning phind.com.


👤 jonathan-kosgei
I actually considered starting a similar thread because I noticed this as well! Lately, it feels like GPT-4 is trying to get out of doing my work for me :D.

I've also started to notice it's been making a lot of typos, e.g. yesterday, while converting Kelvin to Celsius, it incorrectly stated that 0C is -274.15K, despite having given the correct value of -273.15K in the previous sentence.
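For reference, the conversion in question is a one-liner; a minimal Python sketch (my addition, not from the thread) that makes the model's off-by-one-degree slip easy to spot:

```python
def celsius_to_kelvin(c: float) -> float:
    """Convert degrees Celsius to kelvin; absolute zero is -273.15 °C."""
    return c + 273.15

def kelvin_to_celsius(k: float) -> float:
    """Convert kelvin to degrees Celsius."""
    return k - 273.15

# 0 °C is 273.15 K, and 0 K (absolute zero) is -273.15 °C;
# the "-274.15" answer is off by exactly one degree.
print(celsius_to_kelvin(0))   # 273.15
print(kelvin_to_celsius(0))   # -273.15
```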


👤 julianh65
I've used it a lot for making tweaks to React components, and one change I've noticed is that when I used to paste in entire component files and ask for modifications, it would reply with the entire file with tweaks and edits. Now it seems to reply with only the tweaked parts and comments in this form:

// Your previous code here

foo(bar) // added foo() call here to do xyz

If I were to speculate, I would say this reduces the amount of work it has to do, since it generates less content and the replies are shorter, but I feel like there's a slight performance loss. I'm not sure exactly why, but I could see it being the case that generating the entire file with specific line edits allows for better predictions on the code/file versus trying to reply with only the changes needed. I wonder if this is a tweak in the prompt or if the model itself is different.


👤 airgapstopgap
It's time to design a public benchmark for these types of systems to compare between versions. Of course, any vendor who trains on the benchmark should face extreme contempt, but we'd also need to generate novel questions of equal complexity.

Alternatively, there should be a trusted auditor who uses a secret benchmark.
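As a rough illustration of what such a version-to-version benchmark could look like, here is a hedged Python sketch; `ask_model` is a hypothetical placeholder for whatever API the auditor would call, and the two canned questions stand in for the "novel questions of equal complexity" mentioned above:

```python
# Minimal sketch of a version-to-version regression benchmark.
# `ask_model` is a hypothetical stand-in for a real model API call;
# here it returns canned placeholder responses for demonstration.
def ask_model(model_version: str, prompt: str) -> str:
    canned = {"What is 12 * 13?": "156"}
    return canned.get(prompt, "I'm just an LLM")

def score(model_version: str, benchmark: list[tuple[str, str]]) -> float:
    """Fraction of benchmark prompts whose expected answer appears in the reply."""
    hits = sum(
        1 for prompt, expected in benchmark
        if expected in ask_model(model_version, prompt)
    )
    return hits / len(benchmark)

benchmark = [
    ("What is 12 * 13?", "156"),
    ("Name the chemical symbol for gold.", "Au"),
]

# Track the score per dated snapshot to detect silent regressions over time.
for version in ("gpt-4-0314", "gpt-4-0613"):
    print(version, score(version, benchmark))
```

The weak point the commenter identifies remains: with a public question set, a vendor can train on the benchmark, so the prompt list would need to be refreshed or kept secret by a trusted auditor.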


👤 rushingcreek
Phind co-founder here. The way we deployed GPT-4 previously was costing thousands of dollars per day and not sustainable. We’re bringing back a dedicated GPT-4 mode for those with accounts this week. And our goal is for the Phind model to be better than GPT-4 for technical questions.

👤 ChatGPT4BAD
I agree. The quality got MUCH worse. I'm very disappointed and will probably cancel my subscription. For my use case it's become nearly useless...

👤 TradingPlaces
Inference is far more expensive on GPT-4. My take has been the same and I think it’s a cost-saving move. The responses are shorter and less complete than they were just a few weeks ago.

Right now, I don’t think it’s possible to scale a service really big based on GPT-4 because of cost.


👤 xwdv
This is why you can’t rely on GPT entirely for coding, or at all. Imagine you were a company that abandoned all hiring of software engineers and instead used prompt engineers to develop code; then one day the AI just becomes incompetent and the quality of its output deteriorates. Your prompt engineers can no longer get the AI to fix things or develop new code. Your entire company is snuffed out, overnight. You might try to hire true software engineers rapidly, but they’ve become very expensive and hard to hire. Too much demand, too little supply.

You’re screwed, the AI was your most important tool and it’s broken and there’s nothing you can do about it, it’s a black box, you don’t control it.


👤 Jorropo
I went through my history, for one code example I copy pasted the old prompt into new GPT4.

It was about writing a CGO wrapper given an hpp header for a lib I have. Back then it used to give me almost correct code: it understood it had to write a C++-to-C FFI layer using extern "C" first, because Golang's CGO FFI only supports C, not C++. And then it generated a correct CGO wrapper with a Go-looking type that made sense. The only wrong thing is that it didn't understand it had to call the initialization function of my C++ lib at init time; instead it called it in the New factory function (which would segfault when you build more than one object), a trivial fix for the human in the loop: move it to func init(). TL;DR: back then, almost perfect.

Now, with the exact same prompt, it doesn't even generate code; it just gives me a list of tasks I could do to achieve this, plus vague statements. If I change the prompt, insisting on code, it instead gives me a very dumb 1-to-1 mapping of the C++ into Go, trying to create constructor and destructor functions, and it tries to use the CGO FFI to call C++ directly (even though, again, only C is supported by Golang).


👤 jimsimmons
OpenAI dev confirms that API models have not changed:

https://twitter.com/OfficialLoganK/status/166393494793189785...


👤 askin4it
It's not just nerfed on human relationships, it won't help you avoid ad exposure without a warning that many ads are quite good for you and exposure would be a net positive even if you didn't want it.

I guess it knows what's best for me.


👤 adoxyz
Same experience here. It has consistently been getting worse and worse, and I'm seeing more and more "I'm just an LLM" type responses to queries that it used to give very good answers to just a few weeks and months ago.

👤 anovikov
Naturally, people feed it shit in order to bend results for others for their gain. Everyone who was into crypto manipulation has now rushed into this field, and they are extremely smart and extremely ruthless people; they have no legal limits, being anonymous, and almost unlimited funding made in crypto. It reduces result quality, and it also invokes countermeasures to limit the damage, which further reduces quality. Same road as Google has taken: remember how it was an almost magic tool that could find an answer to any question, 15 years ago, before SEO became a thing that promised easy money to anyone smartass enough.

👤 pk-protect-ai
Phind.com uses Bing search again. This has decreased the quality of results significantly. On the other hand, GPT-4 can use Bing now too. I tried GPT-4 with Bing several times and it was so bad in comparison to plain GPT-4, and much worse than phind.com. Btw, you can force GPT-4 on phind.com if you use the regenerate icon; I usually end up stopping inference and regenerating with GPT-4. In any case, the quality of code generation, and the model's capabilities in general, seem to have deteriorated. However, I can't back that up with numbers; it just looks different and more simplistic.

👤 CamperBob2
It absolutely has. I used to be able to ask it questions like "Who were some of the most helpful users on subject whatever at forum wherever?" and get solid responses, but now it explicitly denies knowledge of any online resources in any timeframe. That is not OK.

This fire must be brought down the mountain. The power must eventually be taken out of the hands of a few self-interested gatekeepers. They know it too, hence Sam Altman's campaign to capture Congress. Obama's "You didn't build that" never applied as profoundly as it does with OpenAI (sic).


👤 sebi42
They made ChatGPT relatively useless even for premium users. I've had the same experience: it did a very good job on many things shortly after release, but now the answers seem very flat. But I can certainly see why, considering how many people are flooding their APIs with requests. I just hope that in 10 years we'll have mobile hardware capable of running GPT-4-sized models efficiently.

Edit: They definitely didn't make it useless.. It's still a very impressive technical achievement, since it can even browse the web and run code for you by now.


👤 raydev
A lot of people in this thread are making fuzzy claims. We can only evaluate if there are clear examples and test results.

Is anyone taking screenshots or sharing their chat logs, where they run the same questions over some time period?


👤 politician
This is inevitable. Elsewhere, I've argued that the most likely response is to replicate medieval European guilds -- for isolated communities to train their own LLMs on their own proprietary or confidential texts, and use them only through internal tooling.

Proprietary LLMs that optimize for performance will outcompete public LLMs that optimize for political correctness or social acceptance, as explained by another commenter [1].

[1] https://news.ycombinator.com/item?id=36135842


👤 kaetemi
Yup. Lately I've been getting too many gigantic, generic listicles in response, instead of it being able to hold a normal conversation. It's like chatting with a clickbait article at this point.

👤 goncalo-r
My assumption is that they are trying to "solve" the hallucination problem by only giving you answers when it's more certain about itself and telling you to search for answers online.

👤 Tibicen
Why is the 'problematic' data in the training set, anyways? Why train a model on it if it's an issue?

👤 rajnathani
It definitely has for us. It is often providing "skeleton code" (in quotes because that's what it correctly calls it itself) instead of actual code now, which wasn't the case before.

👤 goodgoblin
I have noticed the same kind of degradation -- pair-programming with GPT4 used to feel like I was working with one of the smartest people in the world, and now it's really uneven.

👤 bigcloud1299
I used GPT for preparing for some interviews. I kid you not, 3.5, the free version, produced some of the best questions and answers. I made it to the finals and decided not to pursue the opportunity. Recently I started to pay for ChatGPT, which gave me access to 4.0. I tried prompts similar to the ones I had used to generate info and questions for my interview, and boy, I agree: 4.0 was a mess. I actually only use 3.5.

👤 hugh-avherald
This was my final attempt when I asked for ASCII art of a person standing on a tall bed.

I apologize for the confusion. Here's another attempt at creating an ASCII art representation of a person standing on a tall bed:

        __
       |  |
       |  |
       |  |
       |  |
       |  |
       |  |
       |  |
       |  |
       |  |
       |  |
       |  |
       |  |
       |  |
       |  |
       |  |
       |  |
  _____|__|_____
  Bed         Person

👤 nullbio
It honestly feels like they've neutered it in a big way. They say nothing has changed, yet the responses come faster and are far less in depth and insightful than before. It's obviously changed...

👤 kmod
fwiw OpenAI uses different models for ChatGPT GPT-4 and API GPT-4 (the latest one, not talking about the pinned 0314 one). In the past I noticed the API model was newer than the ChatGPT model, but in general it seems like they are willing to make different tradeoffs between the two https://twitter.com/kevmod/status/1643993097679020037

I wouldn't be surprised if they are trying to reduce the cost of the ChatGPT GPT-4 model since if you use it heavily they will be losing money on you. They could also be trying to increase the number of people they can serve with the existing amount of compute that they have available.

In my anecdotal experience I noticed that ChatGPT GPT-4 had gotten far faster recently which is consistent with a theory that they are trying to cost-optimize the model, though today it is back to the normal speed. I've also had some frustrating interactions with GPT-4 recently similar to what people are saying, but overall I think the prior is pretty strong that we are seeing normal statistical variation.


👤 v4dok
It became faster and worse. My definitive proof is its ability to generate Greek content. The early API generated good, or at least passable, content; it was really on the edge of "this is good". Now it's complete garbage: it makes up words and even fails at basic translation, doing it literally. I think it's less the RLHF and more the effort to scale it and make it faster.

👤 code_lettuce
What are some examples of a nerfed response? I just asked GPT-4 to help me write a Python program to analyze the sentiment and determine if biases are present in mathematical research papers in PDF format.

Sure, it needs some love and there were some abstractions. For instance it assumed we had a labeled dataset for the text and the associated sentiment, but beyond that it worked fine.


👤 rvz
This tells me that we are now at the ‘peak of inflated expectations’ of the hype cycle.

Now the AI bros have realized that this intelligent sophist is hallucinating very badly and has deteriorated in quality. As with all black-box AI models, the reasons are unknown.

This is why it is important to have explainable AI systems and not black-box SaaS based AI snake oil like this one.

AI is going just great! /s


👤 tombert
It definitely seems to be getting worse with Clojure.

I tried some stuff yesterday, and it was making pretty rookie mistakes (misaligning parentheses, using `recur` in the middle of the function instead of the tail). It also was decidedly bad at catching my mistakes when I pasted code.

I sadly don't have a recording of this but I feel like a month ago it was better at both these things.


👤 AtNightWeCode
I have not seen any decrease in speed with ChatGPT-4. It has become more stupid though. It gives results that you even ask it not to give. It has some sort of amnesia.

Even if there are other services that in the short term can beat ChatGPT with this specific LLM it is obvious that they will eventually hit the same limits.


👤 arthurcolle
Do you have an example question/answer?

👤 AtNightWeCode
I have cancelled my sub. What I found is that the extra cognitive load these services come with simply does not pay off.

GitHub Copilot is even worse. I am going to check my estimates against the code I wrote during the last 6 months. I am pretty sure Copilot, from a holistic point of view, has slowed down my pace.


👤 Yenrabbit
I noticed an apparent shift recently (for the worse) using Bing in creative mode, which is also supposed to be GPT4. Shorter answers, much more work to get it to output code, and maybe more bugs in the code it does produce... It's funny, I really did feel like I'd lost something when I noticed it!

👤 dontupvoteme
I swear I can tell it gets notably worse around 15:00 every day as people in America get on and start using it.

👤 porkbeer
It is worse than it was in April by leaps and bounds. Pretty sure they nerfed it at the request of... ?

👤 SubiculumCode
I thought it was just me that had this impression. I used to be able to work with 4.0 to get it to iterate through some rewriting (I write text, it rewrites for clarity, since I tend to write densely), but it keeps losing the important nuances that I want it to keep.

👤 nerder92
I had the same feeling with GPT-3.5 yesterday. I asked whether, in order to calculate ARPA, you need to consider the free tier, and it came out with some gibberish about the fact that it doesn't know anything post-2021 about the Advanced Research Projects Agency.

👤 nailer
I had a very strange experience yesterday where I asked about git authentication, explicitly telling Bing that I was asking about git itself rather than GitHub, that I was not using GitHub, and not to include results referencing GitHub. Bing did not understand.

👤 andrewstuart
I only use GPT-3 as a coding assistant.

If my introduction to ChatGPT had been GPT-4 I would have not been anywhere near as impressed.

GPT-4 often refuses to do what it is asked, whereas GPT-3 just happily writes the code.

If they retire GPT-3 then I'll be looking for other options.

GPT-4 just isn't the same thing.


👤 AviationAtom
Have you tried Claude, from Anthropic? I've found it to be far more useful than GPT-4.

👤 hxugufjfjf
No issues here. I write a ton of Rust every day with GPT and it just keeps getting better.

👤 PeterStuer
Yes, I feel the same. Things it used to get right out of the gate now take 2-3 iterations.

👤 jonnycomputer
I haven't used 4, just 3.5 on the free tier. The only change I've noticed is that it is significantly slower, which leads me to suspect that the degraded quality could just have to do with tuning their product to use less compute.

👤 wrp
I assume there are enough potential customers willing to pay the cost and agree to whatever liability waivers for some company to eventually offer a non-lobotomized equivalent to GPT-4. Where should we be watching for that to happen?

👤 achrono
Yes! I kept wondering why but allisdust's compute reason makes the most sense.

👤 WUMBOWUMBO
Is there a potential short term solution that can utilize p2p networking to train and run an open GPT instance openly? (Until we reach a point to run larger, quality networks efficiently with simpler resources)

👤 nathan_gold
Based on this following article and my experience, I think there is something here.

https://humanloop.com/blog/openai-plans


👤 bitcurious
For a while, if you asked the iPhone version what it was it claimed to be GPT3.0. Not sure if it still is that, but I noticed the iPhone version was a bit worse. Maybe they rolled that out more broadly?

👤 uncletom
The Great Divide is happening. Epstein Island and others may be receiving the top-of-the-line versions, along with undisclosed biotech. Food for thought.

👤 cfcf14
Yeah, definitely. Combination of expert-system gating (some requests probably get routed to weaker models), distillation (for performance/cost), and RLHF lobotomization.

👤 martindbp
Yes, I've used it for converting some algorithm code (< 100 lines) from python to js, which has worked great before, but now it contained several bugs and omissions.

👤 lilulo
If it learns on our conversations, maybe we get what we deserve?

👤 say_it_as_it_is
It's working according to design! OpenAI now has far better control over the output, so we don't have to worry about AI taking jobs or destroying humanity any time soon.

/s


👤 notShabu
I noticed this as well, either the new liability training has nerfed the model or they're marketing GPT-3.5++ as GPT-4 due to cost or service uptime concerns.

👤 NightOwlCoder
Same for me; I'm seriously thinking about canceling my premium membership. Well, it was awesome, but now I feel it’s worse than me in the morning!

👤 topicseed
Yep, it's struggling for me with simple problems like splitting full names and so on. A few weeks ago it was bang on, and the task is literally the same.

👤 dopidopHN
I wanted it to sort my Spotify playlist chronologically… now it says it can’t access live websites. WTF? I used it to summarize a random page just last week or so.

👤 savy91
I could say the same for GPT-3.5. Through the API I am getting many more "sorry, but as a language model" responses with the prompts I have been using for a while.

👤 perceptronas
It's getting worse and worse. The gap between GPT-4 and GPT-3.5 is becoming narrower. This is making me reconsider whether OpenAI's paid plan is worth paying for.

👤 whazor
I would not be surprised if they, over the months, kept adding more and more 'safety' features and prompts because of incidents that happened.

👤 xhdix
Because there is more censorship every day, there is more stupidity in the answers. Hence, they become tools for changing reality. And soon 2+2=5.

👤 hoschicz
Absolutely. I am guessing they quantized the model (run it not in 32-bit but say 8-bit, saves resources). Just like they did with 3.5-turbo.
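Whether or not this is what OpenAI actually did (their serving stack is not public), the trade-off the commenter describes is easy to illustrate; a toy Python sketch of symmetric 8-bit quantization, not any vendor's real scheme:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor 8-bit quantization: 4x smaller than float32."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float32 weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(256).astype(np.float32)  # stand-in for a weight tensor
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()

print(q.nbytes, w.nbytes)  # 256 vs 1024 bytes: a 4x memory saving
print(err <= scale / 2)    # rounding error is bounded by half a quantization step
```

The memory (and bandwidth) saving is real, but every weight picks up a small rounding error, which is one plausible mechanism for subtle quality loss.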

👤 roflyear
Yes it has. I called this a month ago and got flamed for it.

They did something to reduce the server load that impacted the quality in some subtle way.


👤 fsniper
I've noticed the same with the GPT-3.5 model too. After the last update it started giving shorter and less coherent answers.

👤 sbierwagen
Where does phind.com say that they use GPT-4?

👤 rirze
I wonder if this coincides with the open letter talking about how AI is an existential threat to civilization...

👤 SanderNL
Yes and I have to say I am using it less than I used to.

My gut feeling is they have to be careful now and don’t slack, because I have been using generative AI for a while now and I am not seeing the major problems being tackled. I also see a distinct lack of novel problems being solved. It’s just one website and/or marketing copy generator after another.

AI is awesome, but I am kind of on the fence if this generation is going to be actually useful.


👤 29athrowaway
Maybe they are trying to optimize inference times to achieve higher scalability at the expense of precision.

👤 bob_theslob646
I wonder if there'll be an official explanation.

Quite bizarre. It's good to know that it's not a conspiracy theory


👤 atty
It does seem much faster and significantly worse to me - but I haven’t set any repeatable benchmark queries for myself, I suppose I could be imagining it. I hope they bring back the slow version. I don’t care if they have to cut the limit down to 20 messages every 3 hours, using that slow version of GPT4 it felt like I had a very competent coworker. Now it’s feeling more like a sometimes useful chatbot. Real shame.

👤 qumeric
I have been using GPT-4 from release day. I haven't noticed performance degradation at all. YMMV.

👤 drclau
I see a lot of complaints regarding ChatGPT 4's performance in coding tasks. My hypothesis is that Microsoft wants to launch Copilot X based on GPT-4 [0], and they can't have OpenAI's ChatGPT 4 as a strong competitor.

[0]: https://github.com/features/preview/copilot-x


👤 zabi_rauf
Yes, definitely faster now but crappier.

👤 andrewstuart
It’s so annoying when GPT4 refuses to do what it’s asked.

GPT-3.5 immediately does as it's requested, and much faster too.


👤 plasma
I just ran a test, mobile app under model 4 says it’s model 3, but web model 4 says it’s model 4.

👤 tomcar288
I was playing word games with it, got some funny results, and decided to debug it. I asked it at which letter position the letter 'o' appears in the word "hockey", and it literally told me the word "Hockey" does not contain the letter 'o'.
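This is a known failure mode: the models operate on tokens rather than individual characters, so character-level questions often trip them up, even though the check itself is trivial in Python:

```python
word = "Hockey"

# Case-insensitive search, reported 1-based the way a person would count.
pos = word.lower().find("o") + 1
print(pos)  # 2: 'o' is the second letter of "Hockey"
```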

👤 incomingpain
I have anecdotally felt the same.

My guess has been they are trying to censor misuses? Prevent weaponization.


👤 flangola7
Not at all. Regular 3.5 is obviously a smaller model but 4 is still doing wizard work for me.

👤 jwlake
You can use google's text-bison to get a pretty decent LLM without RLHF.

👤 spaceman_2020
This is a pretty common topic now on the GPTPro subreddit.

I feel the same way. It feels…lazy now.


👤 theusus
Honestly, it grew better for me. It's more reasonable and focused nowadays.

👤 pknerd
Thank God I did not buy the GPT-4 subscription. GPT-3.5 is serving the purpose.

👤 SpaceManNabs
Probably getting ready to lock them behind a higher price tier.

👤 jumpCastle
If you apply for the api you can use the time-stamped original version.

👤 OOPMan
Ah, more non-deterministic computing, just what the doctor ordered!

👤 entropicgravity
It's you, who has significantly deteriorated lately.

Sincerely, ChatGPT


👤 jiggawatts
ChatGPT's GPT-4 has ongoing training, such as Reinforcement Learning from Human Feedback (RLHF), to tune it to provide "better" responses, "safer" answers, and to generally obey the system prompts. There's a release every few weeks. Yes, I've noticed too that recently it has become very "cagey", qualifying everything to death with "As an AI model...".

A paper[1] that took snapshots monthly mentioned that as the initial bulk self-supervised learning went on, the model became smarter, as expected. However, once the "clicker training" was imposed on top to make it behave, its peak capabilities were reduced. I'm not sure if it's in the paper or the associated video, but the author mentioned that the original unrestricted model would provide probability estimates using percentages, and it was a very accurate predictor. The later versions that were adjusted based on human expectations used wishy-washy words like "likely" or "unlikely", and its estimation accuracy dropped significantly.[3]

At Build 2023, Andrej Karpathy outlined[2] how they trained GPT 4. Essentially, the raw model during training builds its own intelligence. Then there's three stages of "tuning" to make it behave, and all three are based on human input. Essentially, they had contractors provide samples of "ideal" output. Similarly, end-users could up-vote or down-vote responses, which also got fed in.

My personal theory is that the raw models can get about as intelligent as the average of the consistent and coherent parts of the Internet. Think about how many people are wrong, but often obviously so. Flat Earth, homeopathy, etc... If the model gains the ability to filter that stuff out, or "skip over the cracks" to distil out the general collected wisdom of the human race, then it can become more intelligent in some sense than the average human.

If the training is done with thousands of $15/hr contractors, then the model will slew back towards... the average human, or even slightly below average. There's a selection bias there: geniuses won't be doing menial labour for that kind of money.

The percentages thing was what made me realise this. When I talk to highly intelligent people, I use percentages to estimate probabilities. When I talk to typical people in a normal workplace setting, I dumb it down a bit and avoid using numbers. I've noticed that average people don't like percentages and it confuses and even angers them. The clicker training makes the GPT model appeal to average people. That's not the same as being smart. All too often, smart people upset average people.

[1] "Sparks of Artificial General Intelligence: Early experiments with GPT-4" https://arxiv.org/abs/2303.12712

[2] "State of GPT | BRK216HFS" https://www.youtube.com/watch?v=bZQun8Y4L2A&list=LL&index=6

[3] The author also mentioned that the model was almost "evil", for the want of a better word. Capable of emulating the worst of 4chan or similar dark corners of the web's filthy underbelly. The HORRIFYING corollary here is that the temptation will always be there to use smarter-but-less-restrained models where accuracy matters. Sure, the model might be sexist, but a Bayesian estimator of sexist behaviour will only predict accurately if it too is sexist. Evil and accurate or woke and dumb. Apparently, we can choose.


👤 TheDudeMan
This makes sense, since they are GPU-limited.

👤 PaulHoule
Are they trying to lower the running cost?

👤 bbstats
Seems fine still. I use Poe.

👤 ineedausername
It burned out from too much stress, it probably needs a career break now to travel and some therapy.

👤 synergyS
My impression as well :(

👤 EvgeniyZh
Just use GPT-4 via api?

👤 Kalanos
Probably to push people to Copilot.

👤 ravenstine
Yes, GPT-4 has become very stupid recently. It's a shame because consulting with it became a normal part of my workflow. Now it's identifying issues in code that aren't actual problems at all. For instance, it's telling me that my use of `await` in an `async` method is inappropriate. WTF??? I'm obviously awaiting an async operation before setting a state based on the need for that operation to succeed. I'm pretty certain it wasn't this brain dead a few weeks back.

EDIT: GPT-4-0314 does appear to be less broken than the current GPT-4. Although it understandably misidentifies some of my code as problematic given its lack of context, it isn't suggesting anything that's clearly wrong under every circumstance even after re-running the prompt a few times.


👤 lynx23
Let me guess: the "woke fine-tuning" deteriorates the quality of the model. Seriously, this lecturing style of GPT is totally useless. You ask it something vaguely unwoke, it lectures you about your style, you explain what it didn't understand about your request, it apologizes and does vaguely what you originally requested. This sort of detour is totally useless and a waste of my time.

To stay with the terminology, if I had an assistant that went out of their way to tell me the supposed errors of my way, I'd fire that person and get a new one. Same for GPTs: Once there is a good-quality non-woke GPT instance anywhere, I will cancel my OpenAI account immediately and move.