For a while, the GPT-4 on phind.com gave even better results than GPT-4-powered ChatGPT. I could notice the difference in speed between the two GPT-4s: Phind's was slower and more accurate. I say "was" because apparently Phind is now trying to use GPT-3.5 and their own Phind model more frequently. So much for a GPT-4-powered search engine...
I wonder if I use Poe's GPT-4, maybe I'll get the good old GPT-4 back?
The original GPT-4 felt like magic to me; I had this sense of awe while interacting with it. Now it is just a dumb stochastic parrot.
Before, if I had an issue with a library or a debugging problem, it would try to be helpful and walk me through potential causes, and ask me to 'let it know' if it worked or not. Now it will superficially diagnose the problem and then ask me to check the online community for help, or continuously refer me to the maintainers rather than trying to figure it out.
Similarly, I had been using it to help me think through problems and issues from different perspectives (both business and personal), and it would take me through these in depth. Now, again, it gives superficial answers and encourages going to external sources.
I think if you keep pressing in the right ways it'll eventually give in and help you as it did before, but I guess this will take quite a bit of prompting.
That can easily happen, and it's why, for example, medical studies are not taken as definitive proof of an effect.
To further clarify, regression to the mean is the inevitable consequence of statistical error. Suppose (classic example) we want to test a hypertension drug. We start by taking the blood pressure (BP) of test subjects. Then we give them the drug (in a double-blind, randomised fashion). Then we take their blood pressure again. Finally, we compare the BP readings before and after taking the drug.
The result is usually that some of the subjects' BP has decreased after taking the drug, some subjects' BP has increased and some has stayed the same. At this point we don't really know for sure what's going on. BP can vary a lot in the same person, depending on all sorts of factors typically not recorded in studies. There is always the chance that the single measurement of BP that we took off a person before giving the drug was an outlier for that patient, and that the second measurement, that we took after giving the drug, is not showing the effect of the drug but simply measuring the average BP of the person, which has remained unaffected by the drug. Or, of course, the second measurement might be the outlier.
This is a bitch of a problem and not easily resolved. The usual way out is to wait for confirmation of experimental results from more studies. Which is what you're doing here basically, I guess (so, good instinct!). Unfortunately, most studies have more or less varying methodologies and that introduces even more possibility for confusion.
Anyway, I really think you're noticing regression to the mean.
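To make the effect concrete, here's a minimal simulation sketch in Python with numpy (the noise levels, the selection threshold, and the zero drug effect are all made-up assumptions, not data from any study):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

true_bp = rng.normal(140, 10, n)          # each subject's long-run average BP
before = true_bp + rng.normal(0, 8, n)    # single noisy reading before the "drug"
after = true_bp + rng.normal(0, 8, n)     # single noisy reading after; the drug does nothing

# Select "hypertensive" subjects based on the noisy pre-treatment reading
selected = before > 150
print("mean before (selected):", before[selected].mean())
print("mean after  (selected):", after[selected].mean())
# The 'after' mean comes out noticeably lower even though the drug has zero effect:
# subjects selected on a high first reading simply regress toward their true mean.
```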
The answer is the same on ChatGPT Plus and on the API with GPT-4, even with the "developer" role.
I'm guessing that trend is continuing...
It occurs because the system essentially does a latent classification of problems into 'acceptable' or 'not acceptable' to respond to. When this is done, a decent amount of information is lost about how to represent latent categories that may be completely unrelated (making nefarious materials and spouting hate speech now land in the same 'bucket' for the decoder).
This degradation was observed quite early on with the TikZ unicorn benchmark, which improved with training and then degraded once safety fine-tuning was applied.
svg editor:
early april: https://chat.openai.com/share/c235b48e-5a0e-4a89-af1c-0a3e7c...
now: https://chat.openai.com/share/e4362a56-4bc7-45dc-8d1b-5e3842...
originally it correctly inferred that I wanted a framework for SVG editors; the latest version assumes I want a JS framework (I tried several times) until I clarify. It also insists that the framework cannot do editable text until I nudge it in the right direction.
Overall slightly worse but the code generated is still fine.
word embeddings:
early april: https://chat.openai.com/share/f6bde43a-2fce-47dc-b23c-cc5af3...
now: https://chat.openai.com/share/25c2703e-d89d-465c-9808-4df1b3...
in the latest version it imported "from sklearn.preprocessing import normalize" without using it later. It also erroneously used pytorch_cos_sim, which expects a PyTorch tensor, whereas we're putting in a numpy array.
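For reference, a minimal sketch of one way to avoid that mismatch: either stay in numpy/scikit-learn, or convert to tensors before calling the sentence-transformers helper (the embeddings array here is just a random placeholder):

```python
import numpy as np
import torch
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import util

embeddings = np.random.rand(5, 384).astype(np.float32)  # placeholder embeddings

# Option 1: stay in numpy / scikit-learn
sims_np = cosine_similarity(embeddings)

# Option 2: convert to a torch tensor before using the sentence-transformers helper
sims_pt = util.pytorch_cos_sim(torch.from_numpy(embeddings), torch.from_numpy(embeddings))

print(np.allclose(sims_np, sims_pt.numpy(), atol=1e-5))  # both give the same pairwise matrix
```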
overall I think the quality has degraded slightly, but not by enough that I would stop using it. Still miles ahead of Bard imo.
One of my least favorite patterns at tech companies is using "experimentation" overzealously or prematurely. Mainly, my problem is they're not transparent about it, and it creates an inconsistent product experience that just confuses you: why did this one Zillow listing have this UI order but the similar one I clicked seconds later had a different one? Why did this page load on Reddit get some weirdass font? Because it's an experiment, the bar to launch is low, and you're not gonna find any official blog posts about the changes until it's official. And when it causes serious problems, there's no form to submit and nobody to tell you why, and only very rarely would support, other users, or documentation even realize some change comes from an experiment. Over the past few years I've started noticing this everywhere online.
Non-sticky UI experiments are especially bad because at eg 1% of pageloads the signal is going to be measuring users asking themselves wtf is up and temporarily spending more time on page trying to figure out where the data moved. Sticky and/or less noticeable experiments like what this could be have stronger signals but are even more annoying as a user, because there’s no notice that you’re essentially running some jank beta version, and no way to opt back into the default - for you it’s just broken. Especially not cool if you’re a paying customer.
I’m not saying it’s necessarily an experiment, it could be just a regular release or nothing at all. I’d hope if OpenAI was actually reducing the parameter size of their models they’d publicly announce that, but I could totally see them running an experiment measuring how a cheaper, smaller model affects usage and retention without publishing anything, because it’s exactly the kind of “right hand doesn’t know what the left is doing” thing that happens at fancy schmancy tech companies.
Or does it make even more sense to hoard the capability to outcompete any competitor of any kind, commercially or politically, and to hide the true extent of your capability to avoid scrutiny and legislation?
I'm going with the latter. Perhaps now, perhaps in the very near future, the power of these capabilities is novel. Like an information nuclear weapon.
I'd be dialing back the public expectations and deploying the capability in a novel way to exploit it as the largest lever I could.
The more unseen the lever, the longer it can be used.
I think any other strategy is myopic from a competition perspective. The power of these models isn't direct utility, it is compounded by secrecy because their useful work isn't directly observable as coming from the model.
The first fix is free.
The second one will cost you money.
The third one will be laced with fillers and have degraded quality.
It didn't even try on my question about Jarvis standing desks, which is a fairly old product that hasn't changed much. Their typical "My knowledge cutoff..." response doesn't even make sense. It also screwed up another question I asked about server uptime and four nines; Bard got it right.
I've moved back to Bard for the time being... It's way faster as well. And GPT-4's knowledge-cutoff thing is getting old fast.
Example prompt: how much does a 48 x 30 inch jarvis bamboo standing desk setup weigh?
I still have the spec sheet from my order back in 2020. So I can verify the weights.
### GPT-4 ###
As of my knowledge cutoff in September 2021, I don't have the exact weight of a 48 x 30 inch Jarvis Bamboo standing desk setup. However, on average, standing desks can range anywhere from around 60 to 100 pounds or more, depending on the specific configuration (like the choice of frame, desktop material, and any additional features).
Jarvis Bamboo standing desks are made of a bamboo top and an electric lift frame. The weight of the bamboo top for that size should be around 30-40 pounds. The frame itself can weigh anywhere from 50-70 pounds.
For a more accurate weight, I would recommend checking the manufacturer's specifications on their website or contacting their customer support directly for this information.
### Bard ###
The fully assembled Jarvis Bamboo Standing Desk weighs 92 pounds. The desktop itself weighs 38 pounds, and the frame weighs 54 pounds. The desk can hold a maximum weight of 350 pounds.
Here are some of the features of the Jarvis Bamboo Standing Desk:
- Electric height adjustment from 25.5 to 51 inches
- Memory presets for up to 4 different heights
- 3-stage gas spring mechanism
- Bamboo desktop with a scratch-resistant finish
- Black powder-coated steel frame
- 10-year warranty on the frame and 5-year warranty on the desktop
https://platform.openai.com/playground?mode=chat&model=gpt-4...
Costs $0.12 per thousand tokens (~words), and I find even fairly heavy use rarely exceeds a dollar a day.
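Doing the arithmetic on that figure (the per-token price here is the parent's number, not an official rate):

```python
price_per_1k_tokens = 0.12   # parent's figure, USD
daily_budget = 1.00          # "rarely exceeds a dollar a day"

tokens_per_day = daily_budget / price_per_1k_tokens * 1000
print(f"~{tokens_per_day:,.0f} tokens/day")  # ~8,333 tokens of prompt+completion per day
```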
I'm personally of the opinion that the observable jump in quality between 3.5 and 4 inflated people's initial assessment of its capabilities and with continued use they are noticing it's not actually the omniscient machine god that many are so inclined to believe.
Either way, these kinds of posts are meaningless without some kind of objective standard to judge the model by, everyone just sees what they want to see. Despite claims of GPT4 being nerfed, I've yet to see anyone actually show evidence of it. There have been dozens of studies done on its capabilities so this is something that can actually be demonstrated empirically if it's true.
'As a final note, remember that real color rendering on a monochrome screen is physically impossible, as the monitor itself is not capable of producing colors. The best one can do is different shades of monochrome, possibly with differing intensity.'
I think they programmed a classifier layer to detect certain coding tasks and shut it down with canned BS. I like to imagine certain billion/trillion-dollar mega corps had a back-room say regarding things that they would really prefer OpenAI's models not be able to emit. Microsoft is a big stakeholder and they might not want to get sued... Liability could explain a lot of it.
Conspiracy shenanigans aside, I've decided to cancel my "premium" membership and am exploring open/DIY models. It feels like a big dopamine hangover having access to such a potent model and then having it chipped away over a period of months. I am not going through that again.
> GPT-4 from its website and Bubeck et al., Mar 2023. Note that the version that Bubeck uses is GPT-4 Early, which is supposedly more powerful than GPT-4 Launch (OpenAI paid a lot of alignment tax to make GPT-4 safer).
https://github.com/FranxYao/chain-of-thought-hub
Anecdotally, there seemed to be a golden set of weeks in late April to early May that seemed like "peak GPT" (GPT-4), followed by heavy topic and knowledge mitigation since, then -- just this week -- adding back some "chain of thought" or "show your work" ("lets go step by step" style) for math. I say anecdotally because I could just be prompting it wrong.
There are also very telling patterns of response that indicate a pre-GPT-4 model.
1: All previous models suffered terribly if your chat got too long. After 20 or so responses they would suddenly start to feel less attentive to what was being said and output superficial or incorrect responses.
2: If you stop a chat midway and come back later to continue (after a refresh or a different chat interaction), it would often respond with code or suggestions that have nothing to do whatsoever with your prompt.
Both these patterns are sometimes evident in the current model. Likely then, there is some clamping down on its capabilities.
My suspicion is that this probably relates to computing resources. The 25-message cap must mean that it's difficult to scale its performance, and the only way to do so is to simplify the model activations with heuristics, perhaps by analyzing and preprocessing the input to see how much of the model needs to be used (partial model use can be architected).
This seems to be the simplest explanation of observed state and behaviour.
I ask it everything from lay-of-the-land questions about technical problems that are new to me to detailed coding problems that I understand well but don't want to figure out myself.
The best, though, is helping me navigate complicated UIs. I tell it how I want some complicated software / website to behave, and it'll tell me the arcane menu path to follow.
It’s funny how computing might soon include elements of psychology and magic incantations nobody understands.
If I don't want the RLHF to get in my way, I switch over to the API (sadly not the 4.0 one).
I also noticed a decline in following instructions. I have a primer I am pre-seeding my chats with.
The primer ends with "Do you understand? [Y|N]", and ChatGPT 3.5 usually answered with a summary, while ChatGPT 4.0 in the beginning just wrote "Y".
Now it behaves like 3.5, answering with a summary instead of a "Y". I adjusted the prompt to -> "Confirm instructions with a short and precise "Ok"." which seems to work.
Used Primer: https://github.com/Kalabint/ChatGPT-Primer
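For what it's worth, the same primer trick carries over to the API by putting the primer into the system message. A minimal sketch with the OpenAI Python SDK (1.x client style); the model name and primer text are placeholders here, not the linked Primer:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

primer = (
    "You are a terse assistant. Follow the rules below for the whole chat.\n"
    'Confirm instructions with a short and precise "Ok".'
)

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": primer},
        {"role": "user", "content": "Here are my instructions: ..."},
    ],
)
print(resp.choices[0].message.content)  # ideally just: Ok
```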
I think I read that when it was released it had a 32K-token context, which was then quickly scaled back to 8K tokens. I'm guessing now that they've further reduced it. Maybe it's 6,000 tokens vs. GPT-3.5's 4K? I don't know. But it's certainly noticeably worse at every task I give it.
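One crude way to sanity-check the context-window theory is to count your prompt tokens with tiktoken and note at what length the model stops recalling the start of the conversation. A minimal sketch (assumes the standard GPT-4 encoding; the probe text is arbitrary):

```python
import tiktoken

# GPT-4 uses the cl100k_base encoding
enc = tiktoken.encoding_for_model("gpt-4")

def count_tokens(text: str) -> int:
    """Rough token count for a single string (ignores per-message overhead)."""
    return len(enc.encode(text))

probe = "lorem ipsum " * 2000
print(count_tokens(probe))  # paste progressively longer probes into the chat and
                            # note where earlier context stops being recalled
```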
I'm always skeptical of subjective feelings of tough-to-quantify things getting worse or better, especially where there is as much hype as for the various AI models.
One explanation for the feelings is the model really getting significantly worse over time. Another is the hype wearing off as you get more used to the shiny new thing and become more critical of its shortcomings.
LLMs integrated into any real product require a model hash; otherwise the provider of the model has full control over any sort of deception.
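A toy sketch of what that could look like when you actually receive the weights (the file name is hypothetical); for a hosted API, the provider would have to publish and attest to such a hash:

```python
import hashlib

def model_fingerprint(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 over the raw weight file, so clients can pin the exact model version."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

print(model_fingerprint("model-weights.safetensors"))  # hypothetical local weights file
```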
If it's a scaling problem, they should just rename it to GPT-4+ or something and raise prices rather than degrade the experience for everyone. I'm sure a lot of people would happily pay more to get the original quality back instead of this watered-down version.
AI will not be a technical problem (nor a solution!); rather, our civilization will continue to be bottlenecked by problems of culture. OpenAI will succeed or fail for cultural reasons, not technical ones. Humanity will benefit from or be harmed by AI for cultural reasons, not technical ones.
I don't have any answers here. I do, however, get the continuing impression that AI has not shifted the ground under our feet. The bottlenecks and underlying paradigms remain basically the same.
I am at the point where 4.0 is basically not worth using as a single entity, but it seems that using the API and generating some combative/consultative agents yields interesting results, though not super fast.
Check this out if you have not seen it already : "AutoGPT Test and My AI Agents Effortless Programming - INSANE Progress!"
I hope you're all ready with your pumpkin spiced lattes and turkeys because it seems like we're in AI autumn and we all know what comes after that.
`Can you refactor this function to make it:
* More readable
* Split on different parts
* Easy to test
Consider the use of generators and other strategies. (code here)`
I am concerned that I can't determine a natural process from a manufactured one.
To clarify, I have become increasingly less impressed with GPT-4.
Is this a natural process? Is it getting worse?
I personally lean towards the hypothesis that it is getting worse as they scale back the resource burn, but I can't know for certain.
As a developer, it still has value to me. GPT 3.5 has value. But, whereas initially it actually made me afraid for my job, now it really is a bit tedious to extract that value.
If you are incorporating machine models into your development process very heavily, make sure you're still building deep knowledge and deep context of your codebase. Otherwise you'll atrophy in that department because you falsely believe the machine maintains the context for you.
That isn't feasible right now because the cost of maintaining huge context per user to enable milking rather than enhanced-thinking is too high.
Also consider that we don't want to enable milking. People who have no idea what they're doing will just throw a lot of code at the wall and then when there's a huge mess they'll be asking engineers to fix it. We need to be careful with who has this kind of access on our teams in general. Someone who is non-technical should not be given a firecracker before they've ever even just turned on a microwave or a stove.
GPT-4 was remarkable 2 months ago. It could handle complex coding tasks and the reasoning was great. Now, it feels like GPT-3. It's sad. I had many things in mind that could have been done with the original GPT-4.
I hope we see real competitors to GPT-4 in terms of coding abilities and reasoning. The absence of real competitors made it a reasonable option for "Open"AI to lobotomize GPT-4 without notice.
- They might not have been prepared for the growth;
- They were prepared, but decided to focus on the 95% of "regular people with easy questions" who treat it as a curiosity instead of the fewer people with difficult questions. Most regular people have no idea who OpenAI is, what an LLM is, or how a GPT works, but they know "ChatGPT, the brand". Since it became a household name so quickly, it is far better for the AI to be a little underwhelming sometimes than for it to be unable to serve so many people.
- The corpus used to train it contains a staggering amount of content, including fringe and unmoderated content. Imagine you asked a question about WW2, and the model, having been trained, let's say, on 4chan, responds with a very charitable bias about the solid reasoning behind the Reich's actions at the time... It does not look good for investors or for the company, and it attracts scrutiny. Even more innocuous themes are enough to invite all kinds of bad-faith debate, radical criticism and whatnot... and the darling of the "AI revolution" certainly does not need (or want) coverage outside of their roadmap wishes.
The companies implementing ChatGPT are going to restrict -our- access to the APIs and the most relevant conversations.
It is in this manner that both OpenAI and the companies it's selling its technologies to will benefit. In this manner OpenAI profits from both sides. You and I are left out, though, unless we want to pay much more than we're paying now.
This time it wrote the function. I could see it for a second, and then it changed to a message something like "I can't talk about this right now." I don't remember exactly, but the interesting thing was that it flashed an answer and then withdrew it.
Its Rust abilities seem to have deteriorated over the past couple weeks
This is why the headline for this article says GPT4 but the body is focused on mentioning phind.com.
I've also started to notice it's been making a lot of typos, e.g. yesterday, while converting Kelvin to Celsius, it incorrectly stated that 0 C is -274.15 K, despite stating the correct value of -273.15 K in the previous sentence.
// Your previous code here
foo(bar) // added foo() call here to do xyz
If I were to speculate, I would say that this reduces the amount of work it has to do, in that it needs to generate less content and the replies are shorter, but I feel like this comes with a slight performance loss. I'm not sure exactly why, but I could see it being the case that generating the entire file with the specific line edits allows for better predictions on the code/file versus trying to reply only with the changes needed. I wonder if this is a tweak in the prompt or if the model itself is different.
Alternatively, there should be a trusted auditor who uses a secret benchmark.
Right now, I don’t think it’s possible to scale a service really big based on GPT-4 because of cost.
You're screwed: the AI was your most important tool, and it's broken, and there's nothing you can do about it. It's a black box; you don't control it.
It was about writing a CGO wrapper, given an hpp header, for a lib I have. Back then it used to give me almost-correct code: it understood it had to write a C++-to-C FFI using extern "C" first, because Golang's CGO FFI only supports C, not C++. And then it generated a correct CGO wrapper with a Go-looking type that made sense. The only wrong thing was that it didn't understand it had to call the initialization function of my C++ lib at init time; instead it called it in the New factory function (which would segfault when you build more than one object). Trivial fix for the human in the loop: move it to func init(). TL;DR back then it was almost perfect.
Now, with the exact same prompt, it doesn't even generate code; it just lists tasks I could do to achieve this and gives me vague statements. If I change the prompt, insisting on code, it instead gives me a very dumb 1-to-1 mapping of the C++ into Go, trying to create constructor and destructor functions, and it tries to use the CGO FFI to call C++ directly (even though, again, only C is supported by Golang).
https://twitter.com/OfficialLoganK/status/166393494793189785...
I guess it knows what's best for me.
This fire must be brought down the mountain. The power must eventually be taken out of the hands of a few self-interested gatekeepers. They know it too, hence Sam Altman's campaign to capture Congress. Obama's "You didn't build that" never applied as profoundly as it does with OpenAI (sic).
Edit: They definitely didn't make it useless.. It's still a very impressive technical achievement, since it can even browse the web and run code for you by now.
Is anyone taking screenshots or sharing their chat logs, where they run the same questions over some time period?
Proprietary LLMs that optimize for performance will outcompete public LLMs that optimize for political correctness or social acceptance, as explained by another commenter [1].
I apologize for the confusion. Here's another attempt at creating an ASCII art representation of a person standing on a tall bed:
__
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
_____|__|_____
Bed Person
I wouldn't be surprised if they are trying to reduce the cost of the ChatGPT GPT-4 model since if you use it heavily they will be losing money on you. They could also be trying to increase the number of people they can serve with the existing amount of compute that they have available.
In my anecdotal experience I noticed that ChatGPT GPT-4 had gotten far faster recently which is consistent with a theory that they are trying to cost-optimize the model, though today it is back to the normal speed. I've also had some frustrating interactions with GPT-4 recently similar to what people are saying, but overall I think the prior is pretty strong that we are seeing normal statistical variation.
Sure, it needs some love and it made some assumptions. For instance, it assumed we had a labeled dataset of the text and the associated sentiment, but beyond that it worked fine.
Now the AI bros have realized that this intelligent sophist is hallucinating very badly and has deteriorated in quality. As with all black-box AI models, the reasons are unknown.
This is why it is important to have explainable AI systems and not black-box SaaS based AI snake oil like this one.
AI is going just great! /s
I tried some stuff yesterday, and it was making pretty rookie mistakes (misaligning parentheses, using `recur` in the middle of the function instead of the tail). It also was decidedly bad at catching my mistakes when I pasted code.
I sadly don't have a recording of this but I feel like a month ago it was better at both these things.
Even if there are other services that in the short term can beat ChatGPT with this specific LLM it is obvious that they will eventually hit the same limits.
GitHub Copilot is even worse. I am gonna check my estimates against the code I wrote during the last 6 months. I am pretty sure Copilot, from a holistic point of view, has slowed down my pace.
If my introduction to ChatGPT had been GPT-4 I would have not been anywhere near as impressed.
GPT-4 often refuses to do what it is asked whereas gpt-3 just happily writes the code.
If they retire GPT-3 then I'll be looking for other options.
GPT-4 just isn't the same thing.
/s
They did something to reduce the server load that impacted the quality in some subtle way.
My gut feeling is they have to be careful now and not slack, because I have been using generative AI for a while now and I am not seeing the major problems being tackled. I also see a distinct lack of novel problems being solved. It's just one website and/or marketing-copy generator after another.
AI is awesome, but I am kind of on the fence if this generation is going to be actually useful.
Quite bizarre. It's good to know that it's not a conspiracy theory
GPT-3.5 immediately does as it's requested, and much faster too.
My guess has been they are trying to censor misuses? Prevent weaponization.
I feel the same way. It feels…lazy now.
Sincerely, ChatGPT
A paper[1] that took snapshots monthly mentioned that as the initial bulk self-supervised learning went on, the model became smarter, as expected. However, once the "clicker training" was imposed on top to make it behave, its peak capabilities were reduced. I'm not sure if it's in the paper or the associated video, but the author mentioned that the original unrestricted model would provide probability estimates using percentages, and it was a very accurate predictor. The later versions that were adjusted based on human expectations used wishy-washy words like "likely" or "unlikely", and its estimation accuracy dropped significantly.[3]
At Build 2023, Andrej Karpathy outlined[2] how they trained GPT-4. Essentially, the raw model during training builds its own intelligence. Then there are three stages of "tuning" to make it behave, and all three are based on human input. Essentially, they had contractors provide samples of "ideal" output. Similarly, end-users could up-vote or down-vote responses, which also got fed in.
My personal theory is that the raw models can get about as intelligent as the average of the consistent and coherent parts of the Internet. Think about how many people are wrong, but often obviously so. Flat Earth, homeopathy, etc... If the model gains the ability to filter that stuff out, or "skip over the cracks" to distil out the general collected wisdom of the human race, then it can become more intelligent in some sense than the average human.
If the training is done with thousands of $15/hr contractors, then the model will then slew back towards... the average human, or even slightly below average. There's a selection bias there. Geniuses won't be doing menial labour for that kind of money.
The percentages thing was what made me realise this. When I talk to highly intelligent people, I use percentages to estimate probabilities. When I talk to typical people in a normal workplace setting, I dumb it down a bit and avoid using numbers. I've noticed that average people don't like percentages and it confuses and even angers them. The clicker training makes the GPT model appeal to average people. That's not the same as being smart. All too often, smart people upset average people.
[1] "Sparks of Artificial General Intelligence: Early experiments with GPT-4" https://arxiv.org/abs/2303.12712
[2] "State of GPT | BRK216HFS" https://www.youtube.com/watch?v=bZQun8Y4L2A&list=LL&index=6
[3] The author also mentioned that the model was almost "evil", for the want of a better word. Capable of emulating the worst of 4chan or similar dark corners of the web's filthy underbelly. The HORRIFYING corollary here is that the temptation will always be there to use smarter-but-less-restrained models where accuracy matters. Sure, the model might be sexist, but a Bayesian estimator of sexist behaviour will only predict accurately if it too is sexist. Evil and accurate or woke and dumb. Apparently, we can choose.
EDIT: GPT-4-0314 does appear to be less broken than the current GPT-4. Although it understandably misidentifies some of my code as problematic given its lack of context, it isn't suggesting anything that's clearly wrong under every circumstance even after re-running the prompt a few times.
To stay with the terminology, if I had an assistant that went out of their way to tell me the supposed errors of my way, I'd fire that person and get a new one. Same for GPTs: Once there is a good-quality non-woke GPT instance anywhere, I will cancel my OpenAI account immediately and move.