Am I missing something?
For a while I ran an AI app with summarisation features, using T5/BART in 2020 and then generative models from 2021 (via the GPT-3 API). The unfortunate truth is that when I tried Babbage (roughly GPT-2 sized), it was OK on happy-path examples, but user retention crashed.
Even Curie was much worse than Davinci in terms of user satisfaction.
One interesting finding from my data: shorter outputs showed less quality difference between models, while for longer responses it mattered much more which model you used (this was before ChatGPT, and even before the instruct/RLHF GPT models, so before today's verbose conversational defaults constrained the shape of the output).
For AI applications, users generally have a one-strike policy on trust, so you kind of need to overdial things to retain them. It might be different for dataset work, but there absolutely is a quality difference over enough uses.
BART and T5 were initially "better" than barebones GPT-2 for summarisation, but that doesn't mean anybody really uses them for it anymore, even though summarisation was the task they were most optimised for when they were created.
I do think most of us feel summaries from the latest models are about as good as can reasonably be expected. So sure, if GPT-5 arrives and is particularly great at the remaining tricky tasks, why not stick with GPT-4 for summaries.
That's not to mention context length: the easy "use it today" options like ChatGPT or GPT-4 accept much longer inputs than the GPT-2 models, whose context windows are far shorter (1024 tokens for GPT-2, vs 8K+ for GPT-4).
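In practice, a small context window means you end up chunking long documents before summarising. A minimal sketch of that kind of chunker, using a rough ~4 characters-per-token approximation (a hypothetical ratio; a real implementation would count tokens with the model's actual tokenizer):

```python
def chunk_text(text: str, max_tokens: int = 1024, chars_per_token: int = 4) -> list[str]:
    """Split text into chunks that roughly fit a model's context window.

    Token counts are approximated as ~4 characters per token, which is
    only a heuristic; use the model's tokenizer for exact counts.
    """
    max_chars = max_tokens * chars_per_token
    chunks = []
    while text:
        if len(text) <= max_chars:
            chunks.append(text)
            break
        # Prefer to split on a sentence boundary within the window.
        cut = text.rfind(". ", 0, max_chars)
        if cut == -1:
            cut = max_chars  # no boundary found; hard-cut at the limit
        else:
            cut += 1  # keep the period with the current chunk
        chunks.append(text[:cut].strip())
        text = text[cut:].lstrip()
    return chunks
```

You would then summarise each chunk separately and optionally summarise the concatenated summaries, which is exactly the extra plumbing a long-context model lets you skip.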
The only point of keeping using GPT-2 is to compare it in benchmarks against newer models.
The better alternatives are often bigger, but you will also find better models that are smaller or similarly sized. Still, as a rule of thumb, the bigger the model, the better the output quality, so aim for the best model that fits on your hardware. Personally I wouldn't go below 6B parameters server-side anymore. When buying new hardware, I'd aim for something that can run at least a 30B model, or 13B if you're on a budget.
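As a rough back-of-the-envelope for "fits on your hardware": the weights alone take about params × bytes-per-parameter of memory. A sketch of that arithmetic (it deliberately ignores KV cache, activations, and framework overhead, which add a meaningful margin on top):

```python
def approx_weight_gb(n_params_billion: float, bytes_per_param: float) -> float:
    """Rough memory needed just for the weights, in GB.

    Ignores KV cache, activations, and runtime overhead, which
    require extra headroom in practice.
    """
    return n_params_billion * 1e9 * bytes_per_param / 1e9

# fp16 = 2 bytes/param; 4-bit quantized is roughly 0.5 bytes/param
for size in (6, 13, 30):
    print(f"{size}B: fp16 ~{approx_weight_gb(size, 2):.0f} GB, "
          f"4-bit ~{approx_weight_gb(size, 0.5):.1f} GB")
```

By this estimate a 13B model wants about 26 GB at fp16 but closer to 7 GB with 4-bit quantization, which is why quantization is usually what makes the 13B-and-up range reachable on consumer GPUs.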
The benchmarks aren't perfect, but here is a leaderboard for open large language models: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...
You can see that GPT-2 is at the bottom.
E.g. summarization of scientific articles, cooking blog posts, news articles, and racing events are all completely different use-cases. A fine-tuned GPT-2 may or may not match GPT-4's performance, but I would bet out-of-the-box GPT-4 beats GPT-2. On the other hand, you also need to weigh the cost of GPT-2 vs GPT-4 against how quality-sensitive the particular use-case is.
The big deal about GPT-4 is that it does many novel tasks very well with a fairly cheap usage-based pricing model and zero infrastructure or fine-tuning required. Businesses like this.
GPT-4 isn't necessary for most tasks; it's only really useful for the genuinely complex ones that humans have trouble with. I'd say GPT-3.5-turbo is the sweet spot for most tasks. 3.5 appears to perform somewhat worse than the GPT-3 davinci model, but it's a heck of a lot cheaper.