Is anyone properly/scientifically testing whether they get the same level of quality over time? Model distillation seems to work well on benchmarks, so they could "easily" improve their gross margins by swapping in a smaller/cheaper model.
E.g. we've seen isolated performance drops from gpt-4o-2024-05-13 to the September version, which also came with big price cuts.
WDYT?
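One way to test the "same quality over time" question without trusting vibes: run a fixed, private eval set against each dated snapshot and do a paired significance test on the per-prompt results. A minimal stdlib-only sketch of a two-sided sign test (the scores below are made-up illustration data, not real eval results):

```python
from math import comb

def sign_test_p(wins: int, losses: int) -> float:
    """Two-sided sign test p-value: how likely a win/loss split this
    lopsided would be if both model versions were truly equal (p=0.5).
    Ties are excluded, as is standard for the sign test."""
    n = wins + losses
    k = max(wins, losses)
    # Upper tail P(X >= k) under Binomial(n, 0.5), doubled for two sides
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    return min(1.0, 2 * tail)

# Hypothetical per-prompt pass/fail scores from two dated snapshots
# (in practice: same prompts, same grader, run against each version).
may_scores  = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
sept_scores = [1, 0, 0, 1, 0, 1, 0, 1, 0, 1]

wins   = sum(a > b for a, b in zip(may_scores, sept_scores))
losses = sum(a < b for a, b in zip(may_scores, sept_scores))
print(f"wins={wins} losses={losses} p={sign_test_p(wins, losses):.3f}")
```

With only a handful of prompts the p-value stays large, which is the point: detecting a quiet distillation swap needs a decently sized eval set run repeatedly over time.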
Another possibility is that the price cuts come from weight quantization, which can also degrade quality.
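For intuition on why quantization can degrade quality: here's a toy NumPy sketch of symmetric int8 weight quantization (random data, not any vendor's actual scheme), showing the round-trip error it introduces:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(1000).astype(np.float32)  # stand-in fp32 "weights"

# Symmetric int8 quantization: map [-max|w|, max|w|] onto [-127, 127]
scale = float(np.abs(w).max()) / 127
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_hat = q.astype(np.float32) * scale  # dequantized weights

err = float(np.abs(w - w_hat).max())
print(f"max round-trip error: {err:.5f} (quantization step = {scale:.5f})")
```

Each weight moves by up to half a quantization step; individually tiny, but across billions of weights those perturbations can add up to measurable accuracy loss, and the bigger draw is that int8 halves (or quarters) memory and bandwidth versus fp16/fp32, which is exactly the kind of serving-cost win that could fund a price cut.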