I am thinking about creating a page for testing those claims. Perhaps something similar already exists. Either way, I think such a page could be useful for determining whether, in general, LLMs are useful for deep questions.
Obviously, that page should use an ensemble of the best models, with limits on the number of models, the time, and the computation budget. That costs real money.
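For concreteness, here is a minimal Python sketch of how such a page's backend might fan a question out to an ensemble under time and budget caps. `query_model`, the model names, and the limit values are all placeholder assumptions, not any particular vendor's API.

```python
import concurrent.futures
import time

# Placeholder assumptions: ensemble members, per-question time limit, and budget cap.
MODELS = ["model-a", "model-b", "model-c"]   # hypothetical "best models" ensemble
TIME_LIMIT_S = 120                            # wall-clock limit per question
BUDGET_USD = 1.00                             # spending cap per question

def query_model(model: str, question: str) -> tuple[str, float]:
    """Hypothetical wrapper: send `question` to `model`, return (answer, cost in USD)."""
    raise NotImplementedError("plug in a real API client here")

def ask_ensemble(question: str) -> list[dict]:
    """Fan a question out to the ensemble, stopping once the time or budget cap is hit."""
    answers, spent = [], 0.0
    deadline = time.monotonic() + TIME_LIMIT_S
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
        futures = {pool.submit(query_model, m, question): m for m in MODELS}
        for fut in concurrent.futures.as_completed(futures, timeout=TIME_LIMIT_S):
            model = futures[fut]
            try:
                answer, cost = fut.result()
            except Exception as exc:
                answers.append({"model": model, "error": str(exc)})
                continue
            spent += cost
            answers.append({"model": model, "answer": answer, "cost": cost})
            if spent >= BUDGET_USD or time.monotonic() >= deadline:
                break   # real money: stop once the caps are reached
    return answers
```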
I think the battle between Wikipedia's editors and contributors and LLMs is going to be fierce once LLMs reach the level where they can question the basic assumptions of editors in their respective fields.
Edited: Edited a lot.
[1] GPT-5 is behind schedule (wsj.com) https://www.wsj.com/tech/ai/openai-gpt5-orion-delays-639e7693
[2] Excerpt: I've never gotten an answer from an LLM to a tricky or obscure question about a subject I already know anything about that seemed remotely competent.