Are there best practices for comparing model performance beyond benchmark data when they may have different underlying datasets?
You can also break down by task here: https://paperswithcode.com/sota
For churn, you might go to time series forecasting first: https://paperswithcode.com/task/time-series-forecasting
They have this subtask which is a bit different because it's about novel products rather than continued sales, for example:
https://paperswithcode.com/task/new-product-sales-forecastin...
But it gives you an idea of how they organise things by task. I'm curious about other benchmarks and interfaces and would like to see what else is out there.
I think HuggingFace and Kaggle have some overlapping tasks with benchmarks of their own as well.
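On the original question of comparing models trained on different underlying data: one common approach is to score both on the same held-out set with the same metric, so the comparison is at least like-for-like at evaluation time. A minimal sketch with scikit-learn (the models, features, and labels here are hypothetical stand-ins, not from any benchmark above):

```python
# Sketch: compare two churn models on one shared holdout set,
# regardless of what data each was originally trained on.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                            # stand-in features
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)     # stand-in churn labels

# The shared holdout is the key: both models are scored on identical rows.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.3, random_state=0)

model_a = LogisticRegression().fit(X_train, y_train)
model_b = GradientBoostingClassifier().fit(X_train, y_train)

for name, model in [("logistic", model_a), ("gbm", model_b)]:
    auc = roc_auc_score(y_holdout, model.predict_proba(X_holdout)[:, 1])
    print(f"{name}: holdout AUC = {auc:.3f}")
```

The same idea carries over when the models come pre-trained on different datasets: skip the fitting step and just evaluate both on one holdout you control.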