Regression to predict long-tailed variables?

Question

Suppose I was trying to predict something like the number of votes an article gets on social media. I have a lot of data, millions of article submissions, but the data is noisy and has a large range that is nowhere near normally distributed. For instance, the same article might get submitted 5 times to the same web site and get 0, 0, 9, 35, and 172 votes. That's the output variable. The input variables are derived from the text of the article; these could be "bag of words" features, the output of a BERT-like embedding, or ideally, a fine-tuned model that outputs a number. I've got models working that treat it as a classification problem and predict whether an article crosses a threshold, and those definitely "work", but I get the feeling they are throwing information away. On the other hand, the L2-norm and even the L1-norm don't seem appropriate here because of the noisiness and large range of the output variable. Is there a better way to do this?

reedmeyerson · Accepted Answer

Have you considered re-mapping the range of the output variable? You can map one one-dimensional random variable onto another by matching their CDFs: send each value to the point in the target distribution that has the same cumulative probability. (This works exactly for continuous random variables, but the technique can be approximated with discrete random variables.)
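To make the CDF idea concrete, here is a minimal sketch using only the Python standard library. It assumes the target distribution is a standard normal (the answer doesn't name one) and uses the empirical CDF of the training votes; the function names `fit_rank_transform`, `transform`, and `inverse` are made up for illustration. Ties, such as the many zero-vote submissions, are given a shared mid-rank, which is one way to handle the discrete case approximately.

```python
from bisect import bisect_left, bisect_right
from statistics import NormalDist


def fit_rank_transform(train_votes):
    """Build forward/inverse maps between vote counts and standard-normal scores.

    Assumption: the target distribution is N(0, 1); any continuous
    distribution with an inverse CDF could be swapped in.
    """
    sorted_votes = sorted(train_votes)
    n = len(sorted_votes)
    std_normal = NormalDist()

    def transform(v):
        # Mid-rank empirical CDF: tied values share one probability.
        lo = bisect_left(sorted_votes, v)
        hi = bisect_right(sorted_votes, v)
        p = (lo + hi) / 2 / n
        # Clamp away from 0 and 1 so the inverse normal CDF stays finite.
        p = min(max(p, 0.5 / n), 1 - 0.5 / n)
        return std_normal.inv_cdf(p)

    def inverse(score):
        # Map a predicted score back to the training value at that quantile.
        p = std_normal.cdf(score)
        index = min(int(p * n), n - 1)
        return sorted_votes[index]

    return transform, inverse


# The example counts from the question.
votes = [0, 0, 9, 35, 172]
transform, inverse = fit_rank_transform(votes)
scores = [transform(v) for v in votes]     # regression targets
recovered = [inverse(s) for s in scores]   # back on the vote scale
```

A regressor would then be trained on `scores` with an ordinary L2 loss, and its predictions passed through `inverse` to get back to vote counts. With millions of rows you would likely fit the mapping on a sample or on precomputed quantiles rather than the full sorted list.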

tlb · Answer

Predict log(votes + 1) instead.
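A minimal sketch of that transform, assuming the natural log and using the standard library's `math.log1p` / `math.expm1` pair; the helper names are hypothetical. One thing to keep in mind: a model trained with L2 loss on the log target estimates the mean of the logs, so mapping its output back gives something closer to a typical (geometric-mean-like) vote count than to the arithmetic mean.

```python
import math


def to_log_target(votes):
    """Compress the long tail: log(votes + 1), defined at votes == 0."""
    return math.log1p(votes)


def from_log_target(prediction):
    """Invert the transform; clip at 0 since vote counts can't be negative."""
    return max(0.0, math.expm1(prediction))


# The example counts from the question.
votes = [0, 0, 9, 35, 172]
targets = [to_log_target(v) for v in votes]

# What a perfect L2 fit on the log scale would predict for this article:
mean_log = sum(targets) / len(targets)
typical_votes = from_log_target(mean_log)   # roughly 8
arithmetic_mean = sum(votes) / len(votes)   # 43.2
```

The gap between `typical_votes` and `arithmetic_mean` shows how much the single 172-vote outcome dominates the raw average, which is the effect the log transform is meant to dampen.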