HACKER Q&A
📣 PaulHoule

Regression to predict long-tailed variables?


Suppose I was trying to predict something like the number of votes an article gets on social media. I have a lot of data, millions of article submissions, but the data is noisy and has a large range that is nowhere near normally distributed.

For instance, the same article might get submitted 5 times to the same web site and get 0, 0, 9, 35, and 172 votes. That’s the output variable. The input variables are derived from the text of the article; these could be “bag of words” features, the output of a BERT-like embedding, or ideally, a fine-tuned model that outputs a number. I’ve got models working that treat it as a classification problem and predict whether an article crosses a threshold, and those definitely “work”, but I get the feeling they are throwing information away. On the other hand, the L2-norm and even the L1-norm don’t seem appropriate here because of the noisiness and large range of the output variable.

Is there a better way to do this?


  👤 reedmeyerson Accepted Answer ✓
Have you considered re-mapping the range of the output variable? You can map one-dimensional random variables onto each other by comparing their CDFs. (This works exactly for continuous random variables, but it can only be approximated for discrete random variables.)
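
A minimal sketch of that idea, assuming the target distribution is a standard normal and the source CDF is estimated empirically from the training votes. The names `fit_cdf_map`, `to_normal`, and `to_votes` are made up for illustration, and the mid-rank handling of ties is one choice among several for the discrete case:

```python
from bisect import bisect_left, bisect_right
from statistics import NormalDist


def fit_cdf_map(train_votes):
    """Build a forward map (votes -> normal score) and an approximate inverse."""
    sorted_votes = sorted(train_votes)
    n = len(sorted_votes)
    std_normal = NormalDist()  # assumed target: mean 0, standard deviation 1

    def to_normal(v):
        # Mid-rank empirical CDF: tied values (e.g. many zero-vote posts)
        # all receive the same probability.
        lo = bisect_left(sorted_votes, v)
        hi = bisect_right(sorted_votes, v)
        p = (lo + hi) / 2 / n
        # Clamp away from 0 and 1 so the inverse normal CDF stays finite.
        p = min(max(p, 0.5 / n), 1 - 0.5 / n)
        return std_normal.inv_cdf(p)

    def to_votes(z):
        # Approximate inverse: look up the empirical quantile of the score.
        p = std_normal.cdf(z)
        return sorted_votes[min(int(p * n), n - 1)]

    return to_normal, to_votes


votes = [0, 0, 9, 35, 172]  # the example counts from the question
to_normal, to_votes = fit_cdf_map(votes)
scores = [to_normal(v) for v in votes]
recovered = [to_votes(z) for z in scores]
```

A regressor would then be trained on `scores` with an ordinary loss, and its predictions passed through `to_votes` to get back to the vote scale.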

👤 tlb
Predict log(votes + 1) instead.
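
A small sketch of what that looks like on the example counts from the question, using only the standard library. The helper name `to_votes` is hypothetical, and clipping at zero is an assumption to keep back-transformed predictions non-negative:

```python
import math

votes = [0, 0, 9, 35, 172]  # the example counts from the question

# Regression targets: log(votes + 1). log1p is defined at 0, so
# zero-vote submissions need no special handling.
targets = [math.log1p(v) for v in votes]


def to_votes(pred):
    """Map a prediction on the log scale back to a vote count."""
    # expm1 inverts log1p; clip at zero in case the model predicts below 0.
    return max(0.0, math.expm1(pred))


# Averaging on the log scale and mapping back gives a geometric-style
# average, which is pulled far less by the 172 than the plain mean is.
log_scale_average = to_votes(sum(targets) / len(targets))
plain_average = sum(votes) / len(votes)
```

Note that a model fit with squared error on these targets estimates the average of log(votes + 1), so its back-transformed prediction will generally sit below the arithmetic mean of the raw counts.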