HACKER Q&A
📣 willj

What Is the Point of Snorkel?


I’ve read a bit about Snorkel, and I listened to an episode of Software Engineering Daily [1] this week, and this question has been bugging me. In Snorkel, you define labeling functions that label lots of data for you, rather than labeling it manually, and then use that labeled data to train machine learning models. My question is, why even train a machine learning model if you already have the functions to classify a dataset? You essentially have a decision tree, and it seems silly to train a model from scratch.


  👤 btown Accepted Answer ✓
While I haven't used the Snorkel library proper, I've used automated features in machine learning models in a similar way before.

The key insight is: labeling functions are assumed to be noisy and inaccurate.

https://www.snorkel.org/use-cases/01-spam-tutorial is worth rereading with this in mind (IMO, it buries the lede and is a bit hand-wavy about why this distinction matters). It's all about diminishing returns: if you're writing a manual labeling function for production, you'll quickly get to something that works maybe 50-80% of the time, and then you'd spend a LOT more time on the edge cases. So let a machine learning model figure out the edge cases for you!
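To make "noisy and inaccurate" concrete, here's a rough sketch in plain Python. It deliberately does not use the Snorkel API; the function names, the ABSTAIN/NOT_SPAM/SPAM constants, the messages, and the majority-vote combiner are all made up for illustration.

```python
# Hypothetical labeling functions in plain Python (not the Snorkel API).
ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1

def lf_contains_link(text):
    # Links are common in spam, but plenty of legitimate mail has them too.
    return SPAM if "http" in text.lower() else ABSTAIN

def lf_mentions_prize(text):
    lowered = text.lower()
    return SPAM if "prize" in lowered or "winner" in lowered else ABSTAIN

def lf_short_message(text):
    # Very short messages are assumed to be casual, legitimate replies.
    return NOT_SPAM if len(text.split()) <= 4 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_message]

def majority_vote(text):
    """Combine the non-abstaining votes; abstain on ties or when nothing fires."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    spam_votes = sum(v == SPAM for v in votes)
    not_spam_votes = len(votes) - spam_votes
    if spam_votes == not_spam_votes:
        return ABSTAIN
    return SPAM if spam_votes > not_spam_votes else NOT_SPAM

messages = [
    "You are a winner, claim your prize at http://example.com",   # spam, caught
    "Thanks, see you tomorrow",                                    # fine, caught
    "Here are the slides from today: http://example.com/slides",  # fine, mislabeled SPAM
    "u hav been selectd 4 a free cruise reply now 2 claim",       # spam, nothing fires
]
weak_labels = [majority_vote(m) for m in messages]
print(weak_labels)
```

Two of the four messages get a sensible label, one gets a wrong label, and one gets no label at all. Patching the heuristics to cover cases like the last two is exactly the edge-case work that gets expensive fast.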

Snorkel ensures your labeling function isn't taken as gospel. The training process will do its best to follow the function's guidance, but it's willing to say: "This NotSpam label from the function was probably totally wrong in this case, because the rest of the text in this example feels a LOT like all the other messages that the function labeled as Spam. Boosting (literally or figuratively) the strength of my conviction in my deep-text-analysis insights, which say that horrible grammar is more spammy than not, will make me perform better on the test set overall, even if I sacrifice this example."
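A toy version of a closely related effect, where the trained model disagrees with the function on a message it hasn't seen. This is a minimal sketch assuming a tiny word-count naive Bayes classifier as a stand-in for whatever model you'd actually train; it is not what Snorkel does internally, and the labeling function and messages are invented.

```python
import math
from collections import Counter

NOT_SPAM, SPAM = 0, 1

def lf_has_link(text):
    """Hypothetical, deliberately crude labeling function."""
    return SPAM if "http" in text.lower() else NOT_SPAM

def tokenize(text):
    return text.lower().split()

def train_naive_bayes(texts, labels):
    """Count words per class; assumes both classes appear in `labels`."""
    counts = {NOT_SPAM: Counter(), SPAM: Counter()}
    docs = Counter(labels)
    for text, label in zip(texts, labels):
        counts[label].update(tokenize(text))
    vocab = set(counts[NOT_SPAM]) | set(counts[SPAM])
    return counts, docs, vocab

def predict_spam_probability(model, text):
    """P(spam | text) with add-one smoothing; unknown words are ignored."""
    counts, docs, vocab = model
    log_score = {}
    for label in (NOT_SPAM, SPAM):
        total = sum(counts[label].values())
        score = math.log(docs[label] / sum(docs.values()))
        for token in tokenize(text):
            if token in vocab:
                score += math.log((counts[label][token] + 1) / (total + len(vocab)))
        log_score[label] = score
    return 1.0 / (1.0 + math.exp(log_score[NOT_SPAM] - log_score[SPAM]))

train_texts = [
    "free prize click http://a.example",
    "claim free prize now http://b.example",
    "meeting notes attached",
    "lunch at noon",
]
# The only supervision is the labeling function's output, not hand labels.
train_labels = [lf_has_link(t) for t in train_texts]
model = train_naive_bayes(train_texts, train_labels)

unseen = "claim your free prize now"   # spammy wording, but no link
lf_says = lf_has_link(unseen)
p_spam = predict_spam_probability(model, unseen)
print(lf_says, round(p_spam, 2))
```

The function only ever looks for a link, so it calls the unseen message NotSpam. The model never saw a hand label, but it picked up the words that tend to travel with links ("free", "prize", "claim") and leans strongly toward Spam.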

There are also formalisms for treating these as labels with uncertainty attached, and you can feed that uncertainty into the pipeline. https://ajratner.github.io/ has a lot of peer-reviewed research on how this works, and https://link.springer.com/article/10.1007/s00778-019-00552-1... has a number of figures that may be illustrative as well.
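One simple way to picture "labels with uncertainty attached" is sketched below. It assumes each labeling function has a known accuracy and that the functions err independently; both are simplifications, and the accuracy numbers are invented. As I understand the research, the accuracies are estimated from how the functions agree and disagree rather than handed in like this. The helper names are hypothetical.

```python
import math

ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1

def probabilistic_label(votes, accuracies, prior_spam=0.5):
    """Combine votes into P(spam), weighting each vote by its log-odds of being right."""
    log_odds = math.log(prior_spam / (1.0 - prior_spam))
    for vote, accuracy in zip(votes, accuracies):
        if vote == ABSTAIN:
            continue  # an abstaining function contributes no evidence
        weight = math.log(accuracy / (1.0 - accuracy))
        log_odds += weight if vote == SPAM else -weight
    return 1.0 / (1.0 + math.exp(-log_odds))

def soft_cross_entropy(p_label, p_model):
    """Loss for a model prediction p_model against a probabilistic target p_label."""
    return -(p_label * math.log(p_model) + (1.0 - p_label) * math.log(1.0 - p_model))

# Assumed accuracies: one fairly reliable function, two weak ones.
accuracies = [0.9, 0.6, 0.6]

# The reliable function says SPAM, both weak ones say NOT_SPAM.
p = probabilistic_label([SPAM, NOT_SPAM, NOT_SPAM], accuracies)
print(round(p, 3))

# A confident wrong prediction costs more than a hedged one on the same target.
print(soft_cross_entropy(p, 0.05) > soft_cross_entropy(p, 0.5))
```

A plain majority vote would call that example NotSpam. Weighting by reliability gives a soft label that leans Spam, and training against the soft label means examples with conflicting votes pull on the model less than examples where the functions agree.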