HACKER Q&A
📣 jairuhme

Why Build Models from scratch?


Hello - for those Data Scientists/MLEs who build models from scratch (i.e. say building a least squares model without using sklearn, etc.), why? Are there performance benefits if the pre-built model isn't built with a faster language or maybe restrictions on imports?


  👤 usgroup Accepted Answer ✓
If in “from scratch” you include custom models, then there are vast benefits mostly to do with including domain knowledge or with tackling a problem with more mechanical sympathy.

Here is an example from first principles:

https://emiruz.com/post/2023-10-15-logical-data-analysis/

It is a new perspective on an old dataset, made possible through a logico-symbolic analysis which includes some domain knowledge.

Other broad categories of examples would be mixed-integer programs (MIPs) and differentiable programs, which are less common than they should be.

More generally, and I say this as an experienced data scientist, practitioners of our sport in my opinion often lack research expertise, and will more typically reason from tools to problems rather than from problems to tools. If the latter were more often the case, I think that custom models would be more common.


👤 ssivark
It’s possible to customize algorithms in so many ways (for performance, domain correctness, or flexibility), provided you understand how they work under the hood and can bend them to what you want. It doesn’t make sense for a generic library to build an interface for all that, or even to try to be all things to all people.

Very rarely are you likely to find the best solution for your needs off-the-shelf, unless you have the exact same needs as the library authors (assuming also that they have access to more competence than you). IME, that is not often the case. And it’s turned out to be nicer from the long-term perspective (experimentation flexibility, maintainability, etc.) to roll my own. Of course, one tries to pick a layer of the stack to build on top of (e.g. NumPy/JAX, or Julia, or whatever) instead of boiling the ocean.


👤 in9
Yep, customization is usually the answer. Or the thing you need exists only in an ecosystem that is not viable to deploy (looking at you, R ecosystem).

For example, it's possible, and common, to have a penalized linear (or logistic) model where only some of the feature parameters being fit are subject to the penalization hyperparameter. Or more complex, customized loss functions, corresponding to zero-inflated or truncated distributions. Those are very cool and sometimes fit problems directly.
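To make the partial-penalization idea concrete, here is a minimal sketch in NumPy, assuming a squared-error loss with an L2 (ridge) penalty applied only to a chosen subset of coefficients. The function name `fit_partial_ridge`, the `penalized` mask, and the synthetic data are all made up for illustration; a logistic version would need an iterative solver instead of this closed form.

```python
import numpy as np

def fit_partial_ridge(X, y, penalized, lam):
    """Least squares with an L2 penalty on a subset of coefficients.

    Minimizes ||y - X b||^2 + lam * sum(b[j]**2 for j where penalized[j]).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Diagonal penalty matrix: lam on penalized columns, 0 on the free ones.
    D = np.diag(lam * np.asarray(penalized, dtype=float))
    # Setting the gradient to zero gives (X'X + D) b = X'y.
    return np.linalg.solve(X.T @ X + D, X.T @ y)

# Synthetic data (hypothetical): intercept plus three features.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
true_beta = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ true_beta + 0.1 * rng.normal(size=n)

# Penalize only the last two coefficients; intercept and first slope stay free.
mask = [False, False, True, True]
beta_ols = fit_partial_ridge(X, y, mask, lam=0.0)      # plain least squares
beta_pen = fit_partial_ridge(X, y, mask, lam=1000.0)   # shrinks only masked coefficients
```

With `lam=0.0` this reduces to ordinary least squares, and with a large `lam` only the masked coefficients are pulled toward zero. Whether a given off-the-shelf estimator exposes a per-coefficient penalty varies by library, which is the point being made above.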


👤 opportune
Not only can there be the obvious kinds of performance benefits (no Python/extra process overhead, more focused optimized libraries, etc) to implementing something yourself, you can also implement nice features like rolling window regressions that aren’t compatible with the mostly offline/batch APIs used in OSS and which can be baked directly into the production system.

This is probably non-obvious to most newer practitioners unfamiliar with the broader systems their model operates in, because their mental models of what’s possible are constrained by the tools they’re familiar with.
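As a rough illustration of the rolling-window point, here is a small pure-Python sketch of a one-feature regression that updates per observation instead of refitting on a batch. The class name `RollingSimpleRegression` and its methods are hypothetical, and this is just one way to do it (running sums over a fixed-size window), not a description of any particular production system.

```python
from collections import deque

class RollingSimpleRegression:
    """Fit y = a + b*x over the most recent `window` points, one update at a time."""

    def __init__(self, window):
        self.window = window
        self.points = deque()
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, x, y):
        # Add the new point to the running sums.
        self.points.append((x, y))
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y
        # Evict the oldest point once the window is exceeded.
        if len(self.points) > self.window:
            ox, oy = self.points.popleft()
            self.sx -= ox
            self.sy -= oy
            self.sxx -= ox * ox
            self.sxy -= ox * oy

    def coefficients(self):
        """Return (intercept, slope), or None if the fit is not yet determined."""
        n = len(self.points)
        denom = n * self.sxx - self.sx ** 2
        if n < 2 or denom == 0:
            return None
        slope = (n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / n
        return intercept, slope

model = RollingSimpleRegression(window=5)
for x in range(10):
    model.update(float(x), 3.0 + 2.0 * x)
# The data is exactly y = 3 + 2x, so the fit should recover those values.
intercept, slope = model.coefficients()
```

Each update is O(1), so it can sit inside a streaming service. One caveat: subtracting evicted points from running sums can accumulate floating-point error over very long streams, so a real implementation might periodically recompute the sums from the window.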


👤 nextos
You can get huge performance benefits, where performance may also imply stability or convergence.

For example, recently I had to make a parametric fit of a large dataset to an exotic distribution. Hand-tuning maximum likelihood and rounding observations to lower precision, i.e. having a custom likelihood function, made it solvable and numerically stable.

Using e.g. Julia, or JAX, you can get access to lower level primitives that are composable and let you do this relatively quickly, without needing to code everything from scratch, just whatever you are interested in changing.
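One possible reading of the rounding trick, sketched in plain Python: rounding collapses a large dataset into a small number of distinct values, so the log-likelihood becomes a weighted sum over (value, count) pairs. Everything here is an assumption for illustration: the exponential distribution stands in for the "exotic" one, the helper names are made up, and treating a rounded value as an exact observation introduces a small bias that a more careful version would handle as interval censoring.

```python
import math
import random
from collections import Counter

def grouped_neg_log_lik(log_rate, counts):
    """Negative exponential log-likelihood over (value, count) pairs.

    Parameterized by log(rate) so the optimizer never leaves the valid domain.
    """
    rate = math.exp(log_rate)
    return -sum(c * (log_rate - rate * v) for v, c in counts.items())

def golden_section_min(f, lo, hi, tol=1e-8):
    """Minimize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d   # minimum lies in [a, d]
        else:
            a = c   # minimum lies in [c, b]
    return (a + b) / 2

# Synthetic data (hypothetical): 50,000 draws from an exponential with rate 2.5.
random.seed(0)
data = [random.expovariate(2.5) for _ in range(50_000)]

# Rounding to 2 decimals leaves only a few hundred distinct values to evaluate.
counts = Counter(round(x, 2) for x in data)

log_rate_hat = golden_section_min(lambda t: grouped_neg_log_lik(t, counts), -5.0, 5.0)
rate_hat = math.exp(log_rate_hat)

# For the exponential there is a closed-form MLE to check against.
rate_closed_form = len(data) / sum(v * c for v, c in counts.items())
```

The exponential is used only because its closed-form estimate gives something to compare against; for a distribution without one, the same grouped likelihood could be handed to an autodiff-based optimizer in JAX or Julia.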


👤 qd011
> (i.e. say building a least squares model without using sklearn, etc.)

It's 2023; when I read "build models from scratch", I think about training a model from scratch without using any pretrained models.

For your definition, there's no good use case apart from learning, or making a custom implementation for something that doesn't exist in a library.


👤 DamonHD
I just built a very small model from scratch because it made for a very simple (pageful of code) solution to test a simple claim, which I then expanded to cover some other cases. Performance was not relevant, simplicity was.