1. Why is there so little "unbiased" info about deploying/serving ML models in production? (I mean except the official docs of frameworks like e.g. TensorFlow, which obviously suggest their mothership's own services/solutions.)
2. Do you hand code microservices around your TF or Pytorch (or sklearn / homebrewed / "shallow" learning) models?
3. Do you use TensorFlow Serving? (If so, is this working fine for you with Pytorch models too?)
4. Is using Go infra like e.g. the Cortex framework common? (I keep reading about it, I love the idea and I'd love to use a static language here, just not Java, but I've talked with noooone who's actually used it.)
5. And going beyond the basics: is there any good established recipe for deploying and scaling models with dynamic re-training (e.g. the user app exposes something like a "retrain with params X + Y + Z" API action, callable in response to user actions, so the user controls training too; see the sketch after these questions) that doesn't break horribly with more than tens of users?
P.S. Links to any collections of "established best practices" or "playbooks" would be awesome!
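To make question 5 concrete, here's a minimal sketch of the kind of endpoint I have in mind. Everything here is hypothetical: the route, the parameter names, and the in-process queue (which would really be something like Celery/Redis) are illustrative assumptions, not any existing framework's API.

```python
# Hypothetical sketch of the "retrain with params X + Y + Z" action from question 5.
# Route, parameter names, and the queue-backed worker are illustrative assumptions.
from queue import Queue
from threading import Thread
from uuid import uuid4

from flask import Flask, jsonify, request

app = Flask(__name__)
jobs = Queue()  # in practice this would be a proper job queue (Celery/Redis/etc.)

def worker():
    while True:
        job_id, params = jobs.get()
        # a real train_model(params) would run training and persist the new artifact
        print(f"retraining job {job_id} with {params}")
        jobs.task_done()

Thread(target=worker, daemon=True).start()

@app.route("/models/<model_id>/retrain", methods=["POST"])
def retrain(model_id):
    params = request.get_json()          # e.g. {"x": 1.0, "y": 0.5, "z": "rbf"}
    job_id = str(uuid4())
    jobs.put((job_id, params))           # enqueue instead of blocking the request
    return jsonify({"model_id": model_id, "job_id": job_id, "status": "queued"}), 202
```

The point is that retraining is queued per request rather than run synchronously, which is where I'd expect things to start breaking once you get past tens of users.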
2. Hand-coding to begin with is fine, but as you start to scale the number of production models and actually productionize them at scale, it becomes infeasible and leads to plenty of maintenance issues. There are a few model-infrastructure tools that help with this, but again, many are homegrown because the market is still new. Algorithmia and Seldon are pretty good starting points.
3. We rarely use the serving options provided, as the challenge is integrating them with the rest of engineering. Service monitoring gets handled by different teams.
4. Depends on the industry and use case. Again, integration and maintenance come into play. Go/Cortex might make sense, but a lot of companies leverage Spark, so Scala/Java could be the choice for production models.
5. We’re working on creating this recipe for enterprises. I believe Seldon (open source) might contain this capability. The challenge as you pointed out is ensuring things don’t break!
All of our core contributors + a good number of users are in there, and we're all happy to chat.
We've been doing consulting for more than six years, and we're building a platform precisely to solve the problems we've encountered and that you're writing about. We've learned some things that we're encoding in the platform, which may be useful in case you want to build your own. We started doing this because we hit a ceiling on the projects we could take on, and we were under stress. We're a tiny, tiny team.
The problems are in the interfaces between different roles, with each role having a stack with a gazillion tools and a different "language" they speak and universe they live in. Stitching together people's interactions, the workflow, the business problems, and the fragmented tooling is problematic. The inflexibility of said tooling and frameworks, which you addressed, also meant we couldn't use them or other platforms. This is why we are working hard to build a coherent, integrated experience, while still trying to build abstractions that let us substitute tools and treat them as simple components, rather than being tied to any one of them.
For now, it allows you to create a notebook from several images with most libraries pre-installed. The infra it's deployed on offers Tesla K80 GPUs, which you can use. You can of course install additional libraries.
This solves the problem of setting up the environment, CUDA, the Docker engine, runtime versions, and the usual yak shaving. We're only using JupyterHub and JupyterLab for Python notebooks for now, as that is what our colleagues use, but we plan to support more.
It also solves the "it works on my machine" problem of running a colleague's notebook.
You can click a button to publish an AppBook and share it with a domain expert right away to play with. It is automatically parametrized for you, so you don't have to fiddle with widgets; form fields are generated for the parameters automatically. Parameters and metrics are tracked behind the scenes without you doing anything, and the models are saved to object storage. Again, one role we target is the ML practitioner who does not necessarily remember to do these things, so we do it for them.
Here's a video from a very early version: https://app.box.com/s/mwsw79g3d5b974o625f1mw979cc4znf0
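As a rough illustration of what "form fields for parameters" means, here's a hand-rolled sketch with ipywidgets; this is just an analogy for the experience, not how the platform actually implements it, and the function and parameter values are made up.

```python
# Hypothetical illustration of turning a training function's parameters into form fields.
# ipywidgets is a real library, but the platform's own mechanism may differ entirely.
from ipywidgets import interact

def train(learning_rate=0.01, epochs=10, optimizer="adam"):
    # stand-in for the real training loop inside the notebook
    print(f"training with lr={learning_rate}, epochs={epochs}, optimizer={optimizer}")

# interact() inspects the arguments and renders a slider, an integer field, and a dropdown,
# which is roughly the experience a domain expert gets from a published AppBook.
interact(train, learning_rate=(0.001, 0.1, 0.001), epochs=(1, 100), optimizer=["adam", "sgd"])
```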
We're using MLFlow for that, but plan to support GuildAI and Cortex. We think hard about making things loosely coupled and configurable, so you get to pick the stack and easily integrate the platform with your existing stack.
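To give a sense of what the behind-the-scenes tracking amounts to, here's a minimal MLflow sketch of logging parameters, metrics, and a model. The experiment name, parameters, and model are made up; the platform performs the equivalent of this for you automatically.

```python
# Minimal MLflow tracking sketch; experiment name, params, and model are illustrative only.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

mlflow.set_experiment("appbook-demo")            # hypothetical experiment name
with mlflow.start_run():
    params = {"C": 0.5, "max_iter": 200}
    model = LogisticRegression(**params).fit(X, y)

    mlflow.log_params(params)                    # parameters tracked behind the scenes
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")     # artifact lands in the configured object storage
```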
The AppBook is super useful in that you can publish it and then use it to train the model, or share it with a domain expert so they can play with different parameters. One of the problems we've seen is that some features are considered unimportant by an ML practitioner but are critical to domain experts.
Tightening that feedback loop from notebook to domain expert is what makes the one-click AppBook important: it saves you from scheduling meetings and figuring out how to "show" the domain expert the work, while still allowing them to interact with it.
You can also deploy the models you choose with one click; it gives you an endpoint and generates a tutorial on how to hit that endpoint and invoke the model with curl or Python requests. You can generate a token and invoke the model from other places or services.
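For a sense of what the generated tutorial covers, here's a sketch of invoking such an endpoint with Python requests. The URL, header name, and payload shape are assumptions; the real values come from the tutorial generated for your deployment.

```python
# Hypothetical invocation of a deployed model endpoint; URL, header, and payload
# are illustrative assumptions -- the generated tutorial provides the real values.
import requests

ENDPOINT = "https://models.example.com/v1/churn-model/predict"   # made-up endpoint
TOKEN = "paste-the-generated-token-here"

payload = {"instances": [[5.1, 3.5, 1.4, 0.2]]}                  # made-up input schema

resp = requests.post(
    ENDPOINT,
    json=payload,
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())                                               # e.g. {"predictions": [...]}
```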
This self-service feature is important because it allows an ML practitioner to "deploy" their own model without asking a colleague, who might be busy with other things, to do it for them. Self-service is super important throughout.
Right now, we're focusing on fixing bugs, improving tests, and adding monitoring before going back to feature development. Some features we were working on are more flexible and scalable model deployment strategies, monitoring, collaboration, retraining, data streams, and building the SDK.