HACKER Q&A
📣 EvgenyAr

What do you use to run ML/DL in prod?


I know there are several options for serving ML models in production. I'm currently going through the MLOps course from Andrew Ng (https://www.coursera.org/specializations/machine-learning-engineering-for-production-mlops). The theoretical part is wonderful, but the part about TensorFlow (TFX, Transform) is really a pain.

So, my questions:

- Which stack do you use to serve ML in prod properly, with CI/CD, monitoring, and scaling?
- If you use the TF stack, what do you think about it?

Thanks!


  👤 ra-mos Accepted Answer ✓
A lot of this depends on what “prod” means. If it means production applications, you are largely just retrofitting DevOps processes to incorporate model APIs. Nothing really changes from how you deploy and monitor your systems.

Production models, on the other hand, probably means some advanced domain-specific use case. The problem with all these MLOps platforms and model-serving services is that they are abstractions for general use cases. Call it the capitalism effect; I wouldn’t buy into them quite yet.

To serve production models, you’re going to need to get low-level, especially if performance is a feature. You’ll need to figure out architecture designs that work best for your use case.

For example, I serve NLP models that augment a complex enterprise search engine. These are large TensorFlow models with large embedding spaces.

We use Kubernetes, SSD optimizations for fast embedding retrieval, and custom-compiled TensorFlow container images. On top of that we have a sort of demux architecture that custom-batches every user request into sub-requests (100s-1000s), which fan out for inference across Kubernetes.

Almost every API is in Go; we replaced Python with Go to run the models. We use FlatBuffers for every request, so we spend almost nothing on serialization as each user request turns into hundreds that come back as a single response in ~1s. With some CPU optimizations along the way, we’re now happy with the current prod infrastructure.
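To make the fan-out idea concrete, here is a minimal Go sketch of splitting one request into sub-requests, scoring each concurrently, and gathering the results back into a single response. The `SubRequest`/`SubResult` types and the `infer` callback are hypothetical stand-ins for the FlatBuffers payloads and the calls to model-serving pods described above; the real system's batching, networking, and error handling are omitted.

```go
package main

import (
	"fmt"
	"sync"
)

// SubRequest and SubResult are hypothetical placeholders for the
// FlatBuffers-encoded payloads mentioned in the post.
type SubRequest struct {
	ID   int
	Text string
}

type SubResult struct {
	ID    int
	Score float64
}

// fanOut scores every sub-request concurrently via infer (a stand-in
// for an RPC to a model-serving pod) and collects the results in order.
func fanOut(subs []SubRequest, infer func(SubRequest) SubResult) []SubResult {
	results := make([]SubResult, len(subs))
	var wg sync.WaitGroup
	for i, s := range subs {
		wg.Add(1)
		go func(i int, s SubRequest) {
			defer wg.Done()
			// Each goroutine writes to its own slot, so no mutex is needed.
			results[i] = infer(s)
		}(i, s)
	}
	wg.Wait()
	return results
}

func main() {
	// Pretend one user query was split into five chunks.
	subs := make([]SubRequest, 5)
	for i := range subs {
		subs[i] = SubRequest{ID: i, Text: fmt.Sprintf("chunk-%d", i)}
	}
	// Stub inference: score is just the text length.
	out := fanOut(subs, func(s SubRequest) SubResult {
		return SubResult{ID: s.ID, Score: float64(len(s.Text))}
	})
	for _, r := range out {
		fmt.Printf("sub %d -> %.0f\n", r.ID, r.Score)
	}
}
```

In a real deployment each `infer` call would hit a different pod, and the aggregation step is what turns hundreds of sub-requests back into one ~1s response.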

AFAIK, no MLOps tool could’ve done this for us. The major new thing we do is capture every metric we can, and incorporate UX as part of ML research and retraining.