I've been searching around and haven't found a clear standard/best way to do this.
Here are some of the options I've considered:
- Algorithmia (came across this yesterday, unsure how good it is and have some questions about the licensing)
- Something fancy with Kubernetes
- Write a load balancer and manually spin up new instances when needed.
Right now I'm leaning towards Algorithmia as it seems to be cost-effective and basically designed to do what I want. But I'm unsure how it handles long model loading times, or if the major cloud providers have similar services.
I'm quite new to this kind of architecture and would appreciate some thoughts on the best way to accomplish this!
Cortex automates all of the DevOps work: containerizing your model, orchestrating Kubernetes, and autoscaling instances to meet demand. We have a bunch of PyTorch examples in our repo if you're interested: https://github.com/cortexlabs/cortex/tree/master/examples
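For a sense of the shape of the workflow: Cortex-style predictors are plain Python classes that load the model once at startup and then serve requests, which is also how long model-loading times are kept out of the request path. A minimal sketch (the class/method names follow the examples repo's convention; the stand-in "model" below is a stub I've substituted so the sketch runs without PyTorch):

```python
# Sketch of a predictor class in the Cortex style. The expensive work
# (loading weights) happens once in __init__, i.e. at container startup,
# so each request only pays for inference.
class PythonPredictor:
    def __init__(self, config):
        # In a real deployment this would be something like
        # torch.load(config["model_path"]); a stub stands in here.
        self.model = lambda x: x * 2  # placeholder for a loaded model

    def predict(self, payload):
        # Called once per request with the deserialized request body.
        return self.model(payload["value"])
```

The same load-once-serve-many pattern applies whichever serving stack you pick, and it's the main mitigation for slow model loads: pay the cost at scale-up, not per request.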
SageMaker takes care of the infrastructure for you. It has also been integrated with orchestration tools like Kubernetes and Airflow.
https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_...
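Concretely, SageMaker's PyTorch serving container routes each request through a small set of hooks you implement in an inference script. A self-contained sketch with stub bodies (the hook names are SageMaker's documented ones; the stand-in model is mine, so the sketch runs without torch or AWS):

```python
import json

def model_fn(model_dir):
    # Real code would load weights from model_dir, e.g.
    # torch.load(os.path.join(model_dir, "model.pt")).
    return lambda xs: [v * 2 for v in xs]  # stand-in "model"

def input_fn(request_body, content_type="application/json"):
    # Deserialize the raw request body into model input.
    return json.loads(request_body)

def predict_fn(data, model):
    # Run inference on the deserialized input.
    return model(data)

def output_fn(prediction, accept="application/json"):
    # Serialize the prediction for the HTTP response.
    return json.dumps(prediction)
```

SageMaker calls model_fn once when the endpoint instance starts, so, as with the other options here, the model-loading cost is paid at scale-up rather than per request.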
With Kubernetes, you can either wrap your model inside a container or mount it into the container from a persistent volume.
As for scaling you have two options:
1) Horizontal Pod Autoscaler https://kubernetes.io/docs/tasks/run-application/horizontal-...
2) Knative, a serverless platform built on Kubernetes that also works on-prem.
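For intuition on option 1, the Horizontal Pod Autoscaler's core decision comes down to a single formula from the Kubernetes docs, which is easy to sketch:

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric):
    """HPA scaling rule: desired = ceil(current * metric / target).

    E.g. with CPU utilization as the metric: 4 pods averaging 90%
    against a 60% target scale out to ceil(4 * 90 / 60) = 6 pods.
    """
    return math.ceil(current_replicas * current_metric / target_metric)
```

One caveat for model serving: the HPA only adds pods, so a slow model load still delays each newly scheduled pod; mounting weights from a persistent volume (as mentioned above) or baking them into the image keeps that scale-up latency down.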
PyTorch itself? Not sure. Last I spoke to that team, they had a serving solution of their own too.