HACKER Q&A
📣 blululu

What do you use for ML Hosting?


I'm trying to set up a server to run ML inference. I need to provision a somewhat beefy GPU with a decent amount of memory (8-16 GB). Does anyone here have personal experience and recommendations about the various companies operating in this space?


  👤 thundergolfer Accepted Answer ✓
On Modal.com these 34 lines of code are all you need to serverlessly run BERT fill-mask inference on an A10G (which has 24 GB of GPU memory). No Dockerfile, no YAML, no Terraform or AWS CloudFormation. Just these 34 lines.

  import modal

  def download_model():
      from transformers import pipeline
      pipeline("fill-mask", model="bert-base-uncased")

  CACHE_PATH = "/root/model_cache"  # model location in image
  ENV = modal.Secret({"TRANSFORMERS_CACHE": CACHE_PATH})

  image = (
      modal.Image.debian_slim()
      .pip_install("torch", "transformers")
      .run_function(download_model, secret=ENV)  # runs at image build time
  )
  stub = modal.Stub(name="hn-demo", image=image)


  class Model:
      def __enter__(self):  # runs once per container, before any requests
          from transformers import pipeline
          self.model = pipeline("fill-mask", model="bert-base-uncased", device=0)

      @stub.function(
          gpu="a10g",
          secret=ENV,
      )
      def handler(self, prompt: str):
          return self.model(prompt)


  if __name__ == "__main__":
      with stub.run():
          prompt = "Hello World! I am a [MASK] machine learning model."
          print(Model().handler.call(prompt)[0]["sequence"])

Running `python hn_demo.py` prints "Hello World! I am a simple machine learning model."

You can check out available GPUs at https://modal.com/docs/reference/modal.gpu.

There's also a bunch of easy-to-run examples in our docs :) https://modal.com/docs/guide/ex/stable_diffusion_cli


👤 edunteman
Hey! Would love to have you try https://banana.dev (bias: I'm one of the founders). We run A100s for you and scale 0->1->n->0 on demand, so you only pay for what you use.

I'm at erik@banana.dev if you want any help with it :)


👤 howon92
Here are some candidates:

- HuggingFace Inference Endpoints: https://huggingface.co/inference-endpoints
- Amazon SageMaker: https://aws.amazon.com/sagemaker/
- Replicate: https://replicate.com/

The first two are more customizable than the last. SageMaker is the cheapest.
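
Replicate is probably the fastest of the three to test drive. A minimal sketch with its Python client (the model ref and input keys here are placeholders; copy the real ones from the model's page):

  # pip install replicate; expects REPLICATE_API_TOKEN in the environment
  import replicate

  output = replicate.run(
      "owner/model:version",  # placeholder; use the ref from the model page
      input={"prompt": "a photo of an astronaut riding a horse"},
  )
  print(output)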


👤 version_five
My preference is not to have to change my code to use some special framework, and to just get access to a GPU machine I can run my stuff on.

I'm assuming you know what you need for a GPU. If you're unsure, consider running inference on a CPU to see how long it takes and whether it could work.
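
For example, a rough timing check with a transformers pipeline on CPU (minimal sketch; swap in your own model):

  import time
  from transformers import pipeline

  pipe = pipeline("fill-mask", model="bert-base-uncased", device=-1)  # -1 = CPU
  start = time.perf_counter()
  pipe("Hello World! I am a [MASK] machine learning model.")
  print(f"one inference took {time.perf_counter() - start:.2f}s")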

And then just look at price and reliability for a GPU machine across the different cloud providers. OVH is cheap, but the only thing worse than their reliability is their customer service. Cheap niche players offering V100s used to pop up regularly. AWS is more expensive and more reliable, though it may still have availability problems. Paperspace looks pretty good. Etc.


👤 chiragjn
Disclaimer: I work at Truefoundry

You can give us a shot at https://truefoundry.com. We are a general-purpose ML deployment platform that works on top of your existing Kubernetes clusters (AWS EKS, GCP GKE, or Azure AKS), abstracting away the complexity of dealing with cloud providers and Kubernetes. We support Services for ML web apps and APIs, Jobs for ML training, a Model Registry for storing models, and Model Servers for no-code model deployments. (Our platform can be partially or completely self-hosted for privacy and compliance.)

Adding one or more GPUs (V100, T4, A10, A100, etc.) is simply one extra line: https://docs.truefoundry.com/docs/gpus#adding-gpu-to-service...

Examples:

- Stable Diffusion with Gradio: https://github.com/truefoundry/truefoundry-examples/tree/mai...

- GPT-J 6B fp16 with FastAPI: https://github.com/truefoundry/truefoundry-examples/tree/mai...


👤 tikkun
For serverless: check the list I posted here: https://news.ycombinator.com/item?id=34742087 (I ended up using Banana; it was fine.)

For non-serverless, some to check out are below (though likely all overkill if you just need a single GPU):

- CoreWeave: https://www.coreweave.com/
- Vast.ai: https://vast.ai
- Lambda Labs: https://lambdalabs.com


👤 _boffin_
I’m using a Docker container on Ubuntu, which runs on my home lab's ESXi 6.5 hypervisor. I'm going to build a new machine with a few hundred GB of RAM, and then, at some point in the next 6 months, look at getting a good GPU with a bunch of VRAM.

I wrapped the thing in a Flask app so I can expose the APIs I build out.
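
A stripped-down version of that wrapper might look like this (sketch; the model and route are placeholders):

  from flask import Flask, jsonify, request
  from transformers import pipeline

  app = Flask(__name__)
  model = pipeline("fill-mask", model="bert-base-uncased")  # placeholder model

  @app.route("/predict", methods=["POST"])
  def predict():
      # expects JSON like {"prompt": "... [MASK] ..."}
      return jsonify(model(request.get_json()["prompt"]))

  if __name__ == "__main__":
      app.run(host="0.0.0.0", port=8000)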


👤 smoldesu
I'm currently running a Discord bot with a 7B model off a free Oracle Ampere instance with their PyTorch Accelerated[0] image. It's not terribly fast, but it's totally usable for group chats that want to interrogate an AI. If you're doing some sort of offline processing or other non-time-sensitive operation, something like this might be worth looking into.

[0] https://cloudmarketplace.oracle.com/marketplace/en_US/adf.ta...


👤 GC_tris
Genesis Cloud (https://www.genesiscloud.com/pricing).

Disclaimer: I am the CTO ;)

Why use us?

- Competitive prices (billed by the minute; you only pay while an instance is running)
- High reliability (professional data centers, hardware customized to requirements)
- Good connectivity (traffic is also free; no ingress/egress fees)
- High security (full VMs with dedicated GPUs and proper separation of customers, instead of shared hosts running Docker)
- Free storage
- A great support team
- Green energy (no greenwashing by carbon offsetting; we use energy sources that are renewable and carbon-free at the source: geothermal/hydro)

I could go on... We'd love it if you just tried our services; after sign-up there are free credits available for risk-free testing.


👤 asadm
I have had a good experience with Replicate and RunPod. Replicate seems nicer but has a very bad cold-boot issue. RunPod is great once you have an app set up!

I use a mix of both for my side project: https://trainengine.ai


👤 psshank
Try www.salad.com. We've got 10k+ GPUs, from 8 GB to 24 GB, and you get 10x more inferences per dollar compared to others. Our product team is pretty happy to help out on Discord. Some prices of interest:

- RTX 3060 (12 GB): $0.08/hr
- RTX 3090 (24 GB): $0.25/hr

👤 rgbrgb
Wow, looks like there are a ton of choices here I haven't looked at. For iterate.world we use Replicate, but we just added Kandinsky via RunPod. I'm thinking about switching everything to RunPod because it's 5-10x cheaper and we only use models they have anyway.

There's one I won't share that's now defunct, but it let you use any diffusers-compatible project on Hugging Face, which was such a cool feature. I wish someone (cheap) would implement this!

edit: just looked at banana.dev in this thread; their templates look closest to the Hugging Face integration, though I don't think they have webhooks.


👤 RGDub
I've been very happy with Genesis Cloud (www.genesiscloud.com) - they have worked with me on getting additional GPU capacity and have very reasonable prices: $0.70 USD/hr for an Nvidia GeForce RTX 3090. They give you $15 in credits for starting an account, but you can get $50 with this referral code: https://gnsiscld.co/x5tpz

👤 Areibman
Baseten was by far the easiest setup I've tried https://www.baseten.co

👤 lordofgibbons
Do any of the "serverless"/SaaS model-hosting services perform optimizations such as quantization or input micro-batching?
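
(For quantization at least, it's easy to do yourself before uploading the model. A minimal PyTorch sketch, assuming a BERT-style model:)

  import torch
  from transformers import AutoModelForMaskedLM

  model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
  # int8-quantize the Linear layers' weights; activations stay fp32
  quantized = torch.quantization.quantize_dynamic(
      model, {torch.nn.Linear}, dtype=torch.qint8
  )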

👤 pj_mukh
If you’re using Python, Modal (modal.com) was awesome to set up.

They’ll take a FastAPI setup too and just put it online to be used on demand.
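
Something like this (a sketch based on Modal's docs around this time; decorator names may have changed since):

  import modal
  from fastapi import FastAPI

  web_app = FastAPI()
  stub = modal.Stub("fastapi-demo")

  @web_app.get("/hello")
  def hello():
      return {"message": "Hello from Modal"}

  @stub.function()
  @modal.asgi_app()  # serve the FastAPI app as a Modal web endpoint
  def app():
      return web_app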


👤 zitterbewegung
Have you tried self-hosting? All you need is business internet with a static IP, which is quite inexpensive, and inference can be done on CPU depending on what you want to run. Also, wherever you're hosting, a good rule of thumb is to have at least 1.5x as much system RAM as VRAM (e.g., at least 36 GB of RAM alongside a 24 GB GPU).

👤 joshhart
I'm the director of engineering for Databricks' model serving product. It's serverless, meaning it autoscales to and from zero. If you are a Databricks customer or willing to become one, you can reach out about enrolling in the GPU preview.

👤 bfirsh
Founder of https://replicate.com/ here, which has been mentioned a few times. Happy to help you get set up. :) ben@replicate.com

👤 outdoorblake
Banana.dev is what I use. The cold boots are fast.

👤 jetml
Check out JetML.com (I'm the founder). Happy to help get you started with a demo if you want to reach out nick@jetml.com.

👤 chaoyu_
Check out BentoML https://github.com/bentoml

👤 sa-code
What about just using a cloud VM with an Ansible script? I find ML deployment solutions to be very over-engineered.

👤 pythops
Nvidia Jetson Nano boards (the Orin and the previous generation) at home. The cloud is so expensive for GPU usage.

👤 efxhoy
We use SageMaker at work because we're on AWS. I don't really like their style of APIs, but it works.

👤 tehsauce
Vast.ai. Nobody has better prices.

👤 jvanillaaaa
Brev.dev

This is exactly what you’re looking for


👤 sjkoelle
If you want to host voice ML models, check out Uberduck.

👤 ihgautam
KFServing (now KServe).