HACKER Q&A
📣 blululu

What do you use for ML Hosting?


I'm trying to set up a server to run ML inference. I need to provision a somewhat beefy GPU with a decent amount of memory (8-16 GB). Does anyone here have personal experience and recommendations about the various companies operating in this space?


  👤 thundergolfer Accepted Answer ✓
On Modal.com these 34 lines of code are all you need to serverlessly run BERT fill-mask inference on an A10G (which has 24 GB of GPU memory). No Dockerfile, no YAML, no Terraform or AWS CloudFormation. Just these 34 lines.

  import modal

  def download_model():
      from transformers import pipeline
      pipeline("fill-mask", model="bert-base-uncased")

  CACHE_PATH = "/root/model_cache"  # model location in image
  ENV = modal.Secret({"TRANSFORMERS_CACHE": CACHE_PATH})

  image = (
      modal.Image.debian_slim()
      .pip_install("torch", "transformers")
      .run_function(download_model, secret=ENV)  # runs at image build time
  )
  stub = modal.Stub(name="hn-demo", image=image)


  class Model:
      def __enter__(self):  # runs once per container, before any requests
          from transformers import pipeline
          self.model = pipeline("fill-mask", model="bert-base-uncased", device=0)

      @stub.function(
          gpu="a10g",
          secret=ENV,
      )
      def handler(self, prompt: str):
          return self.model(prompt)


  if __name__ == "__main__":
      with stub.run():
          prompt = "Hello World! I am a [MASK] machine learning model."
          print(Model().handler.call(prompt)[0]["sequence"])

Running `python hn_demo.py` prints "Hello World! I am a simple machine learning model."

You can check out available GPUs at https://modal.com/docs/reference/modal.gpu.

There's also a bunch of easy-to-run examples in our docs :) https://modal.com/docs/guide/ex/stable_diffusion_cli


👤 edunteman
Hey! Would love to have you try https://banana.dev (bias: I'm one of the founders). We run A100s for you and scale 0->1->n->0 on demand, so you only pay for what you use.

I'm at erik@banana.dev if you want any help with it :)


👤 howon92
Here are some candidates:

- HuggingFace Inference Endpoints: https://huggingface.co/inference-endpoints
- Amazon SageMaker: https://aws.amazon.com/sagemaker/
- Replicate: https://replicate.com/

The first two are more customizable than the last. SageMaker is the cheapest.
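
Replicate is probably the fastest of the three to test drive. A minimal sketch with its Python client (the model ref and input keys here are placeholders; copy the real ones from the model's page):

  # pip install replicate; expects REPLICATE_API_TOKEN in the environment
  import replicate

  output = replicate.run(
      "owner/model:version",  # placeholder; use the ref from the model page
      input={"prompt": "a photo of an astronaut riding a horse"},
  )
  print(output)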


👤 version_five
My preference is not to have to change my code to use some special framework, and to just get access to a GPU machine I can run my stuff on.

I'm assuming you know what you need for a GPU. If you're unsure, consider running inference on a CPU to see how long it takes and whether it could work.
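
For example, a rough timing check with a transformers pipeline on CPU (minimal sketch; swap in your own model):

  import time
  from transformers import pipeline

  pipe = pipeline("fill-mask", model="bert-base-uncased", device=-1)  # -1 = CPU
  start = time.perf_counter()
  pipe("Hello World! I am a [MASK] machine learning model.")
  print(f"one inference took {time.perf_counter() - start:.2f}s")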

And then just look at price and reliability for a GPU machine across the different cloud providers. OVH is cheap, but the only thing worse than their reliability is their customer service. Cheap niche players offering V100s used to pop up regularly. AWS is more expensive and more reliable, though it may still have availability problems. Paperspace looks pretty good. Etc.


👤 chiragjn
Disclaimer: I work at Truefoundry

You can give us a shot at https://truefoundry.com. We are a general-purpose ML deployment platform that works on top of your existing Kubernetes clusters (AWS EKS, GCP GKE, or Azure AKS), abstracting away the complexity of dealing with cloud providers and Kubernetes. We support Services for ML web apps and APIs, Jobs for ML training, a Model Registry for storing models, and Model Servers for no-code model deployments. (Our platform can be partially or completely self-hosted for privacy and compliance.)

Adding one or more GPUs (V100, T4, A10, A100, etc.) is simply one extra line: https://docs.truefoundry.com/docs/gpus#adding-gpu-to-service...

Examples:

- Stable Diffusion with Gradio: https://github.com/truefoundry/truefoundry-examples/tree/mai...

- GPT-J 6B fp16 with FastAPI: https://github.com/truefoundry/truefoundry-examples/tree/mai...


👤 tikkun
For serverless: check the list I posted here: https://news.ycombinator.com/item?id=34742087 (I ended up using Banana; it was fine.)

For non-serverless, some to check out are below (though likely all overkill if you just need a single GPU):

- CoreWeave: https://www.coreweave.com/
- Vast.ai: https://vast.ai
- Lambda Labs: https://lambdalabs.com


👤 _boffin_
I’m using a Docker container on Ubuntu, which runs on my home lab's ESXi 6.5 hypervisor. I'm going to build a new machine with a few hundred GB of RAM, and then, at some point in the next 6 months, look at getting a good GPU with a bunch of VRAM.

I wrapped the thing in a Flask app so I can expose the APIs I build out.
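
A stripped-down version of that wrapper might look like this (sketch; the model and route are placeholders):

  from flask import Flask, jsonify, request
  from transformers import pipeline

  app = Flask(__name__)
  model = pipeline("fill-mask", model="bert-base-uncased")  # placeholder model

  @app.route("/predict", methods=["POST"])
  def predict():
      # expects JSON like {"prompt": "... [MASK] ..."}
      return jsonify(model(request.get_json()["prompt"]))

  if __name__ == "__main__":
      app.run(host="0.0.0.0", port=8000)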


👤 smoldesu
I'm currently running a Discord bot with a 7B model off a free Oracle Ampere instance with their PyTorch Accelerated[0] image. It's not terribly fast, but it's totally usable for group chats that want to interrogate an AI. If you're doing some sort of offline processing or other non-time-sensitive operation, something like this might be worth looking into.

[0] https://cloudmarketplace.oracle.com/marketplace/en_US/adf.ta...


👤 GC_tris
Genesis Cloud (https://www.genesiscloud.com/pricing).

Disclaimer: I am the CTO ;)

Why use us?

- Competitive prices (billed by the minute; you only pay while an instance is running)
- High reliability (professional data centers, hardware customized to requirements)
- Good connectivity (traffic is also free; no ingress/egress fees)
- High security (full VMs with dedicated GPUs and proper separation of customers, instead of shared hosts running Docker)
- Free storage
- A great support team
- Green energy (no greenwashing by carbon offsetting; we use energy sources that are renewable and carbon-free at the source: geothermal/hydro)

I could go on... We'd love it if you just tried our services; after sign-up there are free credits available for risk-free testing.


👤 asadm
I have had a good experience with Replicate and RunPod. Replicate seems nicer but has a very bad cold-boot issue. RunPod is great once you have an app set up!

I use a mix of both for my side project: https://trainengine.ai


👤 psshank
Try www.salad.com. We've got 10k+ GPUs, from 8 GB to 24 GB, and you get 10x more inferences per dollar compared to others. Our product team is pretty happy to help out on Discord. Some prices of interest:

- RTX 3060 (12 GB): $0.08/hr
- RTX 3090 (24 GB): $0.25/hr

👤 rgbrgb
Wow, looks like there are a ton of choices here I haven't looked at. For iterate.world we use Replicate, but we just added Kandinsky via RunPod. I'm thinking about switching everything to RunPod because it's 5-10x cheaper and we only use models they have anyway.

There's one I won't share that's now defunct, but it let you use any diffusers-compatible project on Hugging Face, which was such a cool feature. I wish someone (cheap) would implement this!

edit: just looked at banana.dev in this thread; their templates look closest to the Hugging Face integration, though I don't think they have webhooks.


👤 RGDub
I've been very happy with Genesis Cloud (www.genesiscloud.com) - they have worked with me on getting additional GPU capacity and have very reasonable prices: $0.70 USD/hr for an Nvidia GeForce RTX 3090. They give you $15 in credits for starting an account, but you can get $50 with this referral code: https://gnsiscld.co/x5tpz

👤 Areibman
Baseten was by far the easiest setup I've tried https://www.baseten.co

👤 lordofgibbons
Do any of the "serverless"/SaaS model-hosting services perform optimizations such as quantization or input micro-batching?
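
(For quantization at least, it's easy to do yourself before uploading the model. A minimal PyTorch sketch, assuming a BERT-style model:)

  import torch
  from transformers import AutoModelForMaskedLM

  model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
  # int8-quantize the Linear layers' weights; activations stay fp32
  quantized = torch.quantization.quantize_dynamic(
      model, {torch.nn.Linear}, dtype=torch.qint8
  )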

👤 pj_mukh
If you’re using Python, Modal (modal.com) was awesome to set up.

They’ll take a FastAPI setup too and just put it online to be used on demand.
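
Something like this (a sketch based on Modal's docs around this time; decorator names may have changed since):

  import modal
  from fastapi import FastAPI

  web_app = FastAPI()
  stub = modal.Stub("fastapi-demo")

  @web_app.get("/hello")
  def hello():
      return {"message": "Hello from Modal"}

  @stub.function()
  @modal.asgi_app()  # serve the FastAPI app as a Modal web endpoint
  def app():
      return web_app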


👤 zitterbewegung
Have you tried self-hosting? All you need is business internet with a static IP, which is quite inexpensive, and inference can be done on CPU depending on what you want to run. Also, wherever you're hosting, a good rule of thumb is to have at least 1.5x as much system RAM as VRAM (e.g., at least 36 GB of RAM alongside a 24 GB GPU).

👤 joshhart
I'm the director of engineering for Databricks' model serving product. It's serverless, meaning it autoscales to and from zero. If you are a Databricks customer or willing to become one, you can reach out about enrolling in the GPU preview.

👤 bfirsh
Founder of https://replicate.com/ here, which has been mentioned a few times. Happy to help you get set up. :) ben@replicate.com

👤 outdoorblake
Banana.dev is what I use. The cold boots are fast.

👤 jetml
Check out JetML.com (I'm the founder). Happy to help get you started with a demo if you want to reach out nick@jetml.com.

👤 chaoyu_
Check out BentoML https://github.com/bentoml

👤 sa-code
What about just using a cloud VM with an Ansible script? I find ML deployment solutions to be very over-engineered.

👤 pythops
Nvidia Jetson Nano boards (the Orin and the previous generation) at home. The cloud is so expensive for GPU usage.

👤 efxhoy
We use SageMaker at work because we're on AWS. I don't really like their style of APIs, but it works.

👤 tehsauce
Vast.ai. Nobody has better prices.

👤 jvanillaaaa
Brev.dev

This is exactly what you’re looking for


👤 sjkoelle
If you want to host voice ML models, check out Uberduck.

👤 ihgautam
KFServing (now KServe).