I am very much a beginner in the space of machine learning and have been overwhelmed by the choices available. Eventually I do want to build my own rig and just train models on that, but I don't have that kind of money right now, nor is it easy to find GPUs even if I could afford them.
So I am basically stuck with cloud solutions for now, which is why I want to hear personal experiences from HN folks who have used any of the available ML platforms: their benefits, shortcomings, which are more beginner-friendly, cost-effective, etc.
I am also not opposed to configuring environments myself rather than using managed solutions (such as Gradient) if doing so is more cost effective, or affords better reliability or better-than-average resource availability. I ask because I have read complaints that Colab has poor GPU availability since GPUs are shared among subscribers, and that the more you use it, the less time is allocated to you; I'm not sure how big a problem that actually is, though.
I am very motivated to delve into this space (it's been on my mind a while) and I want to do it right, which is why I am asking for personal experiences on this forum: HN has a very healthy mix of technology hobbyists and professionals, and both perspectives are equally valuable to me, for different reasons.
Also, please feel free to include any unsolicited advice such as learning resources, anecdotes, etc.
Thanks for reading until the end.
While the (precious and useful) advice here seems to cover mostly the bigger infrastructures, please note that you can effectively do an important slice of machine learning work (study, personal research) with just a power-efficient CPU (no GPU), on the order of minutes, on battery power. That comes before going to "Big Data".
And there are lightweight tools: I am currently enamoured with Genann («minimal, well-tested open-source library implementing feedforward artificial neural networks (ANN) in C», by Lewis Van Winkle), a single C file of 400 lines compiling to a 40 kB object, yet quite sufficient to solve a number of the problems you may meet.
https://codeplea.com/genann // https://github.com/codeplea/genann
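To make the "CPU-scale is enough" point concrete, here is a minimal sketch — plain Python, not Genann's C API, and every name is my own illustration — of the kind of tiny feedforward network such a library implements:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class TinyNet:
    """A 2-2-1 feedforward network with sigmoid activations, stdlib only."""

    def __init__(self, rng):
        self.w1 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
        self.b1 = [0.0, 0.0]
        self.w2 = [rng.uniform(-1, 1) for _ in range(2)]
        self.b2 = 0.0

    def forward(self, x):
        # Hidden layer, then a single sigmoid output in (0, 1).
        self.h = [sigmoid(w[0] * x[0] + w[1] * x[1] + b)
                  for w, b in zip(self.w1, self.b1)]
        self.y = sigmoid(self.w2[0] * self.h[0] + self.w2[1] * self.h[1] + self.b2)
        return self.y

    def train_step(self, x, target, lr=0.1):
        # One gradient-descent step on the squared error (y - target)^2.
        y = self.forward(x)
        d2 = 2.0 * (y - target) * y * (1.0 - y)                 # output delta
        d1 = [d2 * w * h * (1.0 - h) for w, h in zip(self.w2, self.h)]
        self.w2 = [w - lr * d2 * h for w, h in zip(self.w2, self.h)]
        self.b2 -= lr * d2
        for i in range(2):
            for j in range(2):
                self.w1[i][j] -= lr * d1[i] * x[j]
            self.b1[i] -= lr * d1[i]
        return (y - target) ** 2
```

Looping `train_step` over a handful of (input, target) pairs fits toy problems in seconds on any laptop CPU, and you get to see every moving part of backpropagation.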
After all, is it a good idea to use tools that automate process optimization while you are still learning the trade? Only partially. You should build - in general, and even metaphorically - the legitimacy of your Python operations on a solid C foundation.
And note that you can also build ANNs in R (and other math or stats environments), if that's what you need or are more comfortable with.
Also note - as a reminder - that Prof. Patrick Winston's MIT lectures for the Artificial Intelligence course (classical AI with a few lectures on ANNs) are freely available. They cover the ground before a climb into the newer techniques.
My advice is to go with Colab Pro ($50/mo) and TensorFlow/Keras. You can go with PyTorch too if you prefer.
I made the mistake of buying a 2080 Ti for my desktop thinking it would be better, but no. Consumer-grade hardware is nowhere near as good/fast as the server-grade hardware you get in Colab. Plus you have the option to use TPUs in Colab if you want to scale up quickly.
You really don't need to get fancy with this setup. The best part of using Colab is you can work on your laptop from anywhere, and never worry about your ML model hogging all your RAM (and swap) or compute and slowing your local machine down. Trust me, this sucks when it happens, and you have to restart!
As for your data, you can host it in a GCS bucket. For small data (<1 TB), even better is Google Drive (I know, crazy). Colab can mount your Google Drive and loads from it extremely quickly. It's like having a remote filesystem, except with a handy UI, collaboration options, and an easy way to inspect and edit your data.
I use a Paperspace VM + Parsec for personal ML projects. Whenever I've done the math, the hourly rate on a standard VM with a GPU beats purchasing a local machine, and the complexity of a workflow-management tool for ML just isn't worth it unless you are collaborating across many researchers. As an added bonus, you can reuse these VMs for any hobby gaming you might do.
The majority of ML methods train quickly on a single large modern GPU for typical academic datasets. Scaling beyond one GPU or one host leads into big-model research; while big models are a hot field, that is where you would need large institutional support to do anything interesting. A model isn't "big" these days unless it's > 30 GB :)
Even in a typical industrial setting, you'll find the majority of scientists using various Python scripts to train and preprocess data on a single server. Data wrangling is the main component that requires large compute clusters.
As for software, I do everything with JAX, plus TensorBoard for viewing experiments. JAX is a phenomenal library for personal ML learning, as it's extremely flexible and has relatively low-level, composable abstractions.
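To give a taste of those composable abstractions, here is a minimal sketch using the standard `jax.grad` and `jax.jit` transforms (the linear-regression loss is just an illustrative toy):

```python
import jax
import jax.numpy as jnp

def loss(w, x, y):
    """Mean squared error of a linear model x @ w against targets y."""
    pred = jnp.dot(x, w)
    return jnp.mean((pred - y) ** 2)

grad_fn = jax.grad(loss)      # transform: d(loss)/dw, by default w.r.t. arg 0
fast_grad = jax.jit(grad_fn)  # transform: JIT-compile the gradient function

w = jnp.array([1.0, -1.0])
x = jnp.array([[1.0, 2.0],
               [3.0, 4.0]])
y = jnp.array([0.0, 1.0])
g = fast_grad(w, x, y)        # gradient, computed by the compiled function
```

The point is that `grad`, `jit`, `vmap`, etc. are orthogonal function transforms you stack as needed, rather than a monolithic training API.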
I am biased towards using Keras and I suggest you bookmark these curated examples https://keras.io/examples/
I bought an at-home GPU rig 3 years ago and I regret that decision. As many other people here have mentioned, Google Colab is a great resource and will save you a lot of time because you will not be setting up your own infrastructure. Start with the free version and, when you really need to, switch to Pro or Pro+.
For more flexibility, set up a GPU VPS instance that you can stop when not in use to save money. I like GCP and AWS, but I used to use Azure and that is also a great service. When a VPS is in a stopped state, you only pay a little money for storage. I will sometimes go weeks without starting up my GPU VPS to run an experiment. Stick with Colab when it is good enough for what you are doing.
Now for a little off topic tangent: be aware that most knowledge work is in the process of being automated. Don’t be disappointed if things you spend time learning get automated away. Look at the value of studying new tech as being very transitory, and you will always be in the mode you are in right now: a good desire to learn new things. Also, think of deep learning in the context of using it for paid work to solve real problems. As soon as you feel ready, start interviewing for an entry level deep learning or machine learning job.
Learn Machine Learning first. Do not spend time on managing infra for ML while you are learning ML. Focus on learning ML first.
You can make decent cutting edge models and SOTA classic models just with free options. I am saying this because I have done this.
I suggest that you get Colab Pro after that.
AWS burns a hole in your pocket, and you should not spend money on it right now. That said, AWS SageMaker is a pretty stress-free experience.
I personally use GCP; I find its tooling the most convenient.
I suggest you learn the basics first. Learn classic ML, CNNs, RNNs, LSTM, Transformers, learn the necessary Maths, and even GANs if you are inclined.
If done in the right way, it will take you anywhere from 5-6 months to 18-20 months, depending on your time commitment and your current grasp of programming and math.
Do not rush or hurry.
When you reach that point, you can think of spending serious money for Deep Learning projects.
A few months back, I got into TPUs, and they are fantastic. GCP is my only option for those. I have only used TPUs for learning and personal projects, never for work, and I intend to keep it that way for a while.
This is my default go-to as a poor man's ML setup, with the environment and dependencies set up automatically via a bash script on startup.
In terms of frameworks, PyTorch seems to be better documented than TensorFlow and, in my opinion, supports a more intuitive model for GPU/TPU compute. It also natively supports complex number types when backpropagating, so there is no need to implement your own. TensorFlow also seems to have issues converting Python code to its graph, where PyTorch basically never does; it can take me a third less time to write a program in PyTorch because of this. If you are using high-level interfaces, this shouldn't be an issue, though.
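As an illustration of that compute model, device placement in PyTorch is a couple of explicit lines (a minimal sketch with the standard `torch` API; the model and shapes are arbitrary):

```python
import torch

# Device-agnostic setup: fall back to CPU when no GPU is present.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(4, 2).to(device)  # toy model, arbitrary sizes
x = torch.randn(8, 4, device=device)      # batch of 8 fake samples
out = model(x)                            # runs on GPU if one is available
```

The same script then works unchanged on a laptop CPU, a local GPU, or a Colab instance.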
Colab (and, I believe, SageMaker) has free instances with high-powered GPUs/TPUs. However, I prefer having access to a good graphical debugger, so I develop on my local computer, then run large models on Colab. If you can afford it, I'd recommend a cheap, low-power CUDA-capable GPU for your local computer to develop the network, then an IPython-based cloud solution when memory/compute becomes limiting. Those are also a fine place to start out; it's just that having a graphical debugger can make you more productive.
We use Redshift to do a lot of the heavy lifting and initial data preparation, then SageMaker for hosting and scoring models, and Tableau for dashboards.
While you can do training within SageMaker, we have a cluster of EC2 instances using H2O libraries (XGBoost) to train; we then wrap the resulting model as a Docker image, deploy it to ECR, and link it to a SageMaker endpoint.
Clunky and very much human-in-the-loop for training and deployment, but you can't run before you can crawl in this space.
A lot of end-to-end platforms are available nowadays that try to cover the entire lifecycle of a model, from data prep and ETL to training, serving, monitoring, and operations. However, I found none of them robust enough to cover all these cases well, so I resorted to combining pieces from different vendors with my own tooling to make the platform suit my needs. This is still not perfect, though, and I think there's a lot of room for improvement in the space to enable truly easy-to-use and scalable MLOps.
Still, some of the tools I found to be OK: TensorFlow TFX, Kubeflow (to some extent - ops are a nightmare), Feast, and MLflow. GCP Vertex and AWS SageMaker can get some work done, too.
Colab is great for diving into examples that are already premade for it.
Kaggle is better, in my opinion, at dataset handling: you can import a public dataset or upload your own with ease. They give you 30+ GPU hours per week, with the ability to train your models in the background. This can't be done in Colab.
The Azure ML platform is next-level when you can pay for it. I got credits from school. You can start experiments from the Python SDK with your own configurations, set up Python environments, upload datasets, etc.
Look at some kind of AutoML framework like AutoGluon, then dive deeper into the components it uses once you've gotten through the initial setup. AutoGluon will let you train some basic models with all the data cleaning and normalisation steps handled for you.
vast.ai has pretty low prices and gives you remote ssh into a GPU instance that you then have root on (albeit containerized).
Having a local GPU is effectively a requirement for doing "development" work (e.g. getting an architecture and/or codebase to the point where you would even be able to start training). Unfortunately, getting your own GPU is just absurdly expensive these days and probably not worth it. In the meantime, colab/kaggle/paperspace can be _okay_ as dev environments. Unfortunately, renting compute on vast.ai all day just to do occasional dev work gets expensive pretty quickly.
For something in-between vast.ai and AWS, datacrunch.io has slightly higher prices, with remote SSH into a server and a few more "niceties" that you get with a traditional cloud such as CPU instances and the ability to use those to pre-load data onto disk.
If and when you are able to get a GPU - just make sure to get Nvidia, as they have a stranglehold on the industry. The RTX cards are great - I've been doing tons of multimodal work on an RTX 2070 I bought pre-pandemic for around $350. It only has 8 GiB of VRAM but is otherwise actually quite similar to a server-style V100. I assume it probably costs something like $2000 these days.
If you're interested in the realm of running inference/training on giant models (say GPT-J, or 20B-scale models like GPT-NeoX), you may find yourself short on VRAM. Using libraries like DeepSpeed, you can split the work across multiple GPUs. I highly recommend investing time in learning multi-GPU libraries or framework-provided features like PyTorch's DistributedDataParallel, as model size becomes a limiting factor very quickly in the case of transformers. A sibling comment mentions that you will need institutional support to train such models. That may be true, unfortunately. All I will say is that if you are even mildly competent, demand for that type of work has been increasing a lot lately.
Oh, and yes - there is a newish site called Replicate (https://replicate.com/) that I have been using to let people run inference on models I've trained without needing to be coders. A lot of people use Colab for this, but that platform is annoying to support in practice.
For smaller projects, I generally find a Towhee pipeline (https://towhee.io/pipelines) that I then fine-tune on my 3080.
For general advice focused on beginners, and ESPECIALLY for practical, cheap, and efficient methods and hacks for doing DL, I recommend searching https://www.fast.ai/ and their forums https://forums.fast.ai/
I'll try to search inside fast.ai for a more specific link to give. I know that one of their chief pieces of advice has been to use Colab, and to take advantage of the $300 free credit you get (per credit card) when signing up for Google Cloud, which you can use for DL.
Disclaimer - I'm one of the creators of DagsHub, we created the platform especially to help people like you with the difficulties of managing things like data and model versioning, experiment tracking, labeling, etc. we'd love to have you onboard, and thanks for reading until the end :)
Anecdote: When I was taking the 'Computing For Data Science' class, we had a task to learn to use AWS tools like SageMaker, the NLP bot, or DeepRacer and present them in class. The professor was also new to the whole AWS ecosystem. He opened many instances and left them running for a week, which ended up taking $1000 from his bank account. (Moral of the story: don't use AWS with the card that holds all your money.)
You can give it a try as well: https://deploif.ai (It says paid on the website, but just get on our Discord and message me). The platform now supports GCP and Azure as well. I am happy to guide you through as well. It's not complete, but in case you choose to go ahead with cloud, this could help you out :)
We'd also be happy to have someone try the tool!
The problem with Colab, IMO, is that if it's your main platform, you'll be pushed to use notebooks for everything, which is not really good practice. Whatever you use, I'd suggest focusing on building a real train.py script (I'm assuming you'll be using Python) that takes command-line arguments for the hyperparameters. Don't get sloppy and just run things as a bunch of cells.
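A minimal sketch of what such a train.py skeleton could look like, using only the standard library's `argparse` (all flag names and defaults are illustrative):

```python
import argparse

def build_parser():
    # Hyperparameters become reproducible CLI flags instead of edited cells.
    p = argparse.ArgumentParser(description="training entry point (sketch)")
    p.add_argument("--lr", type=float, default=3e-4, help="learning rate")
    p.add_argument("--batch-size", type=int, default=32)
    p.add_argument("--epochs", type=int, default=10)
    p.add_argument("--data-dir", default="data/")
    return p

def main(argv=None):
    args = build_parser().parse_args(argv)
    # ... build the model and data loaders, then loop over args.epochs ...
    return args

if __name__ == "__main__":
    main()
```

Run as `python train.py --lr 0.01 --epochs 3`; every experiment is then reproducible from its command line alone.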
If you are learning, my unsolicited advice is: don't use built-in datasets. Make sure you can write datasets/dataloaders yourself so you understand what is going on and can adapt them to your own work. All the stock examples using built-in MNIST or whatever gloss over the most important part: setting up the data.
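For example, a dataset plus a batching loader can be written in a few lines of plain Python. This sketch mirrors the `__len__`/`__getitem__` protocol that PyTorch's map-style `Dataset` uses, but with no framework dependency (the names are my own):

```python
import random

class ToyDataset:
    """Index-based dataset: maps an integer index to a (features, label) pair."""

    def __init__(self, samples):
        self.samples = samples            # list of (features, label) pairs

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, i):
        return self.samples[i]

def batches(dataset, batch_size, shuffle=True, rng=None):
    """Minimal dataloader: yields lists of samples, optionally shuffled."""
    idx = list(range(len(dataset)))
    if shuffle:
        (rng or random).shuffle(idx)
    for start in range(0, len(idx), batch_size):
        yield [dataset[i] for i in idx[start:start + batch_size]]
```

Once you can write this yourself, swapping in a real framework's Dataset/DataLoader (or adapting to your own files) is straightforward.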
As an example: one of the reasons I don't use Kubeflow is that it requires having a Kubernetes cluster up and running, which is overkill in many cases.
Check out the project I'm working on: https://github.com/ploomber/ploomber
Finances aside, it's really nice being able to iterate locally on things like training/inference pipelines and model serving. My work is more toward the ML engineering space than it is research, so I don't spend much time in Colab.
I personally find most cloud providers annoying to use for personal projects. You have to ask for permission to get access to a GPU that's no better than what you get for free with Colab. Then there's all sorts of configuration you have to do. Colab is much easier, with basically zero wait time between logging in and starting to run code.
At work we use Databricks, which is too expensive for personal use.
I have no affiliation with them whatsoever :) Just a fan of what they’re doing.
I think you are getting sidetracked by a bunch of people at a car show with their hoods popped, checking out each other's custom chrome engines. It is a bit pointless to worry about that stuff if you don't even know how to drive yet.
For actual deployment in production, the only thing that's really affordable is sending your own GPU workstations to a colocation hosting company. But that's a lot of work.
Both of them work great for scratch projects.
https://elbo.ai - Train more. Pay less
We want to make ML tasks as cheap and as easy as possible. We can provision GPU nodes from multiple cloud providers (today we have 4: TensorDock, AWS, Linode, and FluidStack). You don't have to sign up with them or manage keys, passwords, AMI images, VPCs, subnets, firewall rules, or EBS volumes, or worry about Colab closing your session, network-transfer bills, GPU-usage approvals, opening ports, or billing surprises. We take care of all that and let you focus on learning ML.
I faced the same problem when I started learning ML and tried different cloud providers, Colab, Paperspace, and a custom PC with an RTX 30-series GPU. Most of the solutions were either very expensive or very complicated. I started building a tool for myself to deploy GPU nodes with a single command, and thought it would be a nice product for other ML learners like me.
1. Sign up at https://elbo.ai for the free tier.
2. `pip3 install elbo`
3. `elbo login` with your token (from signup)
4. Start a Jupyter notebook with a single command, typically in under 4 minutes: `elbo notebook`
5. Set up a GPU node to work on remotely over SSH using `elbo create`
6. Submit ML tasks defined in a YAML file using `elbo run --config `
Quick start guide - https://docs.elbo.ai/quick-start
CLI reference - https://docs.elbo.ai/reference/cli-reference
Looking at our inventory today, you can get a decent Quadro 4000 GPU with 16 CPU and 32 GB memory for about $0.61 an hour.
PRICE GPU CPU MEM GPU-MEM PROVIDER
$ 0.2700/h Tesla K80 4 61Gb 12Gb AWS (spot)
$ 0.6100/h Quadro 4000 16 32Gb 8Gb TensorDock
$ 0.9000/h Tesla K80 4 61Gb 12Gb AWS
$ 0.9180/h V100 8 61Gb 16Gb AWS (spot)
$ 0.9200/h Quadro 5000 2 4Gb 16Gb FluidStack
$ 0.9600/h A5000 2 16Gb 24Gb TensorDock
$ 1.4900/h A4000 12 64Gb 16Gb FluidStack
$ 1.4940/h A40 2 12Gb 48Gb TensorDock
$ 1.5000/h Quadro 6000 8 32Gb 0Gb Linode
$ 1.5140/h A6000 2 16Gb 48Gb TensorDock
$ 2.1600/h 8x Tesla K80 32 488Gb 12Gb AWS (spot)
$ 3.0000/h 2x Quadro 6000 16 64Gb 0Gb Linode
$ 3.0600/h V100 8 61Gb 16Gb AWS
$ 3.6720/h 4x V100 32 244Gb 16Gb AWS (spot)
$ 3.7460/h 7x V100 6 8Gb 16Gb TensorDock
$ 4.3200/h 16x Tesla K80 64 732Gb 12Gb AWS (spot)
$ 4.5000/h 3x Quadro 6000 20 96Gb 0Gb Linode
$ 6.0000/h 4x Quadro 6000 24 128Gb 0Gb Linode
$ 7.3440/h 8x V100 64 488Gb 16Gb AWS (spot)
$ 7.9200/h 8x Tesla K80 32 488Gb 12Gb AWS
$ 9.8318/h 8x A100 96 1152Gb 80Gb AWS (spot)
$13.0360/h 4x V100 32 244Gb 16Gb AWS
$14.4000/h 16x Tesla K80 64 732Gb 12Gb AWS
$24.4800/h 8x V100 64 488Gb 16Gb AWS
$32.7726/h 8x A100 96 1152Gb 80Gb AWS
If you just need a dedicated machine in the cloud, then I would highly recommend our provider TensorDock (https://tensordock.com/). They have a good range of ML-capable GPUs and are cheaper than many other cloud providers. We are just getting started, so if you hit any glitches or bugs, please email us at hi@elbo.ai
Thanks for reading till here and for your time!
EDIT: Updated formatting.