And the larger costs are typically proportional to the number of users and the like, so they can simply scale with success.
Modern ML/deep learning seems uniquely different in that it 1. requires access to enormous datasets, and 2. requires enormous amounts of compute to train (that second point cuts against the scalability point: you have to pay that enormous upfront cost even to serve a single user).
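For a sense of scale, here's a rough back-of-envelope sketch using the common FLOPs ≈ 6 × parameters × tokens approximation from the scaling-laws literature; every concrete number below (model size, token count, sustained throughput) is an illustrative assumption, not a measurement:

```python
# Rough training-cost estimate via the common ~6 * N * D FLOPs rule of thumb.
# All concrete numbers are assumptions for illustration.

params = 70e9                 # assume a 70B-parameter model
tokens = 1.4e12               # assume 1.4T training tokens
train_flops = 6 * params * tokens   # ~5.9e23 FLOPs

# Assume one H100 sustains ~4e14 FLOP/s (roughly 40% utilization of ~1e15 peak).
sustained = 4e14
gpu_hours = train_flops / sustained / 3600
print(f"~{train_flops:.1e} FLOPs, ~{gpu_hours:,.0f} H100-hours")
# -> hundreds of thousands of GPU-hours, all paid before serving user #1.
```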
1. Are any groups doing meaningful ML research without access to farms of H100s? 2. Are freely available datasets as high quality as what might be available to larger companies?
Sure, there's a lot of work to be done on the inference side of things. You're just about boned if you want to train a competitive model without competitive compute, though.
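To be concrete about the inference side, here's a minimal sketch of running a small open-weights model locally with the Hugging Face transformers pipeline; the model choice and generation settings are just illustrative:

```python
# Minimal local inference sketch with Hugging Face transformers.
# gpt2 is picked only because it's tiny enough to run on a laptop CPU.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator(
    "The biggest barrier to training large models is",
    max_new_tokens=30,
)
print(out[0]["generated_text"])
```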
> 2. Are freely available datasets as high quality as what might be available to larger companies?
There have been, for a while. The Pile is the one everyone points to, but smaller finetuning datasets are everywhere now. None of them are super useful unless you intend to train a model, though.
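If you do want to poke at them, here's a sketch of pulling public datasets with the Hugging Face `datasets` library (streaming, so you don't download the whole corpus up front); the exact dataset ids are assumptions, since The Pile's hosting has moved around:

```python
# Sketch: loading freely available corpora with Hugging Face `datasets`.
# Dataset ids below are examples of public mirrors, not endorsements.
from datasets import load_dataset

# A Pile derivative on the Hub (assumed id; the original hosting has moved):
pile = load_dataset("monology/pile-uncopyrighted", split="train", streaming=True)
print(next(iter(pile))["text"][:200])

# A small instruction-finetuning set, typical of what's everywhere now:
alpaca = load_dataset("tatsu-lab/alpaca", split="train")
print(alpaca[0]["instruction"])
```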