Some hurdles I see:
- GitHub rate-limits GET requests, so it doesn't seem possible to scrape all the source code on there. But maybe it could be crowdsourced like SETI@home, so 1,000 people install a program to get around the limit.
- Training the model. I imagine this would be the hardest part, since it could cost millions of dollars. Is there a way around that, or could free tools like Colab work?
- Running the API. Once the model is trained, would it be possible to run it on a Lenovo-type laptop? I guess you need lots of VRAM to run it?
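On the crowdsourced-scraping idea, a quick back-of-envelope sketch helps size the problem. The repo count and requests-per-repo figures below are illustrative assumptions, not measured numbers; the 5,000 requests/hour figure is GitHub's standard authenticated REST API limit.

```python
# Rough estimate: how long would a SETI@home-style volunteer crawl of
# GitHub take, staying inside the per-account API rate limit?

RATE_LIMIT_PER_HOUR = 5_000      # authenticated GitHub REST API limit
REPOS_TO_CRAWL = 200_000_000     # rough public-repo count (assumption)
REQUESTS_PER_REPO = 50           # tree + file fetches per repo (assumption)

def crawl_days(volunteers: int) -> float:
    """Days needed to issue all requests with `volunteers` machines in parallel."""
    total_requests = REPOS_TO_CRAWL * REQUESTS_PER_REPO
    requests_per_day = volunteers * RATE_LIMIT_PER_HOUR * 24
    return total_requests / requests_per_day

print(f"1,000 volunteers: {crawl_days(1_000):.0f} days")  # ~83 days
```

So under these assumptions, 1,000 volunteers could cover the crawl in a few months, which makes the crowdsourcing idea at least plausible on paper.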
Final question: will a home-brewed version be just as good? What factors determine that?
Just curious how we could do it, as I imagine there are a lot of ML experts here.
Training the model would be expensive, but it’s a one-and-done process. With the model openly available, cloud providers could offer a subscription service to end users, which would recoup the cost of running it.
The only issue is that I imagine GitHub holds much more than 3 TB of code.
This is the same reason people can’t easily “play with” GPT like models.
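To put a number on "expensive," here is a rough training-cost sketch using the standard ~6·N·D FLOPs estimate for transformers. The GPU throughput and utilization figures are assumptions for illustration; the parameter and token counts are the published GPT-3 figures.

```python
def training_flops(params: float, tokens: float) -> float:
    """~6 * N * D FLOPs for one training pass (standard transformer estimate)."""
    return 6 * params * tokens

def gpu_days(flops: float,
             gpu_flops_per_sec: float = 1e14,  # ~100 TFLOPS, assumed
             utilization: float = 0.4) -> float:  # assumed achievable fraction
    """Single-GPU days of compute at the assumed throughput."""
    return flops / (gpu_flops_per_sec * utilization) / 86_400

f = training_flops(175e9, 300e9)  # GPT-3 scale: 175B params, 300B tokens
print(f"{gpu_days(f):.0f} single-GPU days")
```

That works out to roughly 250 GPU-years, i.e. you need thousands of GPUs running for months, which is why this stays out of hobbyist reach even though it is "one-and-done."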
> would it be possible to run it on a lenovo type laptop?
No.
You might, with a MacBook Pro on an M1 or M2 with 64GB of unified memory; on pretty much any other laptop, categorically no.
You’d have to rent or own a separate server with epic GPU power.
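The memory claim above is easy to sanity-check: weights alone at fp16 take 2 bytes per parameter, before counting activations or the KV cache. A minimal sketch (the model sizes are the well-known GPT-3 figure and an assumed ~12B code-model size):

```python
def vram_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone (fp16 = 2 bytes/param).
    Activations and KV cache add more on top, so this is a lower bound."""
    return params_billions * 1e9 * bytes_per_param / 1024**3

print(f"175B (GPT-3 scale): {vram_gb(175):.0f} GB")  # no laptop holds this
print(f"12B (assumed code-model size): {vram_gb(12):.0f} GB")
```

A 175B model needs hundreds of GB just for weights, while a ~12B model squeaks in around 22 GB, which is why a 64GB unified-memory Mac is the only laptop-shaped hardware with a chance.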
> Final question is will a home brewed version be just as good?
No.
The open-source language models are not as good as GPT-3.
Code-assist AI does no attribution. This removes engagement between the dev and library authors, which ruins the chances of attracting new contributors over time, eroding and eventually killing FOSS communities.
Code-assist AI also does not care about licenses. See [1].
1: https://www.bleepingcomputer.com/news/security/microsoft-sue...