From a live video stream, detect which keyboard keys the user pressed, which guitar fret/string the user played, etc.
I imagine, given the current state of AI, that if I train a model on enough data consisting of videos of me playing a musical instrument paired with the equivalent tab/sheet music, it should be able to transcribe my playing visually. Am I underestimating the complexity here?
If I were to build this from scratch, what tools would you recommend? TensorFlow comes to mind, but with the tsunami of AI developments in the past year, there's gotta be a better tool out there.
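For what it's worth, a lot of recent work skips end-to-end training on raw pixels and first extracts hand keypoints, then trains a much smaller model on those. Here's a minimal sketch using MediaPipe Hands (just my assumption of a reasonable starting point, not the only option; the webcam index and confidence thresholds are placeholder values):

```python
import cv2
import mediapipe as mp

# Pull 21 3D landmarks per hand from a live webcam stream.
# Camera index 0 and the thresholds below are placeholder values.
cap = cv2.VideoCapture(0)
hands = mp.solutions.hands.Hands(max_num_hands=2,
                                 min_detection_confidence=0.5,
                                 min_tracking_confidence=0.5)

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # MediaPipe expects RGB; OpenCV captures BGR
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        for hand in results.multi_hand_landmarks:
            # Each landmark has x, y normalized to the image, plus a relative z.
            # These 21-point vectors are what you'd feed a downstream classifier
            # instead of raw video frames.
            print([(lm.x, lm.y, lm.z) for lm in hand.landmark][:1])

cap.release()
hands.close()
```

A keypoint-based pipeline like this typically needs far less training data than learning fret positions straight from pixels.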
I imagine working from the audio waveform is an order of magnitude easier than video recognition.
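To illustrate the gap: for monophonic lines, pitch tracking from audio is a few library calls. A sketch with librosa (the filename is made up, and note that pyin is monophonic, so chords would need a polyphonic transcription model instead):

```python
import librosa

# Load any mono guitar recording; "take1.wav" is a placeholder path
y, sr = librosa.load("take1.wav")

# Probabilistic YIN pitch tracking over a standard-tuned guitar's range
f0, voiced_flag, voiced_prob = librosa.pyin(
    y,
    fmin=librosa.note_to_hz("E2"),  # open low E string
    fmax=librosa.note_to_hz("E6"),  # upper frets of the high E string
    sr=sr,
)

# Drop unvoiced frames (NaN) and convert frequencies to note names
notes = [librosa.hz_to_note(f) for f in f0 if f == f]
print(notes[:20])
```

Polyphony (chords) is harder, but off-the-shelf audio-to-MIDI models for that already exist.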
Video would be a pain because you would probably need to synchronise the fretting hand's position with the picking/strumming timing to know which video frame actually 'struck' each string or chord, right?
Consider fingerstyle guitar or flamenco: the picking patterns are at least as complex as the fretting-hand positions.
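If you do go the video route, one way to sidestep per-frame guessing is to detect strike times from the recording's audio track and only then look at the fretting hand in the corresponding frames. A rough sketch (librosa for onsets; the frame rate and alignment assumption are mine):

```python
import librosa

# Audio extracted from the same recording; the path is a placeholder
y, sr = librosa.load("take1.wav")
fps = 30.0  # assumed camera frame rate

# Find note/strum onsets as timestamps in seconds
onset_times = librosa.onset.onset_detect(y=y, sr=sr, units="time")

# Map each onset to the nearest video frame, assuming the audio and
# video tracks start together (in practice, align with a clap or slate)
strike_frames = [round(t * fps) for t in onset_times]
print(strike_frames[:10])
```

Then the vision model only has to answer "where is the fretting hand at frame N", which is a much smaller problem than inferring timing from pixels.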
Question - what is your root goal? Are you convinced video processing is the best way to achieve this?