From a live video stream, detect which keyboard keys the user pressed, which guitar fret/string the user played, etc.
I imagine, given the current state of AI, that if I train a model on enough data consisting of videos of me playing a musical instrument paired with the equivalent tab/sheet music, it should be able to transcribe my playing visually. Am I underestimating the complexity here?
If I were to build this from scratch, what tools would you recommend? TensorFlow comes to mind, but with the tsunami of AI developments in the past year, there's gotta be a better tool out there.
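For what it's worth, a lot of recent work skips end-to-end training on raw pixels and first extracts hand keypoints, then trains a much smaller model on those. Here's a minimal sketch using MediaPipe Hands (just my assumption of a reasonable starting point, not the only option; the webcam index and confidence thresholds are placeholder values):

```python
import cv2
import mediapipe as mp

# Pull 21 3D landmarks per hand from a live webcam stream.
# Camera index 0 and the thresholds below are placeholder values.
cap = cv2.VideoCapture(0)
hands = mp.solutions.hands.Hands(max_num_hands=2,
                                 min_detection_confidence=0.5,
                                 min_tracking_confidence=0.5)

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # MediaPipe expects RGB; OpenCV captures BGR
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        for hand in results.multi_hand_landmarks:
            # Each landmark has x, y normalized to the image, plus a relative z.
            # These 21-point vectors are what you'd feed a downstream classifier
            # instead of raw video frames.
            print([(lm.x, lm.y, lm.z) for lm in hand.landmark][:1])

cap.release()
hands.close()
```

A keypoint-based pipeline like this typically needs far less training data than learning fret positions straight from pixels.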
I imagine working from the audio waveform is an order of magnitude easier than video recognition.
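To illustrate the gap: for monophonic lines, pitch tracking from audio is a few library calls. A sketch with librosa (the filename is made up, and note that pyin is monophonic, so chords would need a polyphonic transcription model instead):

```python
import librosa

# Load any mono guitar recording; "take1.wav" is a placeholder path
y, sr = librosa.load("take1.wav")

# Probabilistic YIN pitch tracking over a standard-tuned guitar's range
f0, voiced_flag, voiced_prob = librosa.pyin(
    y,
    fmin=librosa.note_to_hz("E2"),  # open low E string
    fmax=librosa.note_to_hz("E6"),  # upper frets of the high E string
    sr=sr,
)

# Drop unvoiced frames (NaN) and convert frequencies to note names
notes = [librosa.hz_to_note(f) for f in f0 if f == f]
print(notes[:20])
```

Polyphony (chords) is harder, but off-the-shelf audio-to-MIDI models for that already exist.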
Video would be a pain because you would probably need to synchronise the fretting hand's position with the picking/strumming timing to know which video frame actually 'struck' each string or chord, right?
Consider fingerstyle guitar or flamenco: the picking patterns are at least as complex as the fretting-hand positions.
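If you do go the video route, one way to sidestep per-frame guessing is to detect strike times from the recording's audio track and only then look at the fretting hand in the corresponding frames. A rough sketch (librosa for onsets; the frame rate and alignment assumption are mine):

```python
import librosa

# Audio extracted from the same recording; the path is a placeholder
y, sr = librosa.load("take1.wav")
fps = 30.0  # assumed camera frame rate

# Find note/strum onsets as timestamps in seconds
onset_times = librosa.onset.onset_detect(y=y, sr=sr, units="time")

# Map each onset to the nearest video frame, assuming the audio and
# video tracks start together (in practice, align with a clap or slate)
strike_frames = [round(t * fps) for t in onset_times]
print(strike_frames[:10])
```

Then the vision model only has to answer "where is the fretting hand at frame N", which is a much smaller problem than inferring timing from pixels.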
Question - what is your root goal? Are you convinced video processing is the best way to achieve this?