Now I will film myself performing simple actions, say a pushup or a punch, several dozen times, and process these videos to export the frame-by-frame position of each joint.
What's the easiest way to take these time-series arrays and train a small neural network on them, so that given a new video (or rather, the last N frames of a live video feed) I can detect which action, if any, was performed?
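To make concrete what I mean by "the last N frames": I'm picturing a rolling buffer of per-frame feature vectors that gets scored on every new frame. A minimal sketch, where the window length, the per-frame feature size, and the Keras-style model with a `.predict` method are all placeholder assumptions on my part:

```python
from collections import deque

import numpy as np

N_FRAMES = 30  # hypothetical window length; I'd tune this

buffer = deque(maxlen=N_FRAMES)  # rolling window of per-frame feature vectors

def score_latest(model, frame_features):
    """Append the newest frame's features; score the window once it's full."""
    buffer.append(frame_features)          # frame_features: 1-D array, e.g. (132,)
    if len(buffer) < N_FRAMES:
        return None                        # not enough history yet
    window = np.stack(buffer)[np.newaxis]  # shape (1, N_FRAMES, n_features)
    return model.predict(window, verbose=0)[0]  # one confidence per action
```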
I can already imagine how to preprocess/normalize the training data. I just need someone to point me in the right direction to learn how to train a simple model and perform inference.
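For reference, the windowing I have in mind would be something like the sketch below, mostly to pin down the shapes I'd feed the network (the window length and stride are placeholder guesses):

```python
import numpy as np

N_FRAMES = 30  # placeholder window length

def to_windows(landmarks, step=10):
    """Cut one recording of shape (frames, 33, 4) into overlapping
    windows of shape (n_windows, N_FRAMES, 132)."""
    flat = landmarks.reshape(len(landmarks), -1)  # 33 landmarks * 4 values = 132
    if len(flat) < N_FRAMES:
        return np.empty((0, N_FRAMES, flat.shape[1]), dtype=flat.dtype)
    starts = range(0, len(flat) - N_FRAMES + 1, step)
    return np.stack([flat[s:s + N_FRAMES] for s in starts])
```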
I am using Python.
Thanks for any help!
The pose estimation converts the video of the body's movement into a time series where each frame is an array of 33 landmarks (joints, basically). Each landmark consists of normalized x, y, and z values (between -1 and 1, or between 0 and 1; up to me) plus a visibility score (0-1). To keep it simple, every video will feature a single subject filmed from the same camera position and angle, with the subject standing in the same spot and facing the same direction.
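The 33-landmark-plus-visibility format above matches MediaPipe Pose, so assuming that's the estimator (my guess), extracting one array per video could look something like this sketch:

```python
import cv2
import mediapipe as mp
import numpy as np

def video_to_landmarks(path):
    """Return an array of shape (frames, 33, 4): x, y, z, visibility per joint."""
    rows = []
    cap = cv2.VideoCapture(path)
    # Assumes the legacy mp.solutions.pose API; the newer MediaPipe Tasks API differs.
    with mp.solutions.pose.Pose(static_image_mode=False) as pose:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if result.pose_landmarks is None:
                continue  # no body detected in this frame
            rows.append([[lm.x, lm.y, lm.z, lm.visibility]
                         for lm in result.pose_landmarks.landmark])
    cap.release()
    return np.asarray(rows, dtype=np.float32)
```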
As for the classification output, I just want booleans or confidence scores indicating which action(s) were likely performed.
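To make that concrete, here's the scale of model I'm imagining, sketched in Keras (the framework, layer sizes, and two-action setup are all assumptions on my part, not requirements). Independent sigmoid outputs give one confidence score per action, so "no action" is just every score staying near zero:

```python
import tensorflow as tf

N_FRAMES, N_FEATURES, N_ACTIONS = 30, 132, 2  # e.g. pushup, punch

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(N_FRAMES, N_FEATURES)),
    tf.keras.layers.Conv1D(64, kernel_size=5, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(N_ACTIONS, activation="sigmoid"),  # one score per action
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# X: (samples, N_FRAMES, N_FEATURES) windows; y: (samples, N_ACTIONS) 0/1 labels
# model.fit(X, y, epochs=50, validation_split=0.2)
```

(A softmax with an explicit "no action" class would presumably work too; sigmoids just make multi-label output trivial.)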
It seems like what I'm looking for is a tutorial on "Multivariate Time Series Classification". Is that correct?