Now I will film myself performing simple actions, say a pushup or a punch, several dozen times, and process these videos to export the frame-by-frame position of each joint.
What's the easiest way to take these time-series arrays and train a small neural network on them, so that given a new video (or rather, the last N frames of a live video feed) I can detect which action, if any, was performed?
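To make concrete what I mean by "the last N frames": I'm picturing a rolling buffer of per-frame feature vectors that gets scored on every new frame. A minimal sketch, where the window length, the per-frame feature size, and the Keras-style model with a `.predict` method are all placeholder assumptions on my part:

```python
from collections import deque

import numpy as np

N_FRAMES = 30  # hypothetical window length; I'd tune this

buffer = deque(maxlen=N_FRAMES)  # rolling window of per-frame feature vectors

def score_latest(model, frame_features):
    """Append the newest frame's features; score the window once it's full."""
    buffer.append(frame_features)          # frame_features: 1-D array, e.g. (132,)
    if len(buffer) < N_FRAMES:
        return None                        # not enough history yet
    window = np.stack(buffer)[np.newaxis]  # shape (1, N_FRAMES, n_features)
    return model.predict(window, verbose=0)[0]  # one confidence per action
```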
I can already imagine how to preprocess/normalize the training data. I just need someone to point me in the right direction to learn how to train a simple model and perform inference.
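For reference, the windowing I have in mind would be something like the sketch below, mostly to pin down the shapes I'd feed the network (the window length and stride are placeholder guesses):

```python
import numpy as np

N_FRAMES = 30  # placeholder window length

def to_windows(landmarks, step=10):
    """Cut one recording of shape (frames, 33, 4) into overlapping
    windows of shape (n_windows, N_FRAMES, 132)."""
    flat = landmarks.reshape(len(landmarks), -1)  # 33 landmarks * 4 values = 132
    if len(flat) < N_FRAMES:
        return np.empty((0, N_FRAMES, flat.shape[1]), dtype=flat.dtype)
    starts = range(0, len(flat) - N_FRAMES + 1, step)
    return np.stack([flat[s:s + N_FRAMES] for s in starts])
```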
I am using Python.
Thanks for any help!
The pose estimation converts the video of the body's movement into a time series where each frame is an array of 33 landmarks (joints, basically). Each landmark consists of normalized x, y, and z values (between -1 and 1, or between 0 and 1; up to me) plus a visibility score (0-1). To keep it simple, every video will feature a single subject filmed from the same camera position and angle, with the subject standing in the same spot and facing the same direction.
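The 33-landmark-plus-visibility format above matches MediaPipe Pose, so assuming that's the estimator (my guess), extracting one array per video could look something like this sketch:

```python
import cv2
import mediapipe as mp
import numpy as np

def video_to_landmarks(path):
    """Return an array of shape (frames, 33, 4): x, y, z, visibility per joint."""
    rows = []
    cap = cv2.VideoCapture(path)
    # Assumes the legacy mp.solutions.pose API; the newer MediaPipe Tasks API differs.
    with mp.solutions.pose.Pose(static_image_mode=False) as pose:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if result.pose_landmarks is None:
                continue  # no body detected in this frame
            rows.append([[lm.x, lm.y, lm.z, lm.visibility]
                         for lm in result.pose_landmarks.landmark])
    cap.release()
    return np.asarray(rows, dtype=np.float32)
```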
As for the classification output, I just want booleans or confidence scores indicating which action(s) were likely performed.
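To make that concrete, here's the scale of model I'm imagining, sketched in Keras (the framework, layer sizes, and two-action setup are all assumptions on my part, not requirements). Independent sigmoid outputs give one confidence score per action, so "no action" is just every score staying near zero:

```python
import tensorflow as tf

N_FRAMES, N_FEATURES, N_ACTIONS = 30, 132, 2  # e.g. pushup, punch

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(N_FRAMES, N_FEATURES)),
    tf.keras.layers.Conv1D(64, kernel_size=5, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(N_ACTIONS, activation="sigmoid"),  # one score per action
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# X: (samples, N_FRAMES, N_FEATURES) windows; y: (samples, N_ACTIONS) 0/1 labels
# model.fit(X, y, epochs=50, validation_split=0.2)
```

(A softmax with an explicit "no action" class would presumably work too; sigmoids just make multi-label output trivial.)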
It seems like what I'm looking for is a tutorial on "Multivariate Time Series Classification". Is that correct?