Layers: 24
Hidden Size: 1024
Attention Heads: 16
Using the formulas:
Neurons = Hidden Size * Attention Heads * Layers
Parameters = Attention Heads * (Hidden Size^2 / Attention Heads) * Layers
Plugging in the values:
Neurons = 1024 * 16 * 24 = 393,216
Parameters = 16 * (1024^2 / 16) * 24 = 1024^2 * 24 = 25,165,824
Note that the parameter formula simplifies to Hidden Size^2 * Layers, which counts only a single H x H projection matrix per layer. That is a significant undercount: each transformer layer actually holds roughly 12 * Hidden Size^2 parameters (four H x H attention projections for Q, K, V, and output, plus about 8 * H^2 in the MLP), and the token and position embeddings add more on top. That is why the commonly quoted parameter count for GPT-2 Medium is approximately 345 million, not the 25,165,824 the simplified formula gives.
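As a sanity check, here is a short sketch of the fuller count. The vocabulary size (50257) and context length (1024) are assumptions taken from the published GPT-2 configuration, not from the numbers above:

```python
# Approximate GPT-2 Medium's parameter count, layer by layer.
LAYERS = 24
HIDDEN = 1024
VOCAB = 50257    # assumption: GPT-2 BPE vocabulary size
CONTEXT = 1024   # assumption: GPT-2 position-embedding length

# Per layer: 4*H^2 for the Q, K, V, and output projections,
# plus 8*H^2 for the two MLP matrices (H -> 4H -> H).
per_layer = 12 * HIDDEN ** 2

# Token embeddings plus learned position embeddings.
embeddings = VOCAB * HIDDEN + CONTEXT * HIDDEN

total = LAYERS * per_layer + embeddings
print(f"per-layer:  {per_layer:,}")    # 12,582,912
print(f"embeddings: {embeddings:,}")   # 52,511,744
print(f"total:      {total:,}")        # 354,501,632
```

The total lands near 354M, which is consistent with the ~345M figure usually quoted for GPT-2 Medium (biases and LayerNorm parameters are ignored here, so this is an approximation).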
Parameters = A * (H^2 / A) * L
Is that OK?