How do you track NLP models in prod?
I'm working on a service that uses GPT-3.5 and was wondering how I would go about tracking how useful it actually is. What have you guys done, or do you even track your NLP models at all?
Are you doing classification, regression, extraction or something different? How did you evaluate the model before it went into production?