We ran our search engine nightly against the "gov2" dataset published by TREC to check for regressions in our software. We also had a much smaller proprietary test set for patents that we could run against our production database; it would catch a catastrophic failure but might not have caught a small regression.
In general, any model ought to be evaluated before it goes into production; doing so greatly reduces the odds of it becoming a problem you have to deal with in front of customers. I worked at another place whose model training framework would always run an evaluation cycle before publishing an updated model to the repository.
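To make that concrete, here's a rough sketch in Python of the kind of gate I mean, not that framework's actual API: evaluate(), publish_model(), the metric, and the numbers are all hypothetical placeholders.

    # Sketch of an "eval before publish" gate. All names and numbers here
    # are made up for illustration; swap in your real harness and metric.

    BASELINE_MAP = 0.31    # score of the currently published model (example value)
    MAX_REGRESSION = 0.01  # largest drop tolerated before blocking the release


    def evaluate(model, test_queries) -> float:
        """Score the candidate model on a fixed test set and return a single
        relevance metric such as mean average precision (stubbed here)."""
        raise NotImplementedError("replace with your real eval harness")


    def publish_model(model) -> None:
        """Push the model to the model repository (stubbed here)."""
        raise NotImplementedError("replace with your real publish step")


    def gated_publish(model, test_queries) -> None:
        """Refuse to publish any model that regresses past the baseline."""
        score = evaluate(model, test_queries)
        if score < BASELINE_MAP - MAX_REGRESSION:
            raise RuntimeError(
                f"not publishing: MAP {score:.3f} fell below baseline "
                f"{BASELINE_MAP:.3f} minus tolerance {MAX_REGRESSION:.3f}"
            )
        publish_model(model)

The point is less the specific metric than the shape: the publish step simply cannot run unless the eval step has passed, so a regressed model never reaches the repository by accident.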