I'm currently working on getting an ML model written in R to run against our backend system written in Java. After the dust settles I'll be looking for ways to streamline this process.
What didn't work:
Shipping pickled models to other teams.
Deploying Sagemaker endpoints (too costly).
Requiring editing of config files to deploy endpoints.
What did work:
Shipping http endpoints.
Deriving api documentation from model docstrings.
Deploying lambdas (less costly than Sagemaker endpoints).
Writing a ~150 line Python script to pickle the model and save a requirements.txt, some API metadata, and test input/output data (roughly sketched below).
Continuous deployment (after the model is saved there's no manual intervention, provided the model's response matches the saved output data).
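Not the original script, but a minimal sketch of what that packaging step could look like (the file names and metadata fields are my assumptions, not the commenter's):

```python
import json
import pickle
import subprocess
from pathlib import Path

def package_model(model, api_name, test_input, out_dir="model_bundle"):
    """Pickle a model plus everything a deployment pipeline needs to serve it."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)

    # 1. Serialize the trained model
    with open(out / "model.pkl", "wb") as f:
        pickle.dump(model, f)

    # 2. Freeze the environment so the lambda/container can be rebuilt exactly
    reqs = subprocess.run(["pip", "freeze"], capture_output=True, text=True).stdout
    (out / "requirements.txt").write_text(reqs)

    # 3. API metadata, e.g. derived from the model's docstring
    (out / "api.json").write_text(json.dumps({
        "name": api_name,
        "description": (model.__doc__ or "").strip(),
    }, indent=2))

    # 4. Test input/output pair used as the CD gate: the deployed endpoint
    #    must reproduce expected_output before it goes live
    (out / "test_case.json").write_text(json.dumps({
        "input": test_input,
        "expected_output": model.predict(test_input).tolist(),  # sklearn-style model assumed
    }, indent=2))
```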
Most of our tools are built in Rust. Several of those are for creating (and cleaning) data streams out of datasets, which are then converted into tensors or ROS messages.
1. Use R packages to bundle up the models with a consistent interface.
2. Create thin Plumber APIs that wrap these packages/models.
3. Build Docker images from these APIs.
4. Deploy API containers to a Docker Swarm (but you could use any orchestration).
5. Stick Nginx in front of them to get pretty, human-readable routes.
6. Call the models via HTTP from C#.
This stack works pretty well for us. Response times are generally fast and throughput is acceptable. And the speed at which we can get models into production is massively better than when we used to semi-manually translate model code to C#...
Probably the biggest issue was getting everyone on board with a standard process and API design (it took a few iterations), and putting in place all the automation/process/culture to help data teams write robust, production-ready software.
Data scientists have a similar docker image running in kubernetes which includes all of these images as conda environments for experimenting in prod-like environments. Spark is used to fetch data for the most part.
Models report a finished state over Kafka after being persisted to buckets in Google Cloud; the artifacts then get mirrored over to a Ceph cluster connected to our serving Kubernetes cluster.
We have an in-house Golang server binding to C++ for serving PyTorch neural nets persisted with the torch.jit API (I can really recommend this for hassle-free model serving). We also have some Java apps for serving normal ALS or Annoy based models.
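For anyone unfamiliar with that persistence step, here's a minimal sketch of exporting a model with torch.jit; the network itself is a placeholder:

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):  # stand-in for the real network
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(128, 10)

    def forward(self, x):
        return self.fc(x)

model = TinyNet().eval()
example = torch.randn(1, 128)

# torch.jit.trace (or torch.jit.script) produces a self-contained TorchScript
# module that a non-Python runtime can execute
traced = torch.jit.trace(model, example)
traced.save("model.pt")
# a Go/C++ server can then load it via libtorch's torch::jit::load("model.pt")
```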
Our traffic is not as wild as many here, but we're serving around 10M user requests a day.
We also do a merging of results from several models' results, and join them together with a separate "meta-model" that estimates which model the user has had a preference for recently, to weight those up.
There's probably a lot of details left out here, especially about the serving part, since we have various services in front of the models enriching data and presenting it to the user, but it's the gist of it.
Models:
- Models are structured as Python packages; each model inherits from a base class
- the base class defines how to train and how to predict (as well as a few other more specific things); a rough sketch follows below
- the ML engineer can override the model serialization methods, the default is just pickle
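Not their actual code, but a minimal sketch of what such a base class might look like (method names are illustrative):

```python
import pickle
from abc import ABC, abstractmethod

class BaseModel(ABC):
    """Every model package ships one subclass of this."""

    @abstractmethod
    def train(self, X, y):
        """Fit the model on training data."""

    @abstractmethod
    def predict(self, X):
        """Return predictions for a batch of inputs."""

    # serialization hooks: default is plain pickle, but an ML engineer can
    # override these (e.g. to save a TensorFlow graph instead)
    def save(self, path):
        with open(path, "wb") as f:
            pickle.dump(self, f)

    @classmethod
    def load(cls, path):
        with open(path, "rb") as f:
            return pickle.load(f)
```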
Infra:
- Code is checked into GitHub; a Docker container is built on each merge into master
- Use Sagemaker BYO container to train models, each job gets a job_id that represents the state produced by that job (code + data used)
Inference / deployment:
- Deploy models as HTTP endpoints (SageMaker or internal) using a job_id
- Have a service that centralizes all score requests, finds the correct current endpoint for a model, emits score events to Kinesis, and tracks the health of model endpoints (sketched after this list)
- A/B test either in scoring service or in product depending on requirements
- deploy prediction jobs using a job_id and a data source (usually sql) that can be configured to output data to S3 or our data warehouse
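A rough sketch of that central scoring path, under a few assumptions on my part (an in-memory endpoint registry, plain HTTP to the model endpoint, boto3 for Kinesis; all names are hypothetical):

```python
import json
import boto3
import requests

kinesis = boto3.client("kinesis")

# model name -> (job_id, live endpoint URL), kept current by the deployment tooling
ENDPOINT_REGISTRY = {
    "churn": ("churn-20200101-abc123", "http://internal-churn-svc/score"),
}  # illustrative entries

def score(model_name: str, features: dict) -> dict:
    job_id, url = ENDPOINT_REGISTRY[model_name]            # find the current endpoint
    prediction = requests.post(url, json=features, timeout=1.0).json()

    # emit a score event for monitoring and offline analysis
    kinesis.put_record(
        StreamName="score-events",
        PartitionKey=model_name,
        Data=json.dumps({"model": model_name, "job_id": job_id,
                         "features": features, "prediction": prediction}),
    )
    return prediction
```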
So far this has been pretty solid for us. The tradeoff has been that there's a step between notebook and production for ML engineers which can slow them down, but it forces code review and increases the number of tests checked in.
We are currently looking at MLflow[3] for the tracking server, but it has some major pain points. We use Tune[4] for hyperparameter search, and MLflow provides no way to delete artifacts from the parallel runs, which will lead to massive amounts of wasted storage or dangerous external cleanup scripts. They have also been resisting requests for the feature in numerous issues. Not a good open source solution in the space.
Note that this is for an embedded deployment environment.
[1] https://github.com/onnx/onnx
Although in one case we had very tight latency requirements (i.e. 10ms), so the ML results were pre-computed and loaded from a cache on the backend servers.
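The precompute-and-cache pattern is roughly this (a sketch assuming Redis as the cache and a hypothetical batch loader; the commenter doesn't say what they actually use):

```python
import pickle
import redis

r = redis.Redis(host="cache.internal", port=6379)

with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# offline batch job: score everything up front
for user_id, features in load_all_users():      # hypothetical loader over the user base
    score = float(model.predict([features])[0])
    # the backend then does a single O(1) cache lookup per request,
    # comfortably within a 10ms budget
    r.set(f"score:{user_id}", score)
```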
Sometimes the end result is gRPC services, sometimes it's some sort of serialized model (weights). Sometimes the model is specified in protobuf. Very rarely it's an HTTP API. I don't fancy those.
Ironically, I haven't done much with distributed models. Or if it's distributed, it's not some Kafka-esque monstrosity.
I rarely use Python for anything other than exploratory analyses now.
Being able to type `go build .` and have it run anywhere is pretty awesome
One dead simple way to do this (R model -> Java production) that I've done in the past is to use PMML (via the pmml package), which converts models to an XML representation. ONNX is a similar/newer framework along these lines. You can also look at dbplyr for performing (dplyr-like) data preprocessing in-database.
Training:
Currently we just use a bunch of beefy desktop workstations for training (using Pytorch).
Deployment:
This is the vast majority of our cost. Each time a paraphrase request comes in we add it to a queue through Google Cloud Pub/Sub. We have a cluster of GPU (T4) servers pulling from the queue, generating paraphrases and then sending the responses back through Redis pub/sub. Ideally we would have a system that makes it easier to batch sentences of similar length together, but this seems to be the most cost-effective setup that is still relatively simple to put together for models that are too computationally expensive for CPU.
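A rough sketch of what a worker on that queue might look like, assuming the google-cloud-pubsub and redis client libraries; the project/subscription names, reply-channel scheme, and the paraphrase call are all placeholders:

```python
import json
import redis
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("my-project", "paraphrase-requests")
reply_bus = redis.Redis(host="redis.internal")

def paraphrase_batch(sentences):
    """Placeholder for the GPU model call; batching amortizes its cost."""
    return [s.upper() for s in sentences]

while True:
    # pull up to 32 requests at a time so the GPU sees a reasonably full batch
    pulled = subscriber.pull(request={"subscription": subscription, "max_messages": 32})
    if not pulled.received_messages:
        continue

    jobs = [json.loads(m.message.data) for m in pulled.received_messages]
    outputs = paraphrase_batch([j["text"] for j in jobs])

    # publish each result on the caller's reply channel, then ack the batch
    for job, out in zip(jobs, outputs):
        reply_bus.publish(job["reply_channel"], json.dumps({"paraphrase": out}))
    subscriber.acknowledge(request={
        "subscription": subscription,
        "ack_ids": [m.ack_id for m in pulled.received_messages],
    })
```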
PMML is a language-agnostic model specification (XML-based). The Python and R machine learning ecosystems can easily generate it (caveat: I've only tried this for GBDT and linear models, and I'm not sure it works well for neural nets).
Openscoring is a Java library that creates a REST API for scoring models. It's lightweight, battle-tested, has a nice API and good model versioning, and in my experience is 10x faster than Python Flask. You don't need to write any Java code; just download and run the .jar and POST valid PMML to the right endpoint.
Another feasible approach is SageMaker deploy: code in a Jupyter notebook can deploy an API in one line. I think this can be less economical and have higher latency if you have high usage, but a data scientist can do model updates from within a notebook.
Please NEVER hardcode regression model coefficients within Java. This is a nightmare to maintain, prevents increasing model complexity, and is no simpler than PMML + Openscoring. I think you can wrap the Java PMML library in another Java web framework like Spring if you need something more bespoke.
https://www.rdocumentation.org/packages/pmml/versions/2.1.0/...
https://github.com/openscoring/openscoring
https://aws.amazon.com/blogs/machine-learning/using-r-with-a...
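To make the PMML + Openscoring flow concrete, here's a hedged sketch from the Python side, assuming the sklearn2pmml package for export and Openscoring's default REST layout (double-check the exact endpoint paths and field names against the links above):

```python
import requests
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

# 1. Train and export the model as PMML
X, y = load_iris(return_X_y=True, as_frame=True)
pipeline = PMMLPipeline([("clf", LogisticRegression(max_iter=1000))])
pipeline.fit(X, y)
sklearn2pmml(pipeline, "iris.pmml")

# 2. Deploy to a running Openscoring server (java -jar the Openscoring server jar)
model_url = "http://localhost:8080/openscoring/model/iris"
with open("iris.pmml", "rb") as f:
    requests.put(model_url, data=f, headers={"Content-Type": "text/xml"})

# 3. Score a record over REST
resp = requests.post(model_url, json={"arguments": {
    "sepal length (cm)": 5.1, "sepal width (cm)": 3.5,
    "petal length (cm)": 1.4, "petal width (cm)": 0.2,
}})
print(resp.json())
```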
Here are the artifacts we produce:
1. For new models we often build a demo endpoint/glue code written in Python/Flask that can be compared against the prod output in dev/psup.
2. Deep learning models (much of what I do personally): saved in the TF SavedModel format. If it is an update to an existing model, it is often just a drop-in replacement. If it is a brand new model I will often include a Flask demo (the Python code does the proper data transformation before calling on TF). On the production side, after testing/regression these models are deployed via tensorflow-serving containers and used as gRPC endpoints. For production, whatever data pre-processing needs to be done is written by the backend team, who compare their preprocessing output with our demo.
3. Logistic regression/tree models: again, for new models we provide the demo, but what goes into production is either a CSV (logistic regression) or JSON (tree) file of the weights/decision boundaries, which are used as resources by the backend team's Java code.
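For the logistic regression case, the export step can be as small as this; the CSV layout is my assumption, since the commenter doesn't show their format:

```python
import csv
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = LogisticRegression(max_iter=5000).fit(X, y)

# one row per feature plus the intercept; the Java side re-implements
# sigmoid(intercept + sum(coef_i * x_i)) against this file
with open("logreg_weights.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["feature", "coefficient"])
    writer.writerow(["__intercept__", float(model.intercept_[0])])
    for name, coef in zip(X.columns, model.coef_[0]):
        writer.writerow([name, float(coef)])
```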
The overall flow is:
ETL (via apache airflow/custom code) => model training/feature engineering => (saved model file + flask demo endpoint/documentation on feature transformations) => dev incorporate model/test into java backend => comparison of demo vs java backend => regression of java backend (if they had previous versions of model) => psup (small amount of prod data duplicated and ran in parallel with prod) => prod (model deployed + monitored)
There is a caveat that we also do some batch processing/not really live analysis that is just done in python and then results are pushed wherever they need to be pushed. In this case we don't involve the backend/java team.
Models and feature engineering done in Python, trained locally, weights uploaded to S3. A Dockerfile with a tiny little web server gets deployed through our CI/CD pipeline for serving.
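That tiny web server can be as small as the sketch below (Flask, the bucket/key names, and the pickle format are my assumptions):

```python
import pickle
import boto3
from flask import Flask, jsonify, request

app = Flask(__name__)

# pull the trained weights from S3 once, at container start-up
s3 = boto3.client("s3")
s3.download_file("my-model-bucket", "churn/model.pkl", "/tmp/model.pkl")
with open("/tmp/model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]          # e.g. [[1.2, 3.4, ...]]
    return jsonify({"predictions": model.predict(features).tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```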
Soon: Argo workflows + Polyaxon for data collection, feature engineering, training etc. Push the best model to S3; the same CI/CD process with a Docker container deploys a little web server onto our Kubernetes environment.
Deep learning stuff will probably use a similar setup, but with PyTorch instead of Sklearn. Would like to look at serving with ONNX exporting.
When the Julia packages evolve a little more, will be looking forward to using that in production.
We have a team of 6 DS working on parallel projects. We tried the Java service approach. It's great for a one-time model, but very painful to iterate on.
We develop on top of Sagemaker, and since we're a funded company, we can somewhat get away with the 40% price increase of "ML instances".
We have a mix of R/Python models. For each, we keep a separate repo with a Dockerfile, build file, and src code.
Training:
If the jobs are small, we train them locally, package assets into the container, and deploy. If it's a bigger job, we leverage Sagemaker training jobs and S3 for model storage.
Serving:
We have boilerplate web service layer with an entrypoint that DS fills in with their own code. Yes, this allows almost arbitrary code to be written, but we do force code reviews and enforce standards. Convention over configuration.
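A hedged sketch of what that fill-in-the-blank entrypoint convention might look like (the module layout and hook names are hypothetical, not their actual boilerplate):

```python
# predictor.py -- the only file the data scientist edits; the shared web
# service layer imports this module and calls the two hooks below.
import pickle

_model = None

def load():
    """Called once at container start-up."""
    global _model
    with open("/opt/ml/model/model.pkl", "rb") as f:   # SageMaker's default model path
        _model = pickle.load(f)

def predict(payload: dict) -> dict:
    """Called per request with the parsed JSON body."""
    features = [[payload["age"], payload["income"]]]   # illustrative feature engineering
    return {"score": float(_model.predict(features)[0])}
```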
We do the feature engineering using Python/R, which when parallelized, has good enough performance (sub 200ms latency on Sagemaker prod). If we need latencies in the 1-10ms range, we'd consider refactoring the feature engineering into a separate layer written in a more performant language. It's always FE that takes the most time.
One learning from tuning Python services is: for max performance, try to push the feature engineering work onto consumers. Have them fully specify the shape of the data in a format that your models expect so you do as little feature engineering in the serving step.
Continuous Training: Lambdas triggering SageMaker BYO training jobs.
Continuous Model Deployment: Lambdas polling SageMaker training jobs to update SageMaker BYO endpoints.
Inference endpoint: Lambda proxying to a SageMaker endpoint.
Our BYO endpoints allow us to do batch inference without a bunch of round trips between Lambda and SageMaker.
If endpoint costs get too high, we’ll implement some caching at the lambda layer.
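The inference-endpoint piece of that (a Lambda proxying a whole batch to a SageMaker BYO endpoint in one call) might look roughly like this; the endpoint name and payload shape are assumptions:

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

def handler(event, context):
    """Lambda entrypoint: forward a whole batch of records in one call
    instead of one SageMaker round trip per record."""
    records = json.loads(event["body"])["records"]     # e.g. a list of feature dicts
    resp = runtime.invoke_endpoint(
        EndpointName="my-byo-endpoint",                # hypothetical endpoint name
        ContentType="application/json",
        Body=json.dumps({"instances": records}),
    )
    predictions = json.loads(resp["Body"].read())
    return {"statusCode": 200, "body": json.dumps(predictions)}
```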
- use builtin Spark ML models
- call a model running as a service
- write files for a model to ingest (for a legacy project)
- develop a custom plugin or UDF (for calling via SQL)
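For the last option, a minimal PySpark illustration of scoring through a UDF callable from SQL (the commenter's framework is Arc on the JVM, so this is just to show the shape; the model file and table are placeholders):

```python
import pickle
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

with open("model.pkl", "rb") as f:          # hypothetical pre-trained model
    model = pickle.load(f)

@pandas_udf(DoubleType())
def score(amount: pd.Series) -> pd.Series:
    # score a whole batch of rows at once instead of row-by-row
    return pd.Series(model.predict(amount.to_frame()))

spark.udf.register("score", score)          # now callable from SQL
spark.sql("SELECT id, score(amount) AS prediction FROM transactions").show()
```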
We have built-in stages for running Spark ML models in the framework, as well as HTTP and Tensorflow Serving stages to call services. We recently ran a series of NLP models written in Python and OCaml via the HTTP stage, sending payloads either in JSON or other formats the services needed. The text extraction via OCR (tesseract) had been done as a prior Spark stage. This design allows us to call these more custom ML models but keep them part of a larger Spark job and use SQL and other features when needed. The services were deployed on AWS Fargate to allow for scaling. For other jobs we are deploying our Arc jobs using Argo for orchestration. We spin up compute on demand vs running inside a persistent cluster.
For training we use Jupyter Notebooks where possible. We have a plugin that generates Arc jobs from these notebooks.
For special cases we can add custom plugins or UDF functions to extend the framework. I have done similar plugins to run XGBoost models in Spark for example.
Whilst we try to be prescriptive around the ML stack for data scientists, this approach has allowed flexibility where needed and for different teams to own their part of the job. This is particularly useful in larger teams where development is more federated.
Models are written in Python (a mix of PyTorch/NLP/TensorFlow) and serve about 35 predictions/second on average. The API server is also written in Python; the API server container writes incoming requests to a distributed queue cluster, and the models pick up samples from the queue in batches. This lets us experiment with different flavors of a model based on routing that is set at deployment time, which in turn is stored in the cache. We use AWS managed services for the cache, queuing, and container orchestration. Next: 1) The training and production pipelines are currently separate and we want to combine them, possibly using MLflow, Airflow, or KubeFlow; deployment to production is done through Jenkins. 2) Active retraining and automatic deployment to production. 3) Tying the version of the model in production back to the model that was trained; there is currently no way for us to trace that back.
Harness exposes a framework for adding Engines and does all the routing for Engine Instance workflow and lifecycle management. It also provides a toolbox of abstractions for using the Spark ecosystem with Mongo and Elasticsearch.
It comes as a docker-compose system for vertical scaling and Kubernetes for the ultimate in scaling and automation. Quite a nice general system with out-of-the-box usefulness.
We don't yet have a polyglot architecture, but we do have the requirement of running distributed services (partly because certain components of the pipeline need to be run on-premise), and we have found that workflow engines / orchestrators definitely make it a lot easier to reason about the wider architecture / have a bird's eye view. It works for us. No need to handle callbacks, events, queues, etc. We also have the potential to run a polyglot architecture.
We tried out Celery Workflows, and struggled to get it "production ready", so I'd advise against this for complex workflows. We also found the visibility lacking.
We have yet to fully try out Kubeflow, and MLflow. What is not quite working at the moment is creating, and deploying portable models. And I don't mean simply pickling, and storing an artifact.
Leveraging containers (Docker), and slapping simple anti-corruption layers (e.g. simple web APIs) has also helped. We have a more consistent way of deploying, and isolating code without having to rewrite much.
We want to look into using Nuclio, and/or knative to ease the process of deployment, and to empower the data scientists to deliver without much engineering expertise.
Others have mentioned using base classes or standard interfaces for their models. We tried this too, but it didn't work. The generalisation early on was met with conflicting requirements, and broke the interface segregation principle (not that it matters too much, but it can be confusing to not know precisely what is being used or not used). We figured it's much easier to procrastinate on any abstractions. Let the data, and its flow, do the talking.
FastAPI for quickly creating new API endpoints. It has automatic _interactive_ docs and super simple data validation via Python typehints, so that we don't waste compute time with malformed data. https://fastapi.tiangolo.com/
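For illustration, a minimal FastAPI endpoint along those lines; the feature names and the model call are placeholders, but the validation behaviour (rejecting malformed payloads with a 422 before any compute happens) is how FastAPI works:

```python
from fastapi import FastAPI
from pydantic import BaseModel

class Features(BaseModel):      # request schema enforced from the type hints
    age: float
    income: float

app = FastAPI()                 # interactive docs served automatically at /docs

@app.post("/predict")
def predict(features: Features):
    # malformed or missing fields never get this far -- FastAPI returns 422 first
    score = 0.3 * features.age + 0.001 * features.income   # stand-in for a real model
    return {"score": score}
```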
We deploy on prem most of the time, but have started using GCP on occasion.
I guess we are a bit of an outlier, but we deploy the ML using Java / JVM. Not really in the same league as others here so the models are simple enough that the various Java ML frameworks are fine for it (DL4J, Smile, etc). We even do a lot of the interactive exploratory / training type work on the JVM (though via Groovy and Scala with BeakerX notebooks [1] - sometimes combined with Python and R).
I think as the field matures a lot more could move to this model.
I contract for some clients in fintech and some defense-related stuff.
Jobs are scheduled (Azkaban) for reruns/re-training and pushed from the data env to the feature/model store in the live env (Cassandra). Online models are exported to the SavedModel format and can be loaded on any TF platform, e.g. Java backends.
Online inference using TF Serving. Clients query models via grpc.
A lot of our models are NN embedding lookups; we use Annoy for indexing those.
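A small sketch of the Annoy build/query cycle for those embedding lookups (the dimensions and vectors here are made up):

```python
import random
from annoy import AnnoyIndex

DIM = 64  # embedding dimension (illustrative)

# stand-in for embeddings produced by the trained model
item_vectors = {i: [random.random() for _ in range(DIM)] for i in range(1000)}

index = AnnoyIndex(DIM, "angular")
for item_id, vector in item_vectors.items():
    index.add_item(item_id, vector)
index.build(10)          # number of trees; more trees -> better recall, bigger index
index.save("items.ann")

# at serving time: memory-map the index and query nearest neighbours
serving_index = AnnoyIndex(DIM, "angular")
serving_index.load("items.ann")
query = item_vectors[0]
print(serving_index.get_nns_by_vector(query, 10))   # ids of the 10 closest items
```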
- CTO (chris at vetd.com)
We deploy the models as HTTP endpoints and consume them in R/Python/Excel.
We also have advanced functionality available to enterprise clients that is exposed as APIs with a customised JSON format to trigger various agents.
It looks to be a time saver.