HACKER Q&A
📣 imagiko

What is your ML stack like?


How did your team build out AI/ML pipelines and integrate them with your existing codebase? For example, how did your backend team (using Java?) work in sync with data teams (using R or Python?) so that deploying models to production required as little rewriting/glue code as possible? What architectural decisions worked, or didn't?

I'm currently working to make an ML model written in R work on our backend system written in Java. After the dust settles I'll be looking for ways to streamline this process.


  👤 aaron-santos Accepted Answer ✓
What didn't work:

Shipping pickled models to other teams.

Deploying SageMaker endpoints (too costly).

Requiring editing of config files to deploy endpoints.

What did work:

Shipping HTTP endpoints.

Deriving API documentation from model docstrings.

Deploying Lambdas (less costly than SageMaker endpoints).

Writing a ~150-line Python script to pickle the model and save a requirements.txt, some API metadata, and test input/output data (sketched below).

Continuous deployment (after the model is saved, no manual intervention is needed as long as the model's response matches the test output data).
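A minimal sketch of what a packaging script along those lines might look like (the file layout, names, and pip-freeze step are assumptions, not the actual script):

    # Hypothetical packaging script: pickle the model, freeze dependencies,
    # and save API metadata plus smoke-test input/output for the CD pipeline.
    import json
    import pickle
    import subprocess
    from pathlib import Path

    def package_model(model, api_metadata, test_input, test_output, out_dir="artifact"):
        out = Path(out_dir)
        out.mkdir(exist_ok=True)

        # Serialize the trained model.
        with open(out / "model.pkl", "wb") as f:
            pickle.dump(model, f)

        # Freeze the current environment so the deploy side can reproduce it.
        reqs = subprocess.run(["pip", "freeze"], capture_output=True, text=True).stdout
        (out / "requirements.txt").write_text(reqs)

        # API metadata (route, input schema, ...) and smoke-test data; the CD
        # pipeline deploys only if the live endpoint reproduces expected_output.
        (out / "api.json").write_text(json.dumps(api_metadata, indent=2))
        (out / "test_io.json").write_text(json.dumps(
            {"input": test_input, "expected_output": test_output}, indent=2))
        return out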


👤 rozgo
Custom Unreal Engine simulator, simulating agents with NVIDIA PhysX and publishing sensors through GStreamer. GStreamer has sinks and sources for ROS, and TensorFlow elements for inference. We package this all into NVIDIA Docker for scalable simulations. The setup is similar for training and inference. The core framework is a streaming engine with stream combinators that enable reasoning about spatio-temporal data streams, where each datum is tied to a point in space and time. The goal is for tensors to be the streaming primitives, but the pipeline is still fragile; it's a challenge to keep all this working with so many core technologies changing constantly (UE4 + PhysX + CUDA + cuDNN + TensorFlow + …). We train robots in simulation.

Most of our tools are built in Rust. Several of them are for creating (and cleaning) data streams out of datasets, which are then converted into tensors or ROS messages.


👤 atheriel
Our organization looks similar: models are almost all written in R but the business operates in C#. We just use simple HTTP APIs to intermediate. Specifically, we do the following:

1. Use R packages to bundle up the models with a consistent interface.

2. Create thin Plumber APIs that wrap these packages/models.

3. Build Docker images from these APIs.

4. Deploy API containers to a Docker Swarm (but you could use any orchestration).

5. Stick Nginx in front of them to get pretty, human-readable routes.

6. Call the models via HTTP from C#.

This stack works pretty well for us. Response times are generally fast and throughput is acceptable. And the speed at which we can get models into production is massively better than when we used to semi-manually translate model code to C#...

Probably the biggest issues were getting everyone on board with a standard process and API design (after a few iterations), and putting in place all the automation/process/culture needed to help data teams write robust, production-ready software.


👤 NegatioN
We train models as Kubernetes cron jobs, defined by a minimal properties file per model specifying the number of CPUs/GPUs and the memory. Each job starts with a given image (e.g. PyTorch or TF) based on where in the repository these files are placed, and then runs a user-specified bash file to start the job.

Data scientists have a similar Docker image running in Kubernetes, which bundles all of these framework images as conda environments for experimenting in prod-like environments. Spark is used to fetch data for the most part.

Models report a finished state over Kafka after being persisted to buckets in Google Cloud; they are then mirrored over to a Ceph cluster connected to our serving Kubernetes cluster.

We have an in-house Golang server binding to C++ for serving PyTorch neural nets persisted with the torch.jit API (I can really recommend this for hassle-free model serving). We also have some Java apps for serving normal ALS- or Annoy-based models.
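For the torch.jit part, here is a minimal sketch of persisting a model so that a non-Python server (libtorch via C++/Go bindings) can load it; the model itself is a stand-in, not NegatioN's code:

    # Persist a PyTorch model with torch.jit for serving without Python.
    import torch
    import torch.nn as nn

    class TinyRanker(nn.Module):  # stand-in model
        def __init__(self):
            super().__init__()
            self.fc = nn.Linear(16, 1)

        def forward(self, x):
            return torch.sigmoid(self.fc(x))

    model = TinyRanker().eval()

    # Trace with an example input (or use torch.jit.script for data-dependent
    # control flow), then save for the C++/Go serving layer to load.
    traced = torch.jit.trace(model, torch.randn(1, 16))
    traced.save("ranker.pt")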

Our traffic is not as wild as many here, but we're serving around 10M user requests a day.

We also merge the results from several models, joining them together with a separate "meta-model" that estimates which models the user has recently shown a preference for, in order to weight those up.

There are probably a lot of details left out here, especially about the serving part, since we have various services in front of the models enriching data and presenting it to the user, but that's the gist of it.


👤 __erik
What has worked fairly well so far:

Models:

- Models are structured as Python packages; each model inherits from a base class

- The base class defines how to train and how to predict (as well as a few other, more specific things)

- ML engineers can override the model serialization methods; the default is just pickle (see the sketch below)
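As a rough illustration of that convention (the method names are assumptions, not __erik's actual interface):

    # Sketch of the "every model is a package inheriting a base class" idea.
    import pickle
    from abc import ABC, abstractmethod

    class BaseModel(ABC):
        @abstractmethod
        def train(self, X, y):
            """Fit the model on training data."""

        @abstractmethod
        def predict(self, X):
            """Return predictions for X."""

        # Default serialization is pickle; ML engineers may override these.
        def save(self, path):
            with open(path, "wb") as f:
                pickle.dump(self, f)

        @classmethod
        def load(cls, path):
            with open(path, "rb") as f:
                return pickle.load(f)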

Infra:

- Code is checked in to GitHub; a Docker container is built on each merge into master

- We use SageMaker BYO containers to train models; each job gets a job_id that represents the state produced by that job (code + data used)

Inference / deployment:

- Deploy models as HTTP endpoints (SageMaker or internal) using the job_id

- Have a service that centralizes all score requests, finds the correct current endpoint for a model, emits score events to Kinesis, and tracks the health of model endpoints

- A/B test either in the scoring service or in the product, depending on requirements

- Deploy prediction jobs using a job_id and a data source (usually SQL) that can be configured to output data to S3 or our data warehouse

So far this has been pretty solid for us. The tradeoff is that there's a step between notebook and production for ML engineers, which can slow them down, but it forces code review and increases the number of tests checked in.


👤 Datenstrom
We are framework-agnostic for model development: models get converted to ONNX [1] and served with the ONNX Runtime [2]. They are deployed as microservices with Docker.
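As a rough illustration of that flow (the PyTorch export here is an assumption; any framework with an ONNX exporter works the same way):

    # Export a model to ONNX, then serve it with onnxruntime only.
    import numpy as np
    import torch
    import torch.nn as nn
    import onnxruntime as ort

    model = nn.Sequential(nn.Linear(8, 1), nn.Sigmoid()).eval()  # stand-in model

    # Training-framework side: export to the framework-agnostic ONNX format.
    dummy = torch.randn(1, 8)
    torch.onnx.export(model, dummy, "model.onnx",
                      input_names=["input"], output_names=["score"])

    # Serving side (e.g. inside the microservice container): onnxruntime only.
    session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
    features = np.random.rand(1, 8).astype(np.float32)
    print(session.run(["score"], {"input": features})[0])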

We are currently looking at MLflow [3] for the tracking server, though it has some major pain points. We use Tune [4] for hyperparameter search, and MLflow provides no way to delete artifacts from the parallel runs, which leads to massive amounts of wasted storage or dangerous external cleanup scripts. They have also been resisting requests for that feature in numerous issues. There isn't a good open-source solution in this space.

Note that this is for an embedded deployment environment.

[1] https://github.com/onnx/onnx

[2] https://github.com/Microsoft/onnxruntime

[3] https://mlflow.org/

[4] https://ray.readthedocs.io/en/latest/tune.html


👤 marcinzm
The systems I've seen basically break things into different services, tied together with gRPC or Thrift, which have code generators for most languages. So the Java backend simply makes RPC requests to a server running R.

In one case, though, we had very tight latency requirements (i.e. 10 ms), so the ML results were pre-computed and loaded from a cache on the backend servers.
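To make the RPC pattern above concrete, here is a hedged sketch of a caller in Python; the scoring_pb2/scoring_pb2_grpc modules are hypothetical stubs that grpcio-tools would generate from a shared scoring.proto (in the setup described, the equivalent stubs would be generated for Java):

    # Illustrative only: assumes `python -m grpc_tools.protoc` has generated
    # scoring_pb2 / scoring_pb2_grpc from a hypothetical scoring.proto shared
    # by the backend and the server wrapping the R model.
    import grpc
    import scoring_pb2
    import scoring_pb2_grpc

    def score(features):
        channel = grpc.insecure_channel("ml-scorer:50051")
        stub = scoring_pb2_grpc.ScorerStub(channel)
        request = scoring_pb2.ScoreRequest(features=features)
        response = stub.Predict(request, timeout=0.2)  # tight latency budget
        return response.score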


👤 chewxy
Almost entirely in Go. I use Gorgonia [0] and Gonum [1]. Granted I wrote Gorgonia. All solutions fit into the company's CI/CD infra with almost no additional overhead.

Sometimes the end result is gRPC services, sometimes it's some sort of serialized model (weights). Sometimes the model is specified in protobuf. Very rarely it's an HTTP API; I don't fancy those.

Ironically I haven't done much distributed models. Or if it's distributed, it's not some Kafka-esque monstrosity.

I rarely use Python for anything other than exploratory analyses now.

Being able to type `go build .` and have it run anywhere is pretty awesome.

[0] https://gorgonia.org [1] https://gonum.org


👤 jointpdf
Check out the CRAN task view on this topic: https://cran.r-project.org/web/views/ModelDeployment.html

One dead-simple way to do this (R model -> Java production) that I've done in the past is to use PMML (via the pmml package), which converts models to an XML representation. ONNX is a similar, newer framework along these lines. You can also look at dbplyr for performing (dplyr-like) data preprocessing in-database.


👤 saternius
What we do at https://quillbot.com

Training:

Currently we just use a bunch of beefy desktop workstations for training (using PyTorch).

Deployment:

This is the vast majority of our cost. Each time a paraphrase request comes in, we add it to a queue through Google Cloud Pub/Sub. We have a cluster of GPU (T4) servers pulling from the queue, generating paraphrases, and then sending the responses back through Redis pub/sub. Ideally we would have a system that makes it easier to batch sentences of similar length together, but this seems to be the most cost-effective approach that is still relatively simple to put together for models that are too computationally expensive for the CPU.
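A very rough sketch of that worker loop (the project, subscription, and channel names are placeholders, and run_model stands in for the actual GPU inference):

    # GPU worker: pull paraphrase requests from Cloud Pub/Sub, run the model,
    # publish the response back over Redis pub/sub.
    import json
    import redis
    from google.cloud import pubsub_v1

    redis_client = redis.Redis(host="redis", port=6379)
    subscriber = pubsub_v1.SubscriberClient()
    subscription = subscriber.subscription_path("my-project", "paraphrase-requests")

    def run_model(sentence):
        return sentence  # placeholder for batched GPU paraphrase generation

    def handle(message):
        payload = json.loads(message.data)
        result = run_model(payload["sentence"])
        redis_client.publish(payload["reply_channel"],
                             json.dumps({"paraphrase": result}))
        message.ack()

    # Block and process messages as they arrive.
    subscriber.subscribe(subscription, callback=handle).result()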


👤 gautamcgoel
I'm a PhD student at Caltech, working on the theoretical foundations of ML. I personally don't do a lot of coding, but basically everyone in my department uses Python (especially Pytorch) for deep learning/ML. This all runs on Nvidia GPUs (never seen an AMD GPU in the office). Occasionally people code in Matlab, especially if they work in optimization or control. Tmux and git are the only command line tools I see commonly used. Occasionally people ssh into an Amazon box if they need more compute.

👤 oli5679
I'd recommend exporting the R model as a PMML file and getting your Java team to interact with an Openscoring server.

PMML is a language-agnostic model specification (XML-like). The Python and R machine learning ecosystems can easily generate these (caveat: I've only tried it for GBDT and linear models, and I'm not sure it works well for neural nets).

Openscoring is a Java library that creates a REST API for scoring models. It's lightweight and battle-tested, with a nice API and good model versioning, and in my experience it's 10x faster than Python Flask. You don't need to write any Java code; just download and run the .jar and post valid PMML to the right endpoint.
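For flavor, a sketch of deploying and scoring a PMML file over HTTP; the port, paths, and field names are assumptions based on my reading of the Openscoring docs, so verify them against the project README:

    # Deploy a PMML model to a running Openscoring server and score one row.
    import requests

    BASE = "http://localhost:8080/openscoring/model/MyModel"  # assumed default path

    # Deploy: PUT the PMML document produced by R/Python.
    with open("my_model.pmml", "rb") as f:
        requests.put(BASE, data=f, headers={"Content-Type": "application/xml"})

    # Score: POST a JSON payload with the model's input fields (names assumed).
    resp = requests.post(BASE, json={"arguments": {"age": 42, "income": 55000.0}})
    print(resp.json())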

Another feasible approach is SageMaker deploy: code in a Jupyter notebook can deploy an API in one line. I think this can be less economical and have higher latency if you have high usage, but a data scientist can do model updates from within a notebook.

Please NEVER hardcode regression model coefficients within Java. This is a nightmare to maintain, prevents increasing model complexity, and is no simpler than PMML + Openscoring. I think you can wrap the Java PMML library in another Java web framework like Spring if you need something more bespoke.

https://www.rdocumentation.org/packages/pmml/versions/2.1.0/...

https://github.com/openscoring/openscoring

https://aws.amazon.com/blogs/machine-learning/using-r-with-a...


👤 ivalm
We have a bit of the problem you mention. Our backend/app is in Java, but the DS/ML team generally works in Python. The ML team basically doesn't ship production code.

Here are the artifacts we produce:

1. For new models we often build demo endpoints/glue code written in Python/Flask that can be compared against the prod output in dev/psup.

2. Deep learning models (much of what I do personally): saved in the TF SavedModel format (see the sketch after this list). If it is an update to an existing model, it is often just a drop-in replacement. If it is a brand-new model, I will often include a Flask demo (the Python code does the proper data transformation before calling TF). On the production side, after testing/regression these models are deployed via tensorflow-serving containers and used as gRPC endpoints. For production, whatever data pre-processing needs to be done is written by the backend team, who compare their preprocessing output with our demo.

3. Logistic regression/tree models: again, for new models we provide the demo, but what goes into production is either a CSV (logistic regression) or JSON (tree) of the weights/decision boundaries, which are used as resources by the backend team's Java code.
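A minimal sketch of the SavedModel hand-off in point 2 (the model here is a stand-in; the export call is the standard TF 2.x one). A tensorflow/serving container pointed at export/my_model then exposes the model over gRPC:

    # Export a Keras/TF model in the SavedModel format that
    # tensorflow-serving containers load directly.
    import tensorflow as tf

    model = tf.keras.Sequential([  # stand-in model
        tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

    # TF Serving expects a numeric version directory under the model name.
    tf.saved_model.save(model, "export/my_model/1")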

The overall flow is:

ETL (via Apache Airflow/custom code) => model training/feature engineering => (saved model file + Flask demo endpoint/documentation on feature transformations) => devs incorporate the model/tests into the Java backend => comparison of demo vs. Java backend => regression of the Java backend (if they had previous versions of the model) => psup (a small amount of prod data duplicated and run in parallel with prod) => prod (model deployed + monitored)

One caveat: we also do some batch processing/not-really-live analysis that is just done in Python, and the results are pushed wherever they need to be pushed. In that case we don't involve the backend/Java team.


👤 FridgeSeal
Currently:

Models and feature engineering are done in Python and trained locally, with the weights uploaded to S3. A Dockerfile with a tiny little web server gets deployed through our CI/CD pipeline for serving.
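A bare-bones sketch of that serving container's entrypoint (the bucket/key names, joblib, and Flask are assumptions about what the tiny web server looks like):

    # At container start, pull the trained weights from S3, then serve them.
    import boto3
    import joblib
    from flask import Flask, jsonify, request

    s3 = boto3.client("s3")
    s3.download_file("ml-artifacts", "churn/model.joblib", "/tmp/model.joblib")
    model = joblib.load("/tmp/model.joblib")

    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        features = request.get_json()["features"]
        return jsonify({"prediction": model.predict([features]).tolist()})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8080)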

Soon: Argo Workflows + Polyaxon for data collection, feature engineering, training, etc. Push the best model to S3, and the same CI/CD process deploys a little web server in a Docker container onto our Kubernetes environment.

Deep learning stuff will probably use a similar setup, but with PyTorch instead of scikit-learn. We would like to look at serving via ONNX export.

When the Julia packages evolve a little more, we'll look forward to using them in production.


👤 atak1
Background:

We have a team of 6 DS working on parallel projects. We tried the Java service approach. It's great for a one-time model, but very painful to iterate on.

We develop on top of SageMaker, and since we're a funded company, we can somewhat get away with the ~40% price premium of "ML instances".

We have a mix of R/Python models. For each, we keep a separate repo with a Dockerfile, build file, and src code.

Training:

If the jobs are small, we train them locally, package the assets into the container, and deploy. If it's a bigger job, we leverage SageMaker training jobs and S3 for model storage.

Serving:

We have a boilerplate web service layer with an entrypoint that the DS fills in with their own code. Yes, this allows almost arbitrary code to be written, but we do force code reviews and enforce standards. Convention over configuration.

We do the feature engineering using Python/R, which, when parallelized, has good enough performance (sub-200ms latency on SageMaker prod). If we needed latencies in the 1-10ms range, we'd consider refactoring the feature engineering into a separate layer written in a more performant language. It's always the feature engineering that takes the most time.

One lesson from tuning Python services: for max performance, try to push the feature engineering work onto consumers. Have them fully specify the shape of the data in the format your models expect, so you do as little feature engineering as possible in the serving step.


👤 orasis
Data ingest: AWS Lambda (JavaScript) using the Serverless Framework to Kinesis Firehose.

Continuous Training: Lambdas triggering SageMaker BYO training jobs.

Continuous Model Deployment: Lambdas polling SageMaker training jobs to update SageMaker BYO endpoints.

Inference endpoint: Lambda proxying to a SageMaker endpoint.

Our BYO endpoints allow us to do batch inference without a bunch of round trips between Lambda and SageMaker.

If endpoint costs get too high, we’ll implement some caching at the lambda layer.
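A hedged sketch of the inference-endpoint Lambda (the endpoint name, API Gateway event shape, and payload format are assumptions):

    # Lambda handler proxying inference requests to a SageMaker endpoint.
    import json
    import boto3

    runtime = boto3.client("sagemaker-runtime")

    def handler(event, context):
        instances = json.loads(event["body"])["instances"]  # batch of inputs
        response = runtime.invoke_endpoint(
            EndpointName="my-byo-endpoint",
            ContentType="application/json",
            Body=json.dumps(instances),
        )
        predictions = json.loads(response["Body"].read())
        return {"statusCode": 200, "body": json.dumps(predictions)}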


👤 brucej
For our current projects we use our open-source Apache Spark framework, Arc (https://arc.tripl.ai/), for feature prep; then, depending on the type of model, we will either:

- use builtin Spark ML models

- call a model running as a service

- write files for a model to ingest (for a legacy project)

- develop a custom plugin or UDF (for calling via SQL)

We have built-in stages for running Spark ML models in the framework, as well as HTTP and TensorFlow Serving stages to call services. We recently ran a series of NLP models written in Python and OCaml via the HTTP stage, sending payloads either in JSON or in other formats the services needed. The text extraction via OCR (Tesseract) had been done as a prior Spark stage. This design allows us to call these more custom ML models while keeping them part of a larger Spark job and using SQL and other features when needed. The services were deployed in AWS Fargate to allow for scaling. For other jobs we are deploying our Arc jobs using Argo for orchestration. We spin up compute on demand vs. running inside a persistent cluster.

For training we use Jupyter Notebooks where possible. We have a plugin that generates Arc jobs from these notebooks.

For special cases we can add custom plugins or UDF functions to extend the framework. I have built similar plugins to run XGBoost models in Spark, for example.

Whilst we try to be prescriptive about the ML stack for data scientists, this approach has allowed flexibility where needed and lets different teams own their part of the job. This is particularly useful in larger teams where development is more federated.


👤 superkitty
We use self-deployable configuration to allow data scientists to control their model's destiny.

Models are written in Python (a mix of PyTorch/NLP/TensorFlow) and serve about 35 predictions/second on average. The API server is also written in Python; the API server container writes incoming requests into a distributed queue cluster, and the models pick up samples from the queue in batches. This lets us experiment with different flavors of a model based on routing set at deployment time, which in turn is stored in the cache. We use AWS-managed cache, queuing, and container orchestration platforms.

Next steps: 1) the training and production pipelines are currently separate, and we want to combine them, possibly using MLflow, Airflow, or Kubeflow (deployment to production is done through Jenkins); 2) active retraining and automatic deployment to production; 3) tying the version of the model in production back to the model being trained, since right now there is no way for us to tie it back.


👤 Sloppy
We developed an OSS ML server called Harness. It does all the ingest, preparation, algorithm management, and workflow bits for pluggable ML "Engines". These are algorithms + datasets + models, and they are flexible enough to do most anything. We use the built-in Universal Recommender engine, and have built our own Engines for other uses.

Harness exposes a framework for adding Engines and does all the routing for Engine Instance workflow and lifecycle management. It also provides a toolbox of abstractions for using the Spark ecosystem with Mongo and Elasticsearch.

It comes as a docker-compose system for vertical scaling, and as Kubernetes for the ultimate in scaling and automation. Quite a nice general system with out-of-the-box usefulness.

https://github.com/actionml/harness


👤 MrSaints
We are currently experimenting with workflow engines for orchestrating different components, e.g. data ingestion, data preprocessing, feature engineering, scoring, and automated decision making / escalation. Namely, Argo for offline and bulk processing; Zeebe and Cadence (trying both out) for online processing and business logic / application services.

We don't yet have a polyglot architecture, but we do have the requirement of running distributed services (partly because certain components of the pipeline need to run on-premise), and we have found that workflow engines / orchestrators definitely make it a lot easier to reason about the wider architecture and keep a bird's-eye view. It works for us. No need to handle callbacks, events, queues, etc. We also have the potential to run a polyglot architecture.

We tried out Celery workflows and struggled to get them "production ready", so I'd advise against that for complex workflows. We also found the visibility lacking.

We have yet to fully try out Kubeflow and MLflow. What is not quite working at the moment is creating and deploying portable models, and I don't mean simply pickling and storing an artifact.

Leveraging containers (Docker) and slapping on simple anti-corruption layers (e.g. simple web APIs) has also helped. We have a more consistent way of deploying and isolating code without having to rewrite much.

We want to look into using Nuclio and/or Knative to ease the process of deployment, and to empower the data scientists to deliver without much engineering expertise.

Others have mentioned using base classes or standard interfaces for their models. We tried this too, but it didn't work. The early generalisation was met with conflicting requirements and broke the interface segregation principle (not that it matters too much, but it can be confusing not to know precisely what is or isn't being used). We figured it's much easier to postpone any abstractions. Let the data, and its flow, do the talking.


👤 klowrey
Full stack Julia for RL (and trajectory optimization)

👤 ZeroCool2u
Plotly's Dash to prototype front ends that ingest/use the model output. (Our team is Python-only, but others use R, so this works great because Dash supports both.) https://dash.plot.ly/

FastAPI for quickly creating new API endpoints. It has automatic _interactive_ docs and super simple data validation via Python type hints, so we don't waste compute time on malformed data. https://fastapi.tiangolo.com/
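A minimal sketch of that FastAPI pattern (the feature schema and the model call are placeholders):

    # FastAPI endpoint with type-hint-based validation via pydantic:
    # malformed payloads are rejected with a 422 before touching the model.
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class Features(BaseModel):
        age: int
        income: float
        region: str

    @app.post("/predict")
    def predict(features: Features):
        score = 0.5  # placeholder for model.predict(...)
        return {"score": score}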

We deploy on prem most of the time, but have started using GCP on occasion.


👤 zmmmmm
> For example, how did your backend team (using Java?) work in sync with data teams

I guess we are a bit of an outlier, but we deploy the ML using Java / the JVM. We're not really in the same league as others here, so the models are simple enough that the various Java ML frameworks are fine for them (DL4J, Smile, etc.). We even do a lot of the interactive exploratory / training-type work on the JVM (via Groovy and Scala with BeakerX notebooks [1], sometimes combined with Python and R).

I think as the field matures a lot more could move to this model.

[1] http://beakerx.com/


👤 sam0x17
Crystal / SHAInet (https://github.com/NeuraLegion/shainet)

I contract for some clients in fintech and some defense-related stuff.


👤 eggie5
Development of models happens in our data environment: notebooks, PySpark EMR clusters for analytical workloads and offline models, and TensorFlow on EC2 P2s for online models.

Jobs are scheduled (Azkaban) for reruns/re-training and pushed from the data env to the feature/model store in the live env (Cassandra). Online models are exported to the SavedModel format and can be loaded on any TF platform, e.g. Java backends.

Online inference uses TF Serving. Clients query models via gRPC.

A lot of our models are NN embedding lookups; we use Annoy for indexing those.


👤 kevinskii
We're currently running a single NVIDIA RTX 2080 with TensorFlow 2.0 on a Windows 10 station. We'll soon be switching to a standard multi-GPU rig running an air-gapped Linux distro. Linux seems much better overall for ML because of better Docker integration and tensor core support on the newer GPUs. Also, we'll probably be switching from TensorFlow to PyTorch for model development. PyTorch requires a little more code, but debugging is 10x easier.

👤 rhacker
I'm pretty impressed with the level of automation I'm seeing in general. Looks like many are using docker/k8s or containers in some way or another. Inspiring.

👤 tvinko
Hi Aaron, I was having similar issues. My main problem was integrating different programming languages and tools under the same roof, so I started my own ML platform. Currently C# is supported, but others are on the roadmap (Python, R, Node.js...). You can check it out here: https://github.com/Zenodys/ZenDevTool

👤 elwell
On a meta note, if you're interested in viewing & sharing stacks, that's the primary feature of the startup I'm working on: Vetd (app.vetd.com). The communities we host (often VC portfolio companies) share their stacks and leverage that for discounts.

- CTO (chris at vetd.com)


👤 mendeza
How do you all monitor concept drift, or set up monitoring to detect when it's time to deploy a fresh model?

👤 jacquesm
Python (the Conda distribution; not small, but quite easy and batteries-included), Keras, TensorFlow, and NVIDIA hardware (a GTX 1080 Ti; not sure what the current sweet spot for price/performance in GPU land is, but that was the best I could get at the time).

👤 rewritteninrust
I'm also curious (and maybe someone here can chime in) about how you get organizational buy-in for introducing ML. There are a couple of problem areas at my company that I think would be great for ML, but I don't know how to get others on board.

👤 suyash
For my pet projects, I do training and testing locally on my machine, either in notebooks or an IDE, and I test and validate further locally before deploying to a server as a microservice.

👤 cercatrova
We used TensorFlow Serving running in a Docker container that exposes a REST endpoint for predictions, so the backend can stay agnostic to Serving. We personally used Node to query the container for whatever we needed.
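For reference, querying a TF Serving container's REST endpoint looks roughly like this (the model name, port mapping, and input shape are assumptions; cercatrova's team does the equivalent from Node):

    # Call a TensorFlow Serving container's REST predict endpoint.
    import requests

    payload = {"instances": [[0.1, 0.2, 0.3, 0.4]]}
    resp = requests.post("http://localhost:8501/v1/models/my_model:predict",
                         json=payload)
    print(resp.json()["predictions"])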

👤 mistrial9
Our team built a Postgres backend on a single physical server (with a lot of good disk in RAID, among other things). Python ran against the database, calling sklearn libs (skipping the problem-specific libs involved, but they added to either the PG or the Python side). It worked really, really well and was easy to work on, with good architectural separation of the stages in the process. Completed and shipped to an impressed customer. No GPUs.

👤 GolDDranks
scikit-learn + Optuna for hyperparameter search + Python + Docker + AWS Batch/AWS Fargate. We are going to incorporate XGBoost or LightGBM at some point. The AWS services are somewhat slow to start up, but the pipelines are not for online services, so it works.

👤 itronitron
why not write the ML model in Java?

👤 xiaodai
We use our own product https://PI.EXCHANGE

We deploy the models as HTTP endpoints and consume them in R/Python/Excel.

We also have advanced functionality available to enterprise clients, exposed as APIs with a customised JSON format to trigger various agents.


👤 burner6565
My company is considering licensing DataRobot.

It looks to be a time saver.


👤 mister_hn
OpenCV + dlib + CUDA, or Caffe2 as an alternative

👤 copperfitting
Idk, I would love good answers here.