8.3.2. Model Serving Frameworks#
A Flask application wrapping a scikit-learn model is a perfectly valid way to expose predictions through an HTTP API. At small scale, it works well. But as request volume increases, this approach runs into limitations: the Python GIL limits true parallelism, there is no built-in mechanism to batch multiple requests together for efficiency, managing multiple models and their versions requires custom code, and there is no framework-level GPU batching for deep learning models.
Specialized model serving frameworks exist to address these limitations. They provide a runtime layer that sits between your model and the network, adding batching, multi-model management, version switching, health checking, and performance telemetry—features that you would otherwise have to build yourself. They are purpose-built for inference, not for general web application development.
The trade-off is complexity. A serving framework introduces additional infrastructure, configuration formats, and operational concepts. The right decision is straightforward: use Flask or FastAPI directly for prototypes, low-traffic APIs, and internal tools; reach for a serving framework when throughput, latency, or multi-model management requirements outgrow what a simple web framework can provide.
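To make the batching gap concrete, here is a minimal, illustrative sketch (standard library only; all names are hypothetical) of the dynamic batching loop that serving frameworks implement for you: hold each incoming request briefly, then hand a whole batch to the model at once.

```python
import queue
import threading
import time

def dynamic_batcher(requests_q, handle_batch, max_batch_size=8, max_wait_s=0.005):
    # Block for the first request, then keep collecting until the batch
    # is full or the wait window closes -- this is the core trade-off a
    # serving framework tunes for you (throughput vs. added latency).
    while True:
        batch = [requests_q.get()]
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_batch_size:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(requests_q.get(timeout=remaining))
            except queue.Empty:
                break
        handle_batch(batch)

# Demo: feed 10 "requests"; they arrive faster than the wait window,
# so they are grouped into a few batches instead of 10 separate calls.
results = []
q = queue.Queue()
threading.Thread(target=dynamic_batcher, args=(q, results.append),
                 daemon=True).start()
for i in range(10):
    q.put(i)
time.sleep(0.2)
```

A real implementation would also return each result to its originating client and run the model on an accelerator; this sketch only shows the grouping logic.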
8.3.2.1. Major Model Serving Frameworks#
TorchServe (PyTorch)#
Official: PyTorch’s production serving framework
Key Features:
PyTorch-native support
Model versioning
Multi-model serving
RESTful and gRPC APIs
Metrics and logging
A/B testing
Architecture:
Client Request
↓
TorchServe Frontend
↓
Model Workers (parallel)
├── Worker 1 (Model Instance)
├── Worker 2 (Model Instance)
└── Worker 3 (Model Instance)
↓
Response
Conceptual Workflow:
Package Model (create .mar file):
torch-model-archiver \
--model-name resnet18 \
--version 1.0 \
--model-file model.py \
--serialized-file resnet18.pth \
--handler image_classifier
Start Server:
torchserve --start \
--model-store model_store \
--models resnet18=resnet18.mar
Make Predictions:
curl -X POST http://localhost:8080/predictions/resnet18 \
-T image.jpg
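The same request can be issued from Python. A small sketch using only the standard library; the endpoint layout and model name come from the steps above, and `image.jpg` is a placeholder:

```python
import urllib.request

def prediction_url(model_name, host="http://localhost:8080"):
    # TorchServe serves each registered model at /predictions/<model-name>
    return f"{host}/predictions/{model_name}"

def predict_image(image_path, model_name="resnet18"):
    # POST raw image bytes, mirroring `curl -T image.jpg` above.
    # Requires a running TorchServe instance.
    with open(image_path, "rb") as f:
        req = urllib.request.Request(
            prediction_url(model_name), data=f.read(), method="POST")
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()
```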
Benefits:
Optimized for PyTorch models
Active development by PyTorch team
Good documentation
Use when: Deploying PyTorch models in production
TensorFlow Serving#
Official: TensorFlow’s production serving system
Key Features:
High performance (C++ backend)
gRPC and REST APIs
Model versioning and hot-swapping
Request batching
GPU acceleration
Architecture:
SavedModel Format
↓
TensorFlow Serving
├── Model Server (manages versions)
├── Aspired Version Policy
└── Batching Scheduler
↓
Predictions
Conceptual Workflow:
Save Model in SavedModel Format:
# TensorFlow 2.x
model.save('my_model/')
# Creates:
# my_model/
# ├── saved_model.pb
# ├── variables/
# └── assets/
Serve with Docker:
docker run -p 8501:8501 \
--mount type=bind,source=/path/to/my_model,target=/models/my_model \
-e MODEL_NAME=my_model \
tensorflow/serving
Make Predictions:
curl -X POST http://localhost:8501/v1/models/my_model:predict \
-H 'Content-Type: application/json' \
-d '{"instances": [[1.0, 2.0, 3.0, 4.0]]}'
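The request shape matters more than the transport: TensorFlow Serving's REST predict API expects a JSON object with an `instances` list, one entry per input example, and responds with a matching `predictions` list. A sketch that builds the same request as the curl call above (host and port are taken from the Docker command):

```python
import json

def build_predict_request(instances, model_name="my_model",
                          host="http://localhost:8501"):
    # Each entry in `instances` is one input example; the server runs
    # them through the model together and returns {"predictions": [...]}.
    url = f"{host}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances})
    return url, body
```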
Benefits:
Extremely high performance
Mature and battle-tested
Rich feature set
Use when: TensorFlow models, high-performance requirements
NVIDIA Triton Inference Server#
Multi-Framework: Supports TensorFlow, PyTorch, ONNX, and more
Key Features:
Multiple framework support
Dynamic batching
Model ensembles
GPU optimization (CUDA, TensorRT)
Model analyzer
Concurrent model execution
Supported Backends:
TensorFlow
PyTorch (TorchScript)
ONNX Runtime
TensorRT (optimized)
Python (custom)
DALI (preprocessing)
Architecture:
Triton Server
├── Model Repository
│ ├── model1/ (TensorFlow)
│ ├── model2/ (PyTorch)
│ └── model3/ (ONNX)
├── Scheduler
│ ├── Dynamic Batcher
│ ├── Sequence Batcher
│ └── Ensemble Scheduler
└── Inference Backends
├── TensorFlow Backend
├── PyTorch Backend
└── ONNX Backend
Conceptual Setup:
Model Repository Structure:
model_repository/
└── my_model/
    ├── config.pbtxt
    └── 1/
        └── model.onnx
Configuration (config.pbtxt):
name: "my_model"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 10 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
Run Server:
docker run --gpus all -p 8000:8000 \
-v /path/to/model_repository:/models \
nvcr.io/nvidia/tritonserver:latest \
tritonserver --model-repository=/models
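Triton ships an official `tritonclient` Python package; as an illustration of the wire format it speaks, here is a sketch that builds a KServe-v2-style HTTP inference request by hand, matching the config.pbtxt above (one FP32 input named "input" with dims [10]; host and port are assumptions from the Docker command):

```python
import json

def build_triton_request(values, model_name="my_model",
                         host="http://localhost:8000"):
    # v2 protocol: POST /v2/models/<name>/infer with named, typed,
    # shaped inputs. Triton's dynamic batcher may merge concurrent
    # requests like this into one batch on the server side.
    url = f"{host}/v2/models/{model_name}/infer"
    body = json.dumps({
        "inputs": [{
            "name": "input",
            "shape": [1, len(values)],
            "datatype": "FP32",
            "data": values,   # flattened tensor data
        }]
    })
    return url, body
```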
Benefits:
Framework-agnostic
Excellent GPU utilization
Advanced batching strategies
NVIDIA’s optimization expertise
Use when: Multiple frameworks, GPU inference, maximum performance
Seldon Core (Kubernetes-Native)#
Platform: ML deployment on Kubernetes
Key Features:
Kubernetes-native
Multiple ML frameworks
A/B testing
Canary deployments
Explainability integration
Outlier detection
Drift monitoring
Architecture:
Kubernetes Cluster
└── Seldon Deployment
├── Model Server (scikit-learn)
├── Transformer (preprocessing)
├── Combiner (ensemble)
└── Router (A/B testing)
Conceptual Deployment:
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: sklearn-iris
spec:
  predictors:
    - name: default
      replicas: 3
      graph:
        name: classifier
        type: MODEL
        implementation: SKLEARN_SERVER
        modelUri: s3://my-bucket/sklearn-model
        parameters:
          - name: method
            value: predict_proba
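Once deployed, clients speak Seldon's v1 prediction protocol, which wraps inputs in a `data.ndarray` field (the response mirrors the same shape). A sketch of the request body; the ingress path, typically `/seldon/<namespace>/sklearn-iris/api/v1.0/predictions`, depends on your cluster setup:

```python
import json

def build_seldon_request(rows):
    # One inner list per example; for the iris model above, each row
    # holds the four feature values.
    return json.dumps({"data": {"ndarray": rows}})
```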
Benefits:
Deep Kubernetes integration
Advanced deployment strategies
Model explainability built-in
Rich ML operations features
Use when: Kubernetes deployments, advanced ML operations
BentoML#
Unified: Framework-agnostic serving platform
Key Features:
Supports scikit-learn, PyTorch, TensorFlow, XGBoost, etc.
API server generation
Docker containerization
Adaptive batching
Model management
Conceptual Workflow:
Save Model:
import bentoml
from sklearn.ensemble import RandomForestClassifier
# Train model
model = RandomForestClassifier()
model.fit(X_train, y_train)
# Save with BentoML
bentoml.sklearn.save_model("my_model", model)
Create Service:
# service.py
import bentoml
import numpy as np
from bentoml.io import NumpyNdarray
model_runner = bentoml.sklearn.get("my_model:latest").to_runner()
svc = bentoml.Service("classifier", runners=[model_runner])
@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def predict(input_data):
    return model_runner.predict.run(input_data)
Build and Deploy:
# Build container
bentoml containerize classifier:latest
# Run
docker run -p 3000:3000 classifier:latest
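Calling the service is plain HTTP: the `NumpyNdarray` IO descriptor accepts a JSON array on the generated `/predict` endpoint. A sketch that prepares the call with the standard library (port 3000 comes from the Docker command above):

```python
import json
import urllib.request

def build_predict_call(batch, host="http://localhost:3000"):
    # A 2-D JSON array maps to a 2-D ndarray on the server; send it to
    # the /predict endpoint that @svc.api generated.
    return urllib.request.Request(
        f"{host}/predict",
        data=json.dumps(batch).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Pass the Request to urllib.request.urlopen(...) against a running service.
```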
Benefits:
Beginner-friendly
Multi-framework support
Generates production-ready APIs
Good documentation
Use when: Want simplicity, multiple frameworks, Python-first approach
8.3.2.2. Getting Started#
Start Simple#
Flask/FastAPI for prototypes
Docker containers for basic production
Model serving framework for scale
Evaluate Before Migration#
Benchmark current performance
Identify bottlenecks
Test serving framework in staging
Measure improvements
Learn Incrementally#
Master basic features first
Add complexity as needed (batching, A/B testing, etc.)
Extensive documentation available for all frameworks
8.3.2.3. Summary#
Model serving frameworks provide:
Optimized inference runtimes
Request batching for higher throughput
Multi-model management with versioning
GPU acceleration for deep learning
Production features (monitoring, health checks)
Choose based on:
ML framework (PyTorch, TensorFlow, scikit-learn)
Performance requirements
Deployment platform (Kubernetes, cloud, on-prem)
Team expertise
Feature needs (A/B testing, explainability)
For most teams, start with simpler deployment methods and graduate to specialized serving frameworks as scale and performance requirements grow.
Documentation Links:
TorchServe: https://pytorch.org/serve/
TensorFlow Serving: https://www.tensorflow.org/tfx/guide/serving
NVIDIA Triton: https://developer.nvidia.com/nvidia-triton-inference-server
Seldon Core: https://docs.seldon.io/
BentoML: https://docs.bentoml.org/