8.3.2. Model Serving Frameworks#

A Flask application wrapping a scikit-learn model is a perfectly valid way to expose predictions through an HTTP API. At small scale, it works well. But as request volume increases, this approach runs into limitations: the Python GIL prevents true CPU parallelism, there is no built-in mechanism to batch multiple requests together for efficiency, managing multiple models and their versions requires custom code, and there is no framework-level GPU batching for deep learning models.
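
A minimal sketch of such a Flask wrapper, with a dummy model standing in for a trained scikit-learn estimator; the endpoint name and payload shape are illustrative, not a standard:

```python
# Minimal Flask prediction endpoint. DummyModel stands in for a trained
# scikit-learn estimator; route and payload shape are illustrative.
from flask import Flask, jsonify, request

class DummyModel:
    """Placeholder with a scikit-learn-style predict()."""
    def predict(self, rows):
        return [sum(row) for row in rows]  # toy "prediction"

app = Flask(__name__)
model = DummyModel()

@app.route("/predict", methods=["POST"])
def predict():
    instances = request.get_json()["instances"]
    return jsonify({"predictions": model.predict(instances)})

# app.run(port=5000)  # dev server: single process, GIL-bound, no batching
```

Every request here triggers one model call in one Python process, which is exactly the pattern that stops scaling as traffic grows.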

Specialized model serving frameworks exist to address these limitations. They provide a runtime layer that sits between your model and the network, adding batching, multi-model management, version switching, health checking, and performance telemetry—features that you would otherwise have to build yourself. They are purpose-built for inference, not for general web application development.
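
The batching feature these frameworks provide can be sketched as a loop that collects incoming requests until either a batch-size cap or a latency deadline is hit, then makes a single model call. This is a toy illustration; the names and parameters are not any framework's real API:

```python
# Toy illustration of server-side request batching: collect requests until
# either the batch is full or a deadline passes, then run one model call.
import time
from queue import Queue, Empty

def batching_loop(requests: Queue, model_fn, max_batch=8, max_delay_s=0.005):
    """Drain up to max_batch requests, waiting at most max_delay_s total."""
    batch = []
    deadline = time.monotonic() + max_delay_s
    while len(batch) < max_batch:
        timeout = deadline - time.monotonic()
        if timeout <= 0:
            break
        try:
            batch.append(requests.get(timeout=timeout))
        except Empty:
            break
    return model_fn(batch) if batch else []

q = Queue()
for x in [1, 2, 3]:
    q.put(x)
# One model invocation handles three requests instead of three invocations.
print(batching_loop(q, model_fn=lambda xs: [x * 10 for x in xs]))  # [10, 20, 30]
```

The trade-off is visible in the two knobs: a larger `max_batch` raises throughput, a longer `max_delay_s` raises tail latency.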

The trade-off is complexity. A serving framework introduces additional infrastructure, configuration formats, and operational concepts. The right decision is straightforward: use Flask or FastAPI directly for prototypes, low-traffic APIs, and internal tools; reach for a serving framework when throughput, latency, or multi-model management requirements outgrow what a simple web framework can provide.

8.3.2.1. Major Model Serving Frameworks#

TorchServe (PyTorch)#

Official: PyTorch’s production serving framework

Key Features:

  • PyTorch-native support

  • Model versioning

  • Multi-model serving

  • RESTful and gRPC APIs

  • Metrics and logging

  • A/B testing

Architecture:

Client Request
      ↓
TorchServe Frontend
      ↓
Model Workers (parallel)
├── Worker 1 (Model Instance)
├── Worker 2 (Model Instance)
└── Worker 3 (Model Instance)
      ↓
Response

Conceptual Workflow:

  1. Package Model (create .mar file):

torch-model-archiver \
  --model-name resnet18 \
  --version 1.0 \
  --model-file model.py \
  --serialized-file resnet18.pth \
  --handler image_classifier

  2. Start Server:

torchserve --start \
  --model-store model_store \
  --models resnet18=resnet18.mar

  3. Make Predictions:

curl -X POST http://localhost:8080/predictions/resnet18 \
  -T image.jpg
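
The same call can be made from Python. Only the standard library is used here, and the request is constructed without being sent, since sending it requires a running TorchServe instance on localhost:8080:

```python
# Python equivalent of the curl call above, using only the standard library.
import urllib.request

def build_prediction_request(model_name: str, image_bytes: bytes):
    url = f"http://localhost:8080/predictions/{model_name}"
    return urllib.request.Request(url, data=image_bytes, method="POST")

req = build_prediction_request("resnet18", b"<jpeg bytes>")
# To send: urllib.request.urlopen(req).read() returns a JSON response
# with the handler's predictions.
```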

Benefits:

  • Optimized for PyTorch models

  • Active development by PyTorch team

  • Good documentation

Use when: Deploying PyTorch models in production

TensorFlow Serving#

Official: TensorFlow’s production serving system

Key Features:

  • High performance (C++ backend)

  • gRPC and REST APIs

  • Model versioning and hot-swapping

  • Request batching

  • GPU acceleration

Architecture:

SavedModel Format
      ↓
TensorFlow Serving
├── Model Server (manages versions)
├── Aspired Version Policy
└── Batching Scheduler
      ↓
Predictions

Conceptual Workflow:

  1. Save Model in SavedModel Format:

# TensorFlow 2.x
model.save('my_model/')

# Creates:
# my_model/
# ├── saved_model.pb
# ├── variables/
# └── assets/

  2. Serve with Docker:

docker run -p 8501:8501 \
  --mount type=bind,source=/path/to/my_model,target=/models/my_model \
  -e MODEL_NAME=my_model \
  tensorflow/serving

  3. Make Predictions:

curl -X POST http://localhost:8501/v1/models/my_model:predict \
  -H 'Content-Type: application/json' \
  -d '{"instances": [[1.0, 2.0, 3.0, 4.0]]}'
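
The request body from the curl call above can also be built in Python. The payload is only constructed here; actually sending it requires the serving container to be listening on port 8501:

```python
# Building the TensorFlow Serving REST payload in Python.
import json

def predict_payload(instances):
    """TF Serving's REST predict API expects {"instances": [...]}."""
    return json.dumps({"instances": instances})

body = predict_payload([[1.0, 2.0, 3.0, 4.0]])
url = "http://localhost:8501/v1/models/my_model:predict"
# POSTing body to url with Content-Type: application/json returns
# {"predictions": [...]} on success.
```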

Benefits:

  • Extremely high performance

  • Mature and battle-tested

  • Rich feature set

Use when: TensorFlow models, high-performance requirements

NVIDIA Triton Inference Server#

Multi-Framework: Supports TensorFlow, PyTorch, ONNX, and more

Key Features:

  • Multiple framework support

  • Dynamic batching

  • Model ensembles

  • GPU optimization (CUDA, TensorRT)

  • Model analyzer

  • Concurrent model execution

Supported Backends:

  • TensorFlow

  • PyTorch (TorchScript)

  • ONNX Runtime

  • TensorRT (optimized)

  • Python (custom)

  • DALI (preprocessing)

Architecture:

Triton Server
├── Model Repository
│   ├── model1/ (TensorFlow)
│   ├── model2/ (PyTorch)
│   └── model3/ (ONNX)
├── Scheduler
│   ├── Dynamic Batcher
│   ├── Sequence Batcher
│   └── Ensemble Scheduler
└── Inference Backends
    ├── TensorFlow Backend
    ├── PyTorch Backend
    └── ONNX Backend

Conceptual Setup:

  1. Model Repository Structure:

model_repository/
└── my_model/
    ├── config.pbtxt
    └── 1/
        └── model.onnx

  2. Configuration (config.pbtxt):

name: "my_model"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 10 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}

  3. Run Server:

docker run --gpus all -p 8000:8000 \
  -v /path/to/model_repository:/models \
  nvcr.io/nvidia/tritonserver:latest \
  tritonserver --model-repository=/models
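
A request body for Triton's HTTP/REST inference protocol (KServe v2), matching the config.pbtxt above: one FP32 input of shape [batch, 10]. It is only constructed here; sending it requires the server container listening on port 8000:

```python
# KServe v2 inference request body for the "my_model" config above.
import json

def infer_payload(rows):
    return json.dumps({
        "inputs": [{
            "name": "input",
            "shape": [len(rows), 10],
            "datatype": "FP32",
            "data": [v for row in rows for v in row],  # flattened row-major
        }]
    })

body = infer_payload([[0.1] * 10, [0.2] * 10])
# POST to http://localhost:8000/v2/models/my_model/infer with
# Content-Type: application/json; the response carries an "outputs" list.
```

Note that the batch dimension in `shape` is what the `dynamic_batching` settings act on: Triton can merge several such single-row requests into one larger batch server-side.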

Benefits:

  • Framework-agnostic

  • Excellent GPU utilization

  • Advanced batching strategies

  • NVIDIA’s optimization expertise

Use when: Multiple frameworks, GPU inference, maximum performance

Seldon Core (Kubernetes-Native)#

Platform: ML deployment on Kubernetes

Key Features:

  • Kubernetes-native

  • Multiple ML frameworks

  • A/B testing

  • Canary deployments

  • Explainability integration

  • Outlier detection

  • Drift monitoring

Architecture:

Kubernetes Cluster
└── Seldon Deployment
    ├── Model Server (scikit-learn)
    ├── Transformer (preprocessing)
    ├── Combiner (ensemble)
    └── Router (A/B testing)

Conceptual Deployment:

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: sklearn-iris
spec:
  predictors:
  - name: default
    replicas: 3
    graph:
      name: classifier
      type: MODEL
      implementation: SKLEARN_SERVER
      modelUri: s3://my-bucket/sklearn-model
      parameters:
      - name: method
        value: predict_proba
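
Once the deployment above is running, Seldon's v1 protocol expects a "data" envelope with an "ndarray" of input rows. The payload is only constructed here; the exact endpoint path depends on your cluster's ingress setup:

```python
# Request body for Seldon Core's v1 prediction protocol.
import json

def seldon_payload(rows):
    return json.dumps({"data": {"ndarray": rows}})

body = seldon_payload([[5.1, 3.5, 1.4, 0.2]])
# Typically POSTed to
# http://<ingress>/seldon/<namespace>/sklearn-iris/api/v1.0/predictions;
# with method predict_proba, the response ndarray holds class probabilities.
```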

Benefits:

  • Deep Kubernetes integration

  • Advanced deployment strategies

  • Model explainability built-in

  • Rich ML operations features

Use when: Kubernetes deployments, advanced ML operations

BentoML#

Unified: Framework-agnostic serving platform

Key Features:

  • Supports scikit-learn, PyTorch, TensorFlow, XGBoost, etc.

  • API server generation

  • Docker containerization

  • Adaptive batching

  • Model management

Conceptual Workflow:

  1. Save Model:

import bentoml
from sklearn.ensemble import RandomForestClassifier

# Train model
model = RandomForestClassifier()
model.fit(X_train, y_train)

# Save with BentoML
bentoml.sklearn.save_model("my_model", model)

  2. Create Service:

# service.py
import bentoml
import numpy as np
from bentoml.io import NumpyNdarray

model_runner = bentoml.sklearn.get("my_model:latest").to_runner()
svc = bentoml.Service("classifier", runners=[model_runner])

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def predict(input_data):
    return model_runner.predict.run(input_data)

  3. Build and Deploy:

# Build the Bento (requires a bentofile.yaml), then the container image
bentoml build
bentoml containerize classifier:latest

# Run
docker run -p 3000:3000 classifier:latest
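
The running container can then be called over HTTP. The endpoint name matches the `predict` API function defined in service.py; the request is built but not sent, since it needs the container listening on port 3000:

```python
# Calling the containerized BentoML service from Python (standard library).
import json
import urllib.request

body = json.dumps([[5.1, 3.5, 1.4, 0.2]]).encode()
req = urllib.request.Request(
    "http://localhost:3000/predict",
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req).read() would return the JSON-encoded
# prediction array.
```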

Benefits:

  • Beginner-friendly

  • Multi-framework support

  • Generates production-ready APIs

  • Good documentation

Use when: Want simplicity, multiple frameworks, Python-first approach

8.3.2.2. Getting Started#

Start Simple#

  1. Flask/FastAPI for prototypes

  2. Docker containers for basic production

  3. Model serving framework for scale

Evaluate Before Migration#

  • Benchmark current performance

  • Identify bottlenecks

  • Test serving framework in staging

  • Measure improvements
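
A minimal latency benchmark of the kind suggested above: time repeated calls to a predict function and report median and tail latency. `predict_fn` is a stub here; point it at your real endpoint or model call when benchmarking:

```python
# Simple latency benchmark: p50 and p95 over n repeated predictions.
import time
import statistics

def benchmark(predict_fn, payload, n=200):
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        predict_fn(payload)
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    return {
        "p50_ms": statistics.median(latencies) * 1000,
        "p95_ms": latencies[int(0.95 * n)] * 1000,
    }

stats = benchmark(lambda x: sum(x), [1.0] * 100)
print(stats)
```

Running the same harness before and after a migration gives a like-for-like measure of whether the serving framework actually improved anything.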

Learn Incrementally#

  • Master basic features first

  • Add complexity as needed (batching, A/B testing, etc.)

  • Extensive documentation available for all frameworks

8.3.2.3. Summary#

Model serving frameworks provide:

  • Optimized inference runtimes

  • Request batching for higher throughput

  • Multi-model management with versioning

  • GPU acceleration for deep learning

  • Production features (monitoring, health checks)

Choose based on:

  • ML framework (PyTorch, TensorFlow, scikit-learn)

  • Performance requirements

  • Deployment platform (Kubernetes, cloud, on-prem)

  • Team expertise

  • Feature needs (A/B testing, explainability)

For most teams, start with simpler deployment methods and graduate to specialized serving frameworks as scale and performance requirements grow.
