8.3.2. Model Serving Frameworks#
A Flask application wrapping a scikit-learn model is a perfectly valid way to expose predictions through an HTTP API. At small scale, it works well. But as request volume increases, this approach runs into limitations: the Python GIL limits true parallelism, there is no built-in mechanism to batch multiple requests together for efficiency, managing multiple models and their versions requires custom code, and there is no framework-level GPU batching for deep learning models.
Specialized model serving frameworks exist to address these limitations. They provide a runtime layer that sits between your model and the network, adding batching, multi-model management, version switching, health checking, and performance telemetry—features that you would otherwise have to build yourself. They are purpose-built for inference, not for general web application development.
The trade-off is complexity. A serving framework introduces additional infrastructure, configuration formats, and operational concepts. The right decision is straightforward: use Flask or FastAPI directly for prototypes, low-traffic APIs, and internal tools; reach for a serving framework when throughput, latency, or multi-model management requirements outgrow what a simple web framework can provide.
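To make the batching gap concrete, here is a minimal, illustrative sketch (standard library only; all names are hypothetical) of the dynamic batching loop that serving frameworks implement for you: hold each incoming request briefly, then hand a whole batch to the model at once.

```python
import queue
import threading
import time

def dynamic_batcher(requests_q, handle_batch, max_batch_size=8, max_wait_s=0.005):
    # Block for the first request, then keep collecting until the batch
    # is full or the wait window closes -- this is the core trade-off a
    # serving framework tunes for you (throughput vs. added latency).
    while True:
        batch = [requests_q.get()]
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_batch_size:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(requests_q.get(timeout=remaining))
            except queue.Empty:
                break
        handle_batch(batch)

# Demo: feed 10 "requests"; they arrive faster than the wait window,
# so they are grouped into a few batches instead of 10 separate calls.
results = []
q = queue.Queue()
threading.Thread(target=dynamic_batcher, args=(q, results.append),
                 daemon=True).start()
for i in range(10):
    q.put(i)
time.sleep(0.2)
```

A real implementation would also return each result to its originating client and run the model on an accelerator; this sketch only shows the grouping logic.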
8.3.2.1. Major Model Serving Frameworks#
TorchServe (PyTorch)#
Official: PyTorch’s production serving framework
Key Features:
PyTorch-native support
Model versioning
Multi-model serving
RESTful and gRPC APIs
Metrics and logging
A/B testing
Architecture:
Client Request
↓
TorchServe Frontend
↓
Model Workers (parallel)
├── Worker 1 (Model Instance)
├── Worker 2 (Model Instance)
└── Worker 3 (Model Instance)
↓
Response
Conceptual Workflow:
Package Model (create .mar file):
torch-model-archiver \
--model-name resnet18 \
--version 1.0 \
--model-file model.py \
--serialized-file resnet18.pth \
--handler image_classifier
Start Server:
torchserve --start \
--model-store model_store \
--models resnet18=resnet18.mar
Make Predictions:
curl -X POST http://localhost:8080/predictions/resnet18 \
-T image.jpg
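The same request can be issued from Python. A small sketch using only the standard library; the endpoint layout and model name come from the steps above, and `image.jpg` is a placeholder:

```python
import urllib.request

def prediction_url(model_name, host="http://localhost:8080"):
    # TorchServe serves each registered model at /predictions/<model-name>
    return f"{host}/predictions/{model_name}"

def predict_image(image_path, model_name="resnet18"):
    # POST raw image bytes, mirroring `curl -T image.jpg` above.
    # Requires a running TorchServe instance.
    with open(image_path, "rb") as f:
        req = urllib.request.Request(
            prediction_url(model_name), data=f.read(), method="POST")
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()
```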
Benefits:
Optimized for PyTorch models
Active development by PyTorch team
Good documentation
Use when: Deploying PyTorch models in production
TensorFlow Serving#
Official: TensorFlow’s production serving system
Key Features:
High performance (C++ backend)
gRPC and REST APIs
Model versioning and hot-swapping
Request batching
GPU acceleration
Architecture:
SavedModel Format
↓
TensorFlow Serving
├── Model Server (manages versions)
├── Aspired Version Policy
└── Batching Scheduler
↓
Predictions
Conceptual Workflow:
Save Model in SavedModel Format:
# TensorFlow 2.x
model.save('my_model/')
# Creates:
# my_model/
# ├── saved_model.pb
# ├── variables/
# └── assets/
Serve with Docker:
docker run -p 8501:8501 \
--mount type=bind,source=/path/to/my_model,target=/models/my_model \
-e MODEL_NAME=my_model \
tensorflow/serving
Make Predictions:
curl -X POST http://localhost:8501/v1/models/my_model:predict \
-H 'Content-Type: application/json' \
-d '{"instances": [[1.0, 2.0, 3.0, 4.0]]}'
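The request shape matters more than the transport: TensorFlow Serving's REST predict API expects a JSON object with an `instances` list, one entry per input example, and responds with a matching `predictions` list. A sketch that builds the same request as the curl call above (host and port are taken from the Docker command):

```python
import json

def build_predict_request(instances, model_name="my_model",
                          host="http://localhost:8501"):
    # Each entry in `instances` is one input example; the server runs
    # them through the model together and returns {"predictions": [...]}.
    url = f"{host}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances})
    return url, body
```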
Benefits:
Extremely high performance
Mature and battle-tested
Rich feature set
Use when: TensorFlow models, high-performance requirements
NVIDIA Triton Inference Server#
Multi-Framework: Supports TensorFlow, PyTorch, ONNX, and more
Key Features:
Multiple framework support
Dynamic batching
Model ensembles
GPU optimization (CUDA, TensorRT)
Model analyzer
Concurrent model execution
Supported Backends:
TensorFlow
PyTorch (TorchScript)
ONNX Runtime
TensorRT (optimized)
Python (custom)
DALI (preprocessing)
Architecture:
Triton Server
├── Model Repository
│ ├── model1/ (TensorFlow)
│ ├── model2/ (PyTorch)
│ └── model3/ (ONNX)
├── Scheduler
│ ├── Dynamic Batcher
│ ├── Sequence Batcher
│ └── Ensemble Scheduler
└── Inference Backends
├── TensorFlow Backend
├── PyTorch Backend
└── ONNX Backend
Conceptual Setup:
Model Repository Structure:
model_repository/
└── my_model/
    ├── config.pbtxt
    └── 1/
        └── model.onnx
Configuration (config.pbtxt):
name: "my_model"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 10 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
Run Server:
docker run --gpus all -p 8000:8000 \
-v /path/to/model_repository:/models \
nvcr.io/nvidia/tritonserver:latest \
tritonserver --model-repository=/models
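Triton ships an official `tritonclient` Python package; as an illustration of the wire format it speaks, here is a sketch that builds a KServe-v2-style HTTP inference request by hand, matching the config.pbtxt above (one FP32 input named "input" with dims [10]; host and port are assumptions from the Docker command):

```python
import json

def build_triton_request(values, model_name="my_model",
                         host="http://localhost:8000"):
    # v2 protocol: POST /v2/models/<name>/infer with named, typed,
    # shaped inputs. Triton's dynamic batcher may merge concurrent
    # requests like this into one batch on the server side.
    url = f"{host}/v2/models/{model_name}/infer"
    body = json.dumps({
        "inputs": [{
            "name": "input",
            "shape": [1, len(values)],
            "datatype": "FP32",
            "data": values,   # flattened tensor data
        }]
    })
    return url, body
```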
Benefits:
Framework-agnostic
Excellent GPU utilization
Advanced batching strategies
NVIDIA’s optimization expertise
Use when: Multiple frameworks, GPU inference, maximum performance
Seldon Core (Kubernetes-Native)#
Platform: ML deployment on Kubernetes
Key Features:
Kubernetes-native
Multiple ML frameworks
A/B testing
Canary deployments
Explainability integration
Outlier detection
Drift monitoring
Architecture:
Kubernetes Cluster
└── Seldon Deployment
├── Model Server (scikit-learn)
├── Transformer (preprocessing)
├── Combiner (ensemble)
└── Router (A/B testing)
Conceptual Deployment:
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: sklearn-iris
spec:
  predictors:
    - name: default
      replicas: 3
      graph:
        name: classifier
        type: MODEL
        implementation: SKLEARN_SERVER
        modelUri: s3://my-bucket/sklearn-model
        parameters:
          - name: method
            value: predict_proba
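Once deployed, clients speak Seldon's v1 prediction protocol, which wraps inputs in a `data.ndarray` field (the response mirrors the same shape). A sketch of the request body; the ingress path, typically `/seldon/<namespace>/sklearn-iris/api/v1.0/predictions`, depends on your cluster setup:

```python
import json

def build_seldon_request(rows):
    # One inner list per example; for the iris model above, each row
    # holds the four feature values.
    return json.dumps({"data": {"ndarray": rows}})
```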
Benefits:
Deep Kubernetes integration
Advanced deployment strategies
Model explainability built-in
Rich ML operations features
Use when: Kubernetes deployments, advanced ML operations
BentoML#
Unified: Framework-agnostic serving platform
Key Features:
Supports scikit-learn, PyTorch, TensorFlow, XGBoost, etc.
API server generation
Docker containerization
Adaptive batching
Model management
Conceptual Workflow:
Save Model:
import bentoml
from sklearn.ensemble import RandomForestClassifier
# Train model
model = RandomForestClassifier()
model.fit(X_train, y_train)
# Save with BentoML
bentoml.sklearn.save_model("my_model", model)
Create Service:
# service.py
import bentoml
import numpy as np
from bentoml.io import NumpyNdarray
model_runner = bentoml.sklearn.get("my_model:latest").to_runner()
svc = bentoml.Service("classifier", runners=[model_runner])
@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def predict(input_data):
    return model_runner.predict.run(input_data)
Build and Deploy:
# Build container
bentoml containerize classifier:latest
# Run
docker run -p 3000:3000 classifier:latest
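Calling the service is plain HTTP: the `NumpyNdarray` IO descriptor accepts a JSON array on the generated `/predict` endpoint. A sketch that prepares the call with the standard library (port 3000 comes from the Docker command above):

```python
import json
import urllib.request

def build_predict_call(batch, host="http://localhost:3000"):
    # A 2-D JSON array maps to a 2-D ndarray on the server; send it to
    # the /predict endpoint that @svc.api generated.
    return urllib.request.Request(
        f"{host}/predict",
        data=json.dumps(batch).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Pass the Request to urllib.request.urlopen(...) against a running service.
```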
Benefits:
Beginner-friendly
Multi-framework support
Generates production-ready APIs
Good documentation
Use when: Want simplicity, multiple frameworks, Python-first approach
8.3.2.2. Getting Started#
Start Simple#
Flask/FastAPI for prototypes
Docker containers for basic production
Model serving framework for scale
Evaluate Before Migration#
Benchmark current performance
Identify bottlenecks
Test serving framework in staging
Measure improvements
Learn Incrementally#
Master basic features first
Add complexity as needed (batching, A/B testing, etc.)
Extensive documentation available for all frameworks
8.3.2.3. Summary#
Model serving frameworks provide:
Optimized inference runtimes
Request batching for higher throughput
Multi-model management with versioning
GPU acceleration for deep learning
Production features (monitoring, health checks)
Choose based on:
ML framework (PyTorch, TensorFlow, scikit-learn)
Performance requirements
Deployment platform (Kubernetes, cloud, on-prem)
Team expertise
Feature needs (A/B testing, explainability)
For most teams, start with simpler deployment methods and graduate to specialized serving frameworks as scale and performance requirements grow.
Documentation Links:
TorchServe: https://pytorch.org/serve/
TensorFlow Serving: https://www.tensorflow.org/tfx/guide/serving
NVIDIA Triton: https://developer.nvidia.com/nvidia-triton-inference-server
Seldon Core: https://docs.seldon.io/
BentoML: https://docs.bentoml.org/