8.3. Deployment Landscape
With a serialized model and a container image in hand, the remaining question is where and how to run it. The options span a wide range: a single cloud virtual machine running your Docker container, a managed container service that handles scaling automatically, a Kubernetes cluster orchestrating dozens of model replicas, or a purpose-built model serving framework optimized for high-throughput inference. At the other extreme, some models are deployed directly to edge devices that run inference locally with no network connection at all.
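Whatever the hosting choice, the unit being deployed is usually the same: a process that loads a model and answers prediction requests over HTTP. As a rough illustration (not from this chapter's code), here is a minimal sketch using only the Python standard library; the `predict` function, its weights, and the port are all hypothetical stand-ins for a real serialized model and serving framework.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    """Stand-in for model inference: a hypothetical fixed linear scorer.
    A real service would instead load a serialized model at startup."""
    weights = [0.4, 0.6]  # illustrative weights, not a trained model
    return sum(w * x for w, x in zip(weights, features))

class InferenceHandler(BaseHTTPRequestHandler):
    """Answers POST requests like {"features": [1.0, 2.0]} with a JSON score."""
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"score": predict(payload["features"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

def serve(port=8000):
    # Inside a container this port would be published, e.g. `docker run -p 8000:8000 ...`
    HTTPServer(("0.0.0.0", port), InferenceHandler).serve_forever()
```

Every option in the list above is, in effect, a different answer to how many copies of this process run, who restarts them, and how traffic reaches them.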
Choosing the right approach depends on several concrete factors: how many requests per second you expect, what latency your application can tolerate, how frequently the model will be updated, and what level of operational complexity your team can maintain. A small internal tool and a high-traffic consumer product call for very different deployment strategies.
This section surveys the major options: cloud deployment patterns, specialized model serving frameworks, and the cross-cutting considerations (latency, scaling, monitoring, reliability) that determine whether a deployed model holds up in practice.