Abstract
The use of machine learning (ML) inference across applications is growing rapidly. ML inference services interact with users directly, so they must respond quickly and accurately. These services also face dynamic request workloads, which force changes in their computing resources. Failing to right-size those resources results in either violations of latency service level objectives (SLOs) or wasted computing resources. Adapting to dynamic workloads while balancing accuracy, latency, and resource cost is challenging. To address this, we propose InfAdapter, which proactively selects a set of ML model variants and their resource allocations to meet the latency SLO while maximizing an objective function composed of accuracy and cost. InfAdapter reduces SLO violations and cost by up to 65% and 33%, respectively, compared to a popular industry autoscaler (Kubernetes Vertical Pod Autoscaler).
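To make the selection problem in the abstract concrete, here is a minimal brute-force sketch. It is not InfAdapter's actual solver: the variant names, accuracy values, per-core throughput figures, the `cost_weight` parameter, and the capacity-proportional traffic split are all illustrative assumptions, and the latency SLO is approximated by requiring enough SLO-compliant capacity to cover the workload.

```python
from itertools import product

# Hypothetical variants: (name, accuracy in %, requests/sec one CPU core can
# serve within the latency SLO). All numbers are made up for illustration.
VARIANTS = [
    ("small", 70.0, 40.0),
    ("medium", 76.0, 20.0),
    ("large", 78.0, 8.0),
]


def select_variants(variants, workload_rps, max_cores, cost_weight):
    """Brute-force search over per-variant core counts.

    Returns (score, allocation), where allocation maps variant name -> cores,
    or None if no allocation within max_cores can serve the workload.
    """
    best = None
    for cores in product(range(max_cores + 1), repeat=len(variants)):
        total_cores = sum(cores)
        if total_cores == 0 or total_cores > max_cores:
            continue
        capacities = [c * v[2] for c, v in zip(cores, variants)]
        total_capacity = sum(capacities)
        if total_capacity < workload_rps:
            continue  # cannot serve the workload within the SLO
        # Assumption: traffic is split in proportion to each variant's capacity.
        avg_accuracy = sum(
            cap * v[1] for cap, v in zip(capacities, variants)
        ) / total_capacity
        # Objective: reward accuracy, penalize resource cost (cores used).
        score = avg_accuracy - cost_weight * total_cores
        if best is None or score > best[0]:
            allocation = {v[0]: c for c, v in zip(cores, variants) if c > 0}
            best = (score, allocation)
    return best


if __name__ == "__main__":
    print(select_variants(VARIANTS, workload_rps=100, max_cores=8, cost_weight=0.5))
```

Raising `cost_weight` pushes the search toward fewer cores and cheaper variants, while lowering it favors the more accurate ones; exhaustive enumeration is only practical for a handful of variants and cores.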
-
Create a Kubernetes cluster
- Create a K8s cluster using MicroK8s (see the MicroK8s "Get started" guide)
- Add another node to the K8s cluster (see the MicroK8s "Create a MicroK8s cluster" guide)
-
Set up Prometheus monitoring inside the cluster (see "Setup Monitoring")
-
Create a namespace called mehran:
kubectl create ns mehran
-
Build ResNet models for TensorFlow Serving: instructions here
-
Configure an NFS server to store and serve the models: instructions here
-
Export a cluster node's IP:
export CLUSTER_NODE_IP=NODE_IP
-
Export NFS server IP:
export NFS_SERVER=NFS_SERVER_IP
(If not set, the value of CLUSTER_NODE_IP above is used.)
-
Install Python requirements:
pip install -e .
-
Cache Docker images (run the script and wait for the "OK" message):
python auto_tuner/cache_images.py
...
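The NFS_SERVER fallback described in the setup steps above can be sketched as follows. This is only an illustration of the stated behavior, not the repository's actual code: the helper name `resolve_nfs_server` and the error message are hypothetical.

```python
import os


def resolve_nfs_server(env=None):
    """Return the NFS server address, falling back to CLUSTER_NODE_IP.

    `env` defaults to the process environment; passing a dict makes the
    helper easy to try out without exporting anything.
    """
    env = os.environ if env is None else env
    cluster_ip = env.get("CLUSTER_NODE_IP")
    # An unset or empty NFS_SERVER falls back to the cluster node's IP.
    nfs_server = env.get("NFS_SERVER") or cluster_ip
    if not nfs_server:
        raise RuntimeError("Export CLUSTER_NODE_IP (and optionally NFS_SERVER) first")
    return nfs_server
```

Calling `resolve_nfs_server()` with no arguments reads the variables exported in the shell, so it can serve as a quick check that the environment is set up before running the remaining steps.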
- Python
- Kubernetes
- TensorFlow Serving
- Prometheus
Please use the following citation if you use this framework:
@inproceedings{salmani2023reconciling,
title={Reconciling High Accuracy, Cost-Efficiency, and Low Latency of Inference Serving Systems},
author={Salmani, Mehran and Ghafouri, Saeid and Sanaee, Alireza and Razavi, Kamran and M{\"u}hlh{\"a}user, Max and Doyle, Joseph and Jamshidi, Pooyan and Sharifi, Mohsen},
booktitle={Proceedings of the 3rd Workshop on Machine Learning and Systems},
pages={78--86},
year={2023}
}