diff --git a/docs/kepler_model_server/get_started.md b/docs/kepler_model_server/get_started.md
index a20693a6..083af84c 100644
--- a/docs/kepler_model_server/get_started.md
+++ b/docs/kepler_model_server/get_started.md
@@ -1,8 +1,10 @@
+
 # Get Started with Kepler Model Server
 
 Model server project facilitates tools for power model training, exporting, serving, and utilizing based on Kepler-exporting energy-related metrics. Check the following steps to get started with the project.
 
 ## Step 1: Learn about Pipeline
+
 The first step is to understand about power model building concept from [training pipeline.](./pipeline.md)
 
 ## Step 2: Learn how to use the power model
@@ -11,9 +13,9 @@ The first step is to understand about power model building concept from [trainin
 ---
 
-There are two ways to use the models regarding the model format. If the model format can be processed directly inside the Kepler exporter such as Linear Regression weight in `json` format. There is no extra cofiguration. 
+There are two ways to use the models, depending on the model format. If the model format can be processed directly inside the Kepler exporter, such as Linear Regression weights in `json` format, there is no extra configuration.
 
-However, if the model is in the general format archived in `zip`, It is needed to enable the estimator sidecar via environment variable or Kepler config map. 
+However, if the model is in the general format archived in `zip`, the estimator sidecar must be enabled via an environment variable or the Kepler config map.
 
 ```bash
 export NODE_COMPONENTS_ESTIMATOR=true
 ```
@@ -35,10 +37,11 @@ data:
 ### Select power model
 
 ---
-There are two ways to obtain power model: static and dynamic. 
+There are two ways to obtain a power model: static and dynamic.
 
 #### Static configuration
-A static way is to download the model directly from `INIT_URL`. It can be set via environment variable directly or via `kepler-cfm` Kepler config map. For example,
+
+A static way is to download the model directly from `INIT_URL`. It can be set via the environment variable directly or via the `kepler-cfm` Kepler config map. For example,
 
 ```bash
 export NODE_COMPONENTS_INIT_URL= < Static URL >
@@ -57,9 +60,10 @@ data:
    NODE_COMPONENTS_INIT_URL= < Static URL >
 ```
 
-The static URL from standard pipeline v0.6 (std_v0.6) are listed [here](https://github.com/sustainable-computing-io/kepler-model-db/tree/main/models/v0.6/nx12).
+The static URLs from the provided pipeline v0.7 are listed [here](https://github.com/sustainable-computing-io/kepler-model-db/tree/main/models/v0.7).
 
 #### Dynamic via server API
+
 A dynamic way is to enable the model server to auto select the power model which has the best accuracy and supported the running cluster environment. Similarly, It can be set via the environment variable or set it via Kepler config map.
 
 ```bash
@@ -85,4 +89,4 @@ See more in [Kepler Power Estimation Deployment](./power_estimation.md)
 
 As you may be aware, it's essential to tailor power models to specific machine types rather than relying on a single generic model. We eagerly welcome contributions from the community to help build alternative power models on your machine through the model server project.
 
-For detailed guidance on model training, please refer to our model training guidelines [here](https://github.com/sustainable-computing-io/kepler-model-server/tree/main/model_training).
\ No newline at end of file
+For detailed guidance on model training, please refer to our model training guidelines [here](https://github.com/sustainable-computing-io/kepler-model-server/tree/main/model_training).
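> Reviewer note: the static-configuration flow above boils down to "fetch the archive at `INIT_URL` and unpack it". The following is an illustrative sketch only, not code from the Kepler estimator; the function name `fetch_model` and its signature are assumptions for demonstration.

```python
# Hypothetical helper (not part of the Kepler codebase): download the model
# archive referenced by a static INIT_URL and extract it locally, roughly
# what an estimator would do at startup for a `zip`-archived model.
import io
import urllib.request
import zipfile


def fetch_model(init_url: str, dest_dir: str) -> list:
    """Download the model archive from init_url and extract it into dest_dir.

    Returns the list of member file names found in the archive.
    """
    with urllib.request.urlopen(init_url) as resp:
        archive = zipfile.ZipFile(io.BytesIO(resp.read()))
    archive.extractall(dest_dir)
    return archive.namelist()
```

`urllib.request.urlopen` also accepts `file://` URLs, which makes the sketch easy to exercise locally before pointing it at a real model-db URL.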
diff --git a/docs/kepler_model_server/node_profile.md b/docs/kepler_model_server/node_profile.md
index 2c57a52a..ca8b2d41 100644
--- a/docs/kepler_model_server/node_profile.md
+++ b/docs/kepler_model_server/node_profile.md
@@ -2,4 +2,12 @@
 
 We form a group of machines (nodes) called [node type](./pipeline.md#node-spec) based on processor model, the number of cores, the number of chips, memory size, and maximum CPU frequency. When collecting the data from the bare metal machine, these attributes are automatically extracted and kept as a machine spec in json format.
 
-A power model will be built per node type. For each group of node type, we make a profile composing of background power when the resource usage is almost constant without user workload, minimum, maximum power for each power components (e.g., core, uncore, dram, package, platform), and normalization scaler (i.e., MinMaxScaler), standardization scaler (i.e., StandardScaler) for each [feature group](./pipeline.md#available-metrics).
\ No newline at end of file
+A power model will be built per node type. For each node type group, we make a profile composed of the background power when resource usage is almost constant without a user workload; the minimum and maximum power for each power component (e.g., core, uncore, dram, package, platform); and a normalization scaler (i.e., MinMaxScaler) and standardization scaler (i.e., StandardScaler) for each [feature group](./pipeline.md#available-metrics).
+
+Node specification is composed of:
+
+- processor *- CPU processor model name*
+- cores *- Number of CPU cores*
+- chips *- Number of chips*
+- memory_gb *- Memory size in GB*
+- cpu_freq_mhz *- Maximum CPU frequency in MHz*
diff --git a/docs/kepler_model_server/pipeline.md b/docs/kepler_model_server/pipeline.md
index 25395a3f..514e67ed 100644
--- a/docs/kepler_model_server/pipeline.md
+++ b/docs/kepler_model_server/pipeline.md
@@ -1,19 +1,27 @@
 # Training Pipeline
 
-Model server can provide various power models for different context and learning methods. Training pipeline is an abstract of power model training that applies a set of learning methods to a different combination of energy sources, power isolation methods, available energy-related metrics. 
+Model server can provide various power models for different contexts and learning methods. The training pipeline is an abstraction of power model training that applies a set of learning methods to different combinations of energy sources, power isolation methods, and available energy-related metrics.
 
 ## Pipeline
 
-`pipeline` is composed of three steps, `extract`, `isolate`, and `train`, as shown below. Kepler exports energy-related metrics as Prometheus counter, which provides accumulated number over time. The `extract` step is to convert the counter metrics to gauge metrics, similarly to the Prometheus `rate()` function, giving per-second values. The extract step also clean up the data separately for each `feature group`. The power consumption retrieved from the Prometheus query is the measured power which is composed of the power portion which is varied by the workload called dynamic power and the power portion which is consumed even if in the idling state called idle power. The `isolate` step is to calculate the idle power and isolate the dynamic power consumption of each `energy source`. The `train` step is to apply each `trainer` to create multiple choices of power models based on the preprocessed data.
-> We have a roadmap to apply a pipeline to build power models separately for each node/machine type. Find more in [Node Type](#node-type) section.
+`pipeline` is composed of three steps, `extract`, `isolate`, and `train`, as shown below. Kepler exports energy-related metrics as Prometheus counters, which provide accumulated values over time.
+
+The `extract` step converts the counter metrics to gauge metrics, similar to the Prometheus `rate()` function, giving per-second values. The extract step also cleans up the data separately for each `feature group`.
+
+The power consumption retrieved from the Prometheus query is the measured power, which is composed of the portion that varies with the workload, called dynamic power, and the portion that is consumed even in the idle state, called idle power.
+
+The `isolate` step calculates the idle power and isolates the dynamic power consumption for each `energy source`.
+
+The `train` step applies each `trainer` to create multiple choices of power models based on the preprocessed data.
+> We have a roadmap to apply a pipeline to build power models separately for each node/machine type. Find more in the [Node Type](#node-type) section.
 
-![](../fig/pipeline_plot.png)
+![Pipeline Plot](../fig/pipeline_plot.png)
 
 - Learn more about `energy source` from [Energy source](#energy-source) section.
-- Learn more about `feature group` from [Feature groups](#feature-groups) section.
+- Learn more about `feature group` from [Feature group](#feature-group) section.
 - Learn more about the `isolate` step and corresponding concepts of `AbsPower`, and `DynPower` power models from [Power isolation](#power-isolation) section.
-- Check available `trainer` in [Trainer](#learning-methods) section.
+- Check available `trainer` in [Trainer](#trainer) section.
 
 ## Energy source
 
@@ -23,7 +31,6 @@ Energy/power source|Energy/power components
 ---|---
 [rapl](../design/kepler-energy-sources.md#rapl---running-average-power-limit)|package, core, uncore, dram
 [acpi](../design/kepler-energy-sources.md#using-kernel-driver-xgene-hwmon)|platform
-||
 
 ## Feature group
 
@@ -37,39 +44,47 @@ BPFOnly|BPF_FEATURES|[BPF](../design/metrics.md#base-metric)
 IRQOnly|IRQ_FEATURES|[IRQ](../design/metrics.md#irq-metrics)
 AcceleratorOnly|ACCELERATOR_FEATURES|[Accelerator](../design/metrics.md#Accelerator-metrics)
 CounterIRQCombined|COUNTER_FEATURES, IRQ_FEATURES|BPF and Hardware Counter
-Basic|COUNTER_FEATURES, CGROUP_FEATURES, BPF_FEATURES, KUBELET_FEATURES|All except IRQ and node information
-WorkloadOnly|COUNTER_FEATURES, CGROUP_FEATURES, BPF_FEATURES, IRQ_FEATURES, KUBELET_FEATURES, ACCELERATOR_FEATURES|All except node information
+Basic|COUNTER_FEATURES, CGROUP_FEATURES, BPF_FEATURES|All except IRQ and node information
+WorkloadOnly|COUNTER_FEATURES, CGROUP_FEATURES, BPF_FEATURES, IRQ_FEATURES, ACCELERATOR_FEATURES|All except node information
 Full|WORKLOAD_FEATURES, SYSTEM_FEATURES|All
-||
 
 Node information refers to value from [kepler_node_info](../design/metrics.md#kepler-metrics-for-node-information) metric.
 
 ## Power isolation
 
-The power consumption retrieved from the Prometheus query is the absolute power, which is the sum of idle and dynamic power (where idle represents the system at rest, dynamic is the incremental power with resource utilization, and absolute is idle + dynamic). Additionally, this power is also the total power consumption of all process, including the users' workload, background and OS processes. The `isolate` step applies a mechanism to separate idle power from absolute power, resulting in dynamic power It also covers an implementation to separate the dynamic power consumed by background and OS processes (referred to as `system_processes`). It's important to note that both the idle and dynamic `system_processes` power are higher than zero, even when the metric utilization of the users' workload is zero.
+The power consumption retrieved from the Prometheus query is the absolute power, which is the sum of idle and dynamic power (where idle represents the system at rest, dynamic is the incremental power with resource utilization, and absolute is idle + dynamic). Additionally, this power is also the total power consumption of all processes, including the users' workload, background, and OS processes.
+
+The `isolate` step applies a mechanism to separate idle power from absolute power, resulting in dynamic power. It also covers an implementation to separate the dynamic power consumed by background and OS processes (referred to as `system_processes`).
+
+It's important to note that both the idle and dynamic `system_processes` power are higher than zero, even when the metric utilization of the users' workload is zero.
 
 > We have a roadmap to identify and isolate a constant power portion which is significantly increased at a specific resource utilization called `activation power` to fully isolate all constant power consumption from the dynamic power.
 
 We refer to models trained using the isolate step as `DynPower` models. Meanwhile, models trained without the isolate step are called `AbsPower` models. Currently, the `DynPower` model does not include idle power information, but we plan to incorporate it in the future.
 
-There are two common available `isolators`: *ProfileIsolator* and *MinIdleIsolator*. 
+There are two commonly available `isolators`: *ProfileIsolator* and *MinIdleIsolator*.
 
 *ProfileIsolator* relies on collecting data (e.g., power and resource utilization) for a specific period without running any user workload (referred to as profile data). This isolation mechanism also eliminates the resource utilization of `system_processes` from the data used to train the model.
 
-On the other hand, *MinIdleIsolator* identifies the minimum power consumption among all samples in the training data, assuming that this minimum power consumption represents both the idle power and `system_processes` power consumption. While we should also remove the minimal resource utilization from the data used to train the model, this isolation mechanism includes the resource utilization by `system_processes` in the training data. However, we plan to remove it in the future.
+On the other hand, *MinIdleIsolator* identifies the minimum power consumption among all samples in the training data, assuming that this minimum power consumption represents both the idle power and `system_processes` power consumption.
+
+While we should also remove the minimal resource utilization from the data used to train the model, this isolation mechanism includes the resource utilization by `system_processes` in the training data. However, we plan to remove it in the future.
 
 If the `profile data` that matches a given `node_type` exist, the pipeline will use the *ProfileIsolator* to preprocess the training data. Otherwise, the the pipeline will applied another isolation mechanism, such as the *MinIdleIsolator*. (check how profiles are generated [here](./node_profile.md))
 
-> The choice between using the `DynPower` or `AbsPower` model is still under investigation. In some cases, DynPower exhibits better accuracy than `AbsPower`. However, we currently utilize the `AbsPower` model to estimate node power for Platform, CPU and DRAM components, as the `DynPower` model lacks idle power information.
+### Discussion
+
+The choice between using the `DynPower` or `AbsPower` model is still under investigation. In some cases, `DynPower` exhibits better accuracy than `AbsPower`. However, we currently utilize the `AbsPower` model to estimate node power for Platform, CPU, and DRAM components, as the `DynPower` model lacks idle power information.
 
-> It's worth mentioning that exposing idle power on a VM in a public cloud environment is not possible. This is because the host's idle power must be distributed among all running VMs on the host, and it's impossible to determine the number of VMs running on the host in a public cloud environment. Therefore, we can only expose idle power if there is only one VM running on the node (for a very specific scenario), or if the power model is being used in Bare Metal environments.
+It's worth mentioning that exposing idle power on a VM in a public cloud environment is not possible. This is because the host's idle power must be distributed among all running VMs on the host, and it's impossible to determine the number of VMs running on the host in a public cloud environment.
+Therefore, we can only expose idle power if there is only one VM running on the node (for a very specific scenario), or if the power model is being used in Bare Metal environments.
 
 ## Trainer
 
-`trainer` is an abstraction to define the learning method applies to each feature group with each given power labeling source. 
+`trainer` is an abstraction that defines the learning method applied to each feature group with each given power labeling source.
 
 Available trainer (v0.6):
 
@@ -82,4 +97,4 @@ Available trainer (v0.6):
 
 ## Node type
 
-Kepler forms multiple groups of machines (nodes) based on its benchmark performance and trains a model separately for each group. The identified group is exported as `node type`. 
+Kepler forms multiple groups of machines (nodes) based on their benchmark performance and trains a model separately for each group. The identified group is exported as `node type`.
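> Reviewer note: the *MinIdleIsolator* paragraph above describes a simple computation that is easy to misread in prose. The following is an illustrative sketch only, not the actual pipeline code; the function name `min_idle_isolate` is an assumption for demonstration.

```python
# Hypothetical sketch of the MinIdleIsolator idea: treat the minimum power
# observed across all training samples as the idle power, and subtract it
# from every sample to obtain the dynamic power.
def min_idle_isolate(power_samples):
    """Return (estimated idle power, dynamic power per sample)."""
    idle = min(power_samples)  # assumed to cover idle + system_processes power
    dynamic = [p - idle for p in power_samples]
    return idle, dynamic
```

For example, samples of `[100.0, 120.0, 150.0]` watts would yield an idle estimate of `100.0` and dynamic power of `[0.0, 20.0, 50.0]`.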
diff --git a/docs/kepler_model_server/power_estimation.md b/docs/kepler_model_server/power_estimation.md
index b699fe0f..a3733095 100644
--- a/docs/kepler_model_server/power_estimation.md
+++ b/docs/kepler_model_server/power_estimation.md
@@ -1,9 +1,10 @@
 # Kepler Power Estimation Deployment
 
-In Kepler, we also provide a power estimation solution from the resource usages in the system that there is no power measuring tool installed or supported. 
+In Kepler, we also provide a power estimation solution based on resource usage for systems where no power measuring tool is installed or supported.
 
 There are two alternatives of estimators.
 
 ## Estimators
+
 - **Local Linear Regression Estimator**: This estimator estimates power using the trained weights multiplied by normalized value of usage metrics (Linear Regression Model).
 - **General Estimator Sidecar**: This estimator transforms the usage metrics and applies with the trained models which can be any regression models from scikit-learn library or any neuron networks from Keras (TensorFlow). To use this estimator, the Kepler estimator needs to be enabled.
@@ -12,37 +13,38 @@ On top of that, the trained models as well as weights can be updated periodicall
 
 ## Deployment Scenarios
 
-**Minimum Deployment**
+### Minimum Deployment
 
-The minimum deployment is to use local linear regression estimator in Kepler main container with only offline-trained model weights. 
+The minimum deployment is to use the local linear regression estimator in the Kepler main container with only offline-trained model weights.
 
-![](../fig/minimum_deploy.png)
+![Minimum Deployment](../fig/minimum_deploy.png)
 
-**Deployment with General Estimator Sidecar**
+### Deployment with General Estimator Sidecar
 
-To enable general estimator for power inference, the estimator sidecar can be deployed as shown in the following figure. 
+To enable the general estimator for power inference, the estimator sidecar can be deployed as shown in the following figure.
 
 The connection between two containers is a unix domain socket which is lightweight and fast. Unlike the local estimator, the general estimator sidecar is instrumented with several inference-supportive libraries and dependencies. This additional overhead must be tradeoff to an increasing estimation accuracy expected from flexible choices of models.
 
-![](../fig/disable_model_server.png)
+![Estimator Integration](../fig/disable_model_server.png)
 
-**Minimum deployment connecting to Kepler Model Server**
+### Minimum deployment connecting to Kepler Model Server
 
 To get the updated weights which is expected to provide better estimation accuracy, Kepler may connect to remote Kepler Model Server that performs online training using data from the system with the power measuring tool as below.
 
-![](../fig/disable_estimator_sidecar.png)
+![Model Server Integration](../fig/disable_estimator_sidecar.png)
 
-**Full deployment**
+### Full deployment
 
-The following figure shows the deployment that Kepler General Estimator is enabled and it is also connecting to remote Kepler Model Server. 
+The following figure shows the deployment where the Kepler General Estimator is enabled and also connects to a remote Kepler Model Server. The Kepler General Estimator sidecar can update the model from the Kepler Model Server on the fly, which is expected to give the most accurate model.
 
-![](../fig/full_integration.png)
-
+![Full Integration](../fig/full_integration.png)
 
-## Power model accuracy report
+## Provided power models on Kepler Model DB
 
-version|machine ID|pipeline|feature group|component power source|total power source|Local LR MAE in watts (Node Components/Total)|Estimator Sidecar MAE in watts (Node Components/Total)|Reference Power Range in watts
----|---|---|---|---|---|---|---|---
-[0.6](https://github.com/sustainable-computing-io/kepler-model-db/tree/main/models/v0.6)|[nx12](https://github.com/sustainable-computing-io/kepler-model-db/tree/main/models/v0.6/nx12)|[std_v0.6](https://github.com/sustainable-computing-io/kepler-model-db/blob/main/models/v0.6/.doc/std_v0.6.md)|BPFOnly|rapl|acpi|66.32/93.57|34.40/49.52|505.79
+version|power data source|pipeline|available energy sources|error report
+---|---|---|---|---
+[0.6](https://github.com/sustainable-computing-io/kepler-model-db/tree/main/models/v0.6)|[nx12](https://github.com/sustainable-computing-io/kepler-model-db/tree/main/models/v0.6/nx12)|[std_v0.6](https://github.com/sustainable-computing-io/kepler-model-db/blob/main/models/v0.6/.doc/std_v0.6.md)|rapl,acpi|[Link](https://github.com/sustainable-computing-io/kepler-model-db/blob/main/models/v0.6/nx12/README.md)
+[0.7](https://github.com/sustainable-computing-io/kepler-model-db/tree/main/models/v0.7)|[SPECpower](https://www.spec.org/power_ssj2008/)|[specpower](https://github.com/sustainable-computing-io/kepler-model-db/blob/main/models/v0.7/.doc/specpower.md)|acpi|[Link](https://github.com/sustainable-computing-io/kepler-model-db/tree/main/models/v0.7/specpower)
+[0.7](https://github.com/sustainable-computing-io/kepler-model-db/tree/main/models/v0.7)|[Training Playbook](https://github.com/sustainable-computing-io/kepler-model-training-playbook)|[ec2](https://github.com/sustainable-computing-io/kepler-model-db/blob/main/models/v0.7/.doc/ec2.md)|intel_rapl|[Link](https://github.com/sustainable-computing-io/kepler-model-db/tree/main/models/v0.7/ec2)
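> Reviewer note: the "Local Linear Regression Estimator" bullet in power_estimation.md describes power as trained weights multiplied by normalized usage metrics. The following is an illustrative sketch only, not Kepler's implementation; the function name `estimate_power` and its parameters are assumptions for demonstration.

```python
# Hypothetical sketch of a local linear regression power estimate:
# min-max normalize each usage metric, then take a weighted sum plus a bias.
def estimate_power(metrics, weights, bias, mins, maxs):
    """Predict power as bias + sum(w_i * normalized metric_i)."""
    power = bias
    for x, w, lo, hi in zip(metrics, weights, mins, maxs):
        # MinMax normalization using the per-metric range from training
        norm = (x - lo) / (hi - lo) if hi > lo else 0.0
        power += w * norm
    return power
```

With a single metric at half its observed range, a weight of 10 W and a 5 W bias, this yields 5 + 10 * 0.5 = 10 W.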