Kepler Operator v1 Working Doc

Design and spec discussion for Kepler Operator v1

Following the existing discussion here

CR and Controllers:

The proposed CRs

- machine-config
- kepler-system
- kepler-collected-metric
- kepler-exported-power

Instead of using integrated-operator-install to install Prometheus and Grafana via the operator, setting up the monitoring stack should be left to the user.
Each component should be represented as a separate CR and managed by a separate controller.
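To make the one-CR-per-controller idea concrete, here is a minimal sketch in Python (the operator itself would more likely be written in Go). The `make_reconciler` and `dispatch` helpers are hypothetical names, and the reconcilers are placeholders that only report what they would manage. Nothing here reflects an existing Kepler API.

```python
from typing import Callable, Dict

Reconciler = Callable[[dict], str]


def make_reconciler(component: str) -> Reconciler:
    """Build a placeholder reconciler that only reports what it would manage."""
    def reconcile(cr: dict) -> str:
        return f"{component} controller reconciled {cr['metadata']['name']}"
    return reconcile


# One controller per proposed CR (hypothetical registry, for illustration only).
CONTROLLERS: Dict[str, Reconciler] = {
    name: make_reconciler(name)
    for name in (
        "machine-config",
        "kepler-system",
        "kepler-collected-metric",
        "kepler-exported-power",
    )
}


def dispatch(component: str, cr: dict) -> str:
    """Route a CR to the single controller that owns its component."""
    if component not in CONTROLLERS:
        raise KeyError(f"no controller registered for {component!r}")
    return CONTROLLERS[component](cr)
```

Because Prometheus and Grafana are deliberately absent from the registry, a CR for the monitoring stack would be rejected rather than silently handled, which matches the intent of leaving that setup to the user.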

`kepler-system`

```
apiVersion: sustainability-computing-io/v1alpha1
kind: Kepler
metadata:
  name: kepler-system
  namespace: kepler-system
spec:
  scrape-interval:
  daemon:
    exporter:
      image:
      port: (default: 9102)
    estimator-sidecar:
      enabled: (default: false)
      image:
      mnt-path: (default: /tmp)
  model-server:
    enabled: (default: :warning: false)
    storage:
      type: (default: local? , values: local, hostpath, nfs, external (such as via s3))
      path: (default: models)
    sampling-period:
```
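As a rough illustration of how the defaults in this spec could be applied, the Python sketch below overlays a user-supplied spec on the stated defaults and checks the storage type against the listed values. The helper names `merge` and `resolve_spec` are hypothetical, and the `local` default for storage type is still marked as uncertain in the spec. Fields without a stated default are left out of `DEFAULTS`.

```python
import copy

# Only the defaults stated in the spec sketch; everything else must come from the user.
DEFAULTS = {
    "daemon": {
        "exporter": {"port": 9102},
        "estimator-sidecar": {"enabled": False, "mnt-path": "/tmp"},
    },
    "model-server": {
        "enabled": False,
        "storage": {"type": "local", "path": "models"},
    },
}

STORAGE_TYPES = {"local", "hostpath", "nfs", "external"}


def merge(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay user-supplied values on top of the defaults."""
    result = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = value
    return result


def resolve_spec(user_spec: dict) -> dict:
    """Return the effective spec, rejecting unknown storage types."""
    spec = merge(DEFAULTS, user_spec)
    storage_type = spec["model-server"]["storage"]["type"]
    if storage_type not in STORAGE_TYPES:
        raise ValueError(f"unsupported storage type: {storage_type!r}")
    return spec
```

For example, `resolve_spec({"daemon": {"exporter": {"image": "example/kepler:dev"}}})` keeps the user's image (a made-up name) while filling in port 9102 and the remaining defaults.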

Open Questions

What are kepler-collected-metric and kepler-exported-power meant to do? They seem to set some configuration. Where does Kepler use these configs?
- kepler-collected-metric: list of metrics to collect by the collector pkg, separated by input source.
- kepler-exported-power: list of metrics to export to Prometheus for each level (node, package, pod).
  
  These configurations currently live in two locations: exporter.go and config.go. However, these sections are supposed to be refactored so that they are set as environment variables via a ConfigMap:
```
apiVersion: v1
kind: ConfigMap
metadata:
  name: kepler-cfm
  namespace: monitoring
data:
  SOURCE.COUNTER: enabled
  SOURCE.CGROUP: enabled
  SOURCE.KUBELET: enabled
  SOURCE.GPU: enabled
  EXPORT_METRICS: cpu_cycles, cache_miss, cpu_time, ...
```
  Currently, the list of metrics from each source (COUNTER, CGROUP, etc.) is fixed for grouping power models. Should we change this? (low priority)
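The ConfigMap keys above could be consumed roughly as follows. This is a Python sketch of the parsing logic only, not the actual code in exporter.go or config.go. The function name `parse_kepler_config` is hypothetical, and treating any value other than `enabled` as "off" is an assumption, since the document only shows `enabled`.

```python
from typing import Dict, List, Mapping, Tuple

SOURCE_PREFIX = "SOURCE."


def parse_kepler_config(data: Mapping[str, str]) -> Tuple[Dict[str, bool], List[str]]:
    """Split ConfigMap-style key/value pairs into enabled sources and export metrics."""
    # Assumption: a source is on only when its value is exactly "enabled"
    # (ignoring case and surrounding whitespace).
    sources = {
        key[len(SOURCE_PREFIX):].lower(): value.strip().lower() == "enabled"
        for key, value in data.items()
        if key.startswith(SOURCE_PREFIX)
    }
    # EXPORT_METRICS is a comma-separated list; empty entries are dropped.
    raw = data.get("EXPORT_METRICS", "")
    metrics = [name.strip() for name in raw.split(",") if name.strip()]
    return sources, metrics


# Example input mirroring part of the ConfigMap's data section.
example = {
    "SOURCE.COUNTER": "enabled",
    "SOURCE.CGROUP": "enabled",
    "EXPORT_METRICS": "cpu_cycles, cpu_time",
}
sources, metrics = parse_kepler_config(example)
```

If the keys end up exposed as environment variables, passing `os.environ` instead of a dict should work the same way, since the function only needs a string-to-string mapping.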
Should the Operator, for now, just expose model weights and offer an option to also enable online training as long as energy metrics are supported? Or should the operator just use the model server for exposing the models? If we want to enable online training, how do we intend to store the new models? Should they just be stored on PVs?
