OpenWhisk distinguishes between system and user metrics (events).
System metrics typically contain information about system performance. They can be sent to Kamon or written to log files in logmarker format, and are typically used by OpenWhisk providers/operators.
User metrics encompass information about action performance, which is sent to Kafka in the form of events. These metrics are intended for OpenWhisk users, but they could also be used for billing or audit purposes. Note that at the moment the events are not directly exposed to users; processing them requires an additional micro-service based on a Kafka consumer.
Both capabilities can be enabled or disabled separately during deployment via Ansible configuration in the 'group_vars/all' file of an environment.
There are five configuration options available:
- metrics_log [true / false (default: true)]
  Enable/disable whether the metric information is written out to the log files in logmarker format.
  Beware: Even if set to false, all messages using the log markers are still written out to the log.
- metrics_kamon [true / false (default: false)]
  Enable/disable whether metric information is sent to the configured StatsD server.
- metrics_kamon_tags [true / false (default: false)]
  Enable/disable whether to use the Kamon tags when sending metrics.
  Notice: Tags are supported only in some Kamon backends (OpenTSDB, Datadog, InfluxDB).
- metrics_kamon_statsd_host [hostname or ip address]
  Hostname or IP address of the StatsD server.
- metrics_kamon_statsd_port [port number (default: 8125)]
  Port number of the StatsD server.
Example configuration:
metrics_kamon: true
metrics_kamon_tags: false
metrics_kamon_statsd_host: '192.168.99.100'
metrics_kamon_statsd_port: '8125'
metrics_log: true
The Kamon project provides an integrated docker image containing StatsD and a connected Grafana dashboard via this GitHub project. This image is helpful for testing the metrics sent via StatsD.
Please follow these instructions to start the docker image in your local docker environment.
The docker image exposes StatsD via the (standard) port 8125 and a Grafana dashboard via port 8080 on your docker host.
The address of your docker host has to be configured in the metrics_kamon_statsd_host configuration property.
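Before pointing OpenWhisk at the StatsD server, it can help to confirm that the server is reachable from your machine. The following is a minimal sketch that sends one counter increment over UDP using the plain StatsD line format (name:value|type). The helper names and the metric name openwhisk.counter.smoke_test are made up for illustration; OpenWhisk does not emit that metric.

```python
import socket


def statsd_line(name: str, value: int, metric_type: str = "c") -> str:
    """Format one metric in the plain StatsD line format: <name>:<value>|<type>."""
    return f"{name}:{value}|{metric_type}"


def send_test_counter(host: str, port: int = 8125,
                      name: str = "openwhisk.counter.smoke_test") -> int:
    """Send a single counter increment over UDP and return the bytes sent.

    The metric name is a hypothetical test name, not one OpenWhisk emits.
    UDP is fire-and-forget, so a successful send does not prove delivery.
    """
    payload = statsd_line(name, 1).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (host, port))
```

With the example configuration shown earlier, send_test_counter("192.168.99.100") would target port 8125 on the docker host. Whether the counter then shows up in Grafana depends on how the image is set up.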
All metric names are prefixed with a prefix that you specify, and they may be further modified by Graphite, Datadog, or StatsD. For example, if the prefix is openwhisk, metric names look like openwhisk.counter.controller_activation_start. This document assumes that the metric name prefix is openwhisk.
Currently OpenWhisk emits the following types of metrics:
- Counters record the count of a metric, and their names are prefixed with openwhisk.counter. For example, openwhisk.counter.controller_activation_start. Counters just count and reset to zero upon each flush.
- Histograms record the distribution of a given metric, and their names are prefixed with openwhisk.histogram. For example, openwhisk.histogram.controller_activation_finish. A histogram metric may result in multiple values at the metric aggregator level. For example, in Datadog the following values are recorded for each histogram metric:
  - my_metric.avg - Average of aggregated values during the flush interval.
  - my_metric.count - Count of aggregated values during the flush interval.
  - my_metric.median - Median of aggregated values during the flush interval.
  - my_metric.95percentile - 95th percentile value of aggregated values during the flush interval.
  - my_metric.max - Max of aggregated values during the flush interval.
  - my_metric.min - Min of aggregated values during the flush interval.
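To make the histogram-derived values concrete, the sketch below computes the six values for the samples recorded in one flush interval. It only illustrates the idea: the function name is hypothetical, and the nearest-rank percentile used here is an assumption, so an aggregator such as Datadog may compute the median and percentile slightly differently.

```python
import math
import statistics


def summarize_flush(values, name="my_metric"):
    """Summarize the values recorded for one histogram during a flush interval."""
    if not values:
        raise ValueError("no values recorded in this flush interval")
    ordered = sorted(values)
    # Nearest-rank 95th percentile (an assumption; aggregators may interpolate).
    rank = math.ceil(0.95 * len(ordered))
    return {
        f"{name}.avg": sum(ordered) / len(ordered),
        f"{name}.count": len(ordered),
        f"{name}.median": statistics.median(ordered),
        f"{name}.95percentile": ordered[rank - 1],
        f"{name}.max": ordered[-1],
        f"{name}.min": ordered[0],
    }
```

For example, summarize_flush([40, 10, 30, 20], "openwhisk.histogram.controller_activation_finish") yields one entry per suffix, which is why a single histogram metric shows up as several series in the aggregator.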
Below are some of the important metrics emitted by OpenWhisk setup
Metrics below are emitted from within a Controller instance.
- openwhisk.counter.controller_startup<controller_id>_count (counter) - Records the count of controller instance startups. Example: openwhisk.counter.controller_startup0_count
- openwhisk.counter.controller_blockingActivationDatabaseRetrieval_count (counter) - Records the count of activations the controller has retrieved from the activation store during blocking invocations.
The following metrics record stats around activation handling within the Controller.
- Normal actions
  - openwhisk.counter.controller_activation_start (counter) - Records the count of non blocking activations started.
  - openwhisk.histogram.controller_activation_finish (histogram) - Records the overall time taken for a non blocking activation to be submitted to the load balancer.
- Blocking actions
  - openwhisk.counter.controller_blockingActivation_start (counter) - Records the count of blocking activations started.
  - openwhisk.histogram.controller_blockingActivation_finish (histogram) - Records the time taken for a blocking activation to finish or time out.
Aggregate metrics for inflight activations.
- openwhisk.histogram.loadbalancer<controllerId>_activationsInflight_count (histogram) - Records the number of activations being worked upon for a given controller. As a histogram, it gives the distribution of the inflight activation count within a flush interval.
- openwhisk.histogram.loadbalancer<controllerId>_memoryInflight_count (histogram) - Records the amount of memory in use for inflight activations. This is not the actual runtime memory but the memory specified in the action limits.
Metrics below are captured within the load balancer.
- openwhisk.counter.loadbalancer_activations_count (counter) - Records the count of activations sent to Kafka.
- openwhisk.counter.controller_kafka_start (counter) - Records the count of activations sent to Kafka.
- openwhisk.counter.controller_kafka_error (counter) - Records the count of activations which encountered some failure while submitting to Kafka.
- openwhisk.histogram.controller_kafka_finish (histogram) - Records the time taken when an activation was successfully submitted to Kafka.
- openwhisk.histogram.controller_kafka_error (histogram) - Records the time taken when an activation submission to Kafka resulted in failure.
- openwhisk.counter.controller_loadbalancer_start (counter) - Records the count of activations submitted to the load balancer.
- openwhisk.histogram.controller_loadbalancer_finish (histogram) - Records the time taken to submit to the load balancer.
Metrics below are for invoker state as recorded within load balancer monitoring.
- openwhisk.counter.loadbalancer_invokerOffline_count - Records the count of invokers considered offline based on health pings.
- openwhisk.counter.loadbalancer_invokerUnhealthy_count - Records the count of invokers considered unhealthy based on health pings.
- openwhisk.counter.invoker_activationInit_start (counter) - Count of container initializations done.
- openwhisk.histogram.invoker_activationInit_finish (histogram) - Time taken for successful container initializations.
- openwhisk.histogram.invoker_activationInit_error (histogram) - Time taken when container initialization failed. The count of this histogram gives insight into the number of failed initializations.

- openwhisk.counter.invoker_activationRun_start (counter) - Count of action executions performed.
- openwhisk.histogram.invoker_activationRun_finish (histogram) - Time taken for successful action executions.
- openwhisk.histogram.invoker_activationRun_error (histogram) - Time taken for failed action executions. The count of this histogram gives insight into the number of failed executions.

- openwhisk.counter.invoker_containerStart.cold_count (counter) - Count of cold starts.
- openwhisk.counter.invoker_containerStart.recreated_count (counter) - Count of times a container is recreated.
- openwhisk.counter.invoker_containerStart.warm_count (counter) - Count of times a warm container is used.

- openwhisk.counter.invoker_collectLogs_start (counter) - Count of times logs were collected.
- openwhisk.counter.invoker_collectLogs_error (counter) - Count of failed log collections.
- openwhisk.histogram.invoker_collectLogs_error (histogram) - Time taken for failed log collection.
- openwhisk.histogram.invoker_collectLogs_finish (histogram) - Time taken for successful log collection.

- openwhisk.counter.invoker_activation_start (counter) - Count of activations handled.
The following metrics capture stats around various docker command executions.
- pause
openwhisk.counter.invoker_docker.pause_start
openwhisk.counter.invoker_docker.pause_error
openwhisk.histogram.invoker_docker.pause_finish
openwhisk.histogram.invoker_docker.pause_error
- ps
openwhisk.counter.invoker_docker.ps_start
openwhisk.counter.invoker_docker.ps_error
openwhisk.histogram.invoker_docker.ps_finish
openwhisk.histogram.invoker_docker.ps_error
- pull
openwhisk.counter.invoker_docker.pull_start
openwhisk.counter.invoker_docker.pull_error
openwhisk.histogram.invoker_docker.pull_finish
openwhisk.histogram.invoker_docker.pull_error
- rm
openwhisk.counter.invoker_docker.rm_start
openwhisk.counter.invoker_docker.rm_error
openwhisk.histogram.invoker_docker.rm_finish
openwhisk.histogram.invoker_docker.rm_error
- run
openwhisk.counter.invoker_docker.run_start
openwhisk.counter.invoker_docker.run_error
openwhisk.histogram.invoker_docker.run_finish
openwhisk.histogram.invoker_docker.run_error
- unpause
openwhisk.counter.invoker_docker.unpause_start
openwhisk.counter.invoker_docker.unpause_error
openwhisk.histogram.invoker_docker.unpause_finish
openwhisk.histogram.invoker_docker.unpause_error
Metrics below are emitted per Kafka topic.
- openwhisk.histogram.kafka_<topic name>.delay_start - Time delay between when a message was pushed to Kafka and when it is read within a consumer. This metric is recorded for every message read.
- openwhisk.histogram.kafka_<topic name>_count - Records the queue size of the topic. By default this metric is emitted every 60 seconds.
Metrics per topic
- cacheInvalidation - Emitted per controller while reading the cache invalidation messages.
  - openwhisk.histogram.kafka_cacheInvalidation.delay_start
  - openwhisk.histogram.kafka_cacheInvalidation_count.count
- health - Emitted per controller while reading the invoker health pings.
  - openwhisk.histogram.kafka_health.delay_start
  - openwhisk.histogram.kafka_health_count
- completed<controllerId> - Topic to receive completed activations. This is emitted per controller for its own topic. For example, for controller id 0 the metric names would be:
  - openwhisk.histogram.kafka_completed0.delay_start
  - openwhisk.histogram.kafka_completed0_count
- invoker<invokerId> - Topic to receive activations to complete. This is emitted per invoker for its own topic. For example, for invoker id 0 the metric names would be:
  - openwhisk.histogram.kafka_invoker0_count
  - openwhisk.histogram.kafka_invoker0.delay_start
- openwhisk.counter.database_cacheHit_count - Count of cache hits.
- openwhisk.counter.database_cacheMiss_count - Count of cache misses.
Metrics below are emitted for database related operations and follow a pattern:
- openwhisk.counter.database_<operation type>_start - Count of database operations done for the given type. Example: openwhisk.counter.database_getDocument_start.
- openwhisk.counter.database_<operation type>_error - Count of database operations done for the given type which resulted in an error. Example: openwhisk.counter.database_getDocument_error.
- openwhisk.histogram.database_<operation type>_finish - Time taken for successful completion of the given database operation. Example: openwhisk.histogram.database_getDocument_finish.
- openwhisk.histogram.database_<operation type>_error - Time taken for failed completion of the given database operation. Example: openwhisk.histogram.database_getDocument_error.
Operation types:
- deleteDocument
- getDocument
- queryView
- saveDocument
- saveDocumentBulk
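When setting up dashboards or alerts it can be handy to expand the database metric pattern into concrete names. The sketch below does that for the operation types listed above. The function and constant names are hypothetical, and the prefix defaults to openwhisk as assumed throughout this document.

```python
# Operation types as listed in this document.
OPERATION_TYPES = (
    "deleteDocument",
    "getDocument",
    "queryView",
    "saveDocument",
    "saveDocumentBulk",
)


def database_metric_names(operation, prefix="openwhisk"):
    """Return the four metric names emitted for one database operation type."""
    if operation not in OPERATION_TYPES:
        raise ValueError(f"unknown operation type: {operation}")
    return [
        f"{prefix}.counter.database_{operation}_start",
        f"{prefix}.counter.database_{operation}_error",
        f"{prefix}.histogram.database_{operation}_finish",
        f"{prefix}.histogram.database_{operation}_error",
    ]


def all_database_metric_names(prefix="openwhisk"):
    """Expand the pattern for every known operation type."""
    return [n for op in OPERATION_TYPES for n in database_metric_names(op, prefix)]
```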
When the database used is CosmosDB, metrics related to CosmosDB Request Units (RU) are also emitted.
If Kamon tags are enabled, the metric name is openwhisk.counter.cosmosdb_ru_used with the following tags:
- mode - read or write
- collection - Name of the collection. Example: activations, whisks and subjects
- action - Type of operation performed. Example: get, put, del, query and count
If Kamon tags are not enabled, the metric name is of the form openwhisk.counter.cosmosdb.ru.<collection>.<action>
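The two CosmosDB naming schemes can be summarized in a small helper. This is only a sketch of the rule described above, not OpenWhisk code: the function name and the returned (name, tags) pair are assumptions made for illustration.

```python
def cosmosdb_ru_metric(collection, action, mode, tags_enabled, prefix="openwhisk"):
    """Return (metric_name, tags) for a CosmosDB RU measurement.

    With Kamon tags enabled there is one metric name and the details travel
    as tags; without tags, collection and action are encoded in the name.
    """
    if tags_enabled:
        tags = {"mode": mode, "collection": collection, "action": action}
        return f"{prefix}.counter.cosmosdb_ru_used", tags
    return f"{prefix}.counter.cosmosdb.ru.{collection}.{action}", {}
```

Note that in the untagged form, as described above, the mode (read or write) is not part of the metric name.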
User metrics are enabled by default and can be explicitly disabled by setting the following property in one of the Ansible configuration files:
user_events: false
Activation is an event that occurs after each activation. It includes the following execution metadata:
waitTime - internal system hold time
initTime - time it took to initialize an action, e.g. docker init
statusCode - status code of the invocation: 0 - success, 1 - application error, 2 - action developer error, 3 - internal OpenWhisk error
duration - actual time the action code was running
kind - action flavor, e.g. Node.js
conductor - true for conductor backed actions
memory - maximum memory allowed for action container
causedBy - contains the "causedBy" annotation (can be "sequence" or nothing at the moment)
Metric is any user-specific event produced by the system. At the moment it includes the following information:
ConcurrentRateLimit - a user has exceeded its limit for concurrent invocations.
TimedRateLimit - the user has reached its per minute limit for the number of invocations.
ConcurrentInvocations - the number of in flight invocations per user.
Example events that could be consumed from Kafka. Activation:
{"body":{"statusCode":0,"duration":3,"name":"whisk.system/invokerHealthTestAction0","waitTime":583915671,"conductor":false,"kind":"nodejs:6","initTime":0,"memory": 256, "causedBy": false},"eventType":"Activation","source":"invoker0","subject":"whisk.system","timestamp":1524476122676,"userId":"d0888ad5-5a92-435e-888a-d55a92935e54","namespace":"whisk.system"}
Metric:
{"body":{"metricName":"ConcurrentInvocations","metricValue":1},"eventType":"Metric","source":"controller0","subject":"guest","timestamp":1524476104419,"userId":"23bc46b1-71f6-4ed5-8c54-816aa4f8c502","namespace":"guest"}
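Since the events are plain JSON, a consumer service can decode them with any JSON library. The sketch below shows one possible way to turn a raw event into a small summary, mapping statusCode to the meanings listed earlier. Reading from Kafka itself is left out, and the function name and the shape of the returned dictionary are assumptions for illustration only.

```python
import json

# Status code meanings as listed for the Activation event.
STATUS_CODES = {
    0: "success",
    1: "application error",
    2: "action developer error",
    3: "internal OpenWhisk error",
}


def summarize_event(raw: str) -> dict:
    """Decode one user event (as a JSON string) into a flat summary."""
    event = json.loads(raw)
    body = event["body"]
    summary = {
        "type": event["eventType"],
        "namespace": event["namespace"],
        "source": event["source"],
    }
    if event["eventType"] == "Activation":
        summary["action"] = body["name"]
        summary["status"] = STATUS_CODES.get(body["statusCode"], "unknown")
        summary["duration"] = body["duration"]
    elif event["eventType"] == "Metric":
        summary["metric"] = body["metricName"]
        summary["value"] = body["metricValue"]
    return summary
```

Applied to the Metric example above, this returns the type Metric, the namespace guest, the source controller0, the metric ConcurrentInvocations and the value 1.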