- Topics
- Health Modeling
- Monitoring Implementation
- Log Analytics Workspace
- Diagnostic Settings
- Action Groups
- Multi Resource Alerts
- Service Health
- Resource Health
- Virtual Machines
- Recovery Vaults
- Containers
- Storage
- App Service
- Visualisation
After mapping out the service, we need to determine the Key Performance Indicators (KPIs) for each component, how they are recorded, how we can monitor them and, if applicable, what threshold should be set for alerting. These need to be recorded so that we can implement the alerts and dashboards following the KPI discussion. An individual Monitoring Pattern sheet should be saved for each customer, recording the thresholds and key performance indicators that have been agreed and are to be configured later. For a list of supported metrics that can be used for each resource type, please refer to the platform metrics documentation.
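As an illustration of what a Monitoring Pattern entry could capture, the sketch below models one KPI per row and writes the rows out as CSV. The `KpiRecord` class, its field names and the sample values are assumptions made for this example; they are not the layout of the Monitoring Pattern sheet itself.

```python
import csv
import io
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class KpiRecord:
    """One row of a hypothetical Monitoring Pattern sheet."""
    component: str              # e.g. "Virtual Machines"
    metric: str                 # platform metric or log signal name
    source: str                 # where the signal is recorded
    threshold: Optional[float]  # None means "no alert, dashboard only"
    on_dashboard: bool          # include in the central operations dashboard?


def to_csv(records):
    """Serialise the records as CSV text so they can be saved per customer."""
    fields = ["component", "metric", "source", "threshold", "on_dashboard"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


# Sample values are placeholders, not recommended thresholds.
records = [
    KpiRecord("Virtual Machines", "Percentage CPU", "platform metric", 90.0, True),
    KpiRecord("Storage", "Availability", "platform metric", None, True),
]
print(to_csv(records))
```

Leaving `threshold` empty is one way to mark a KPI that is visualised but never raises an alert.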
Alerts are notifications of system health issues that are found during monitoring. Alerts only deliver value if they are actionable and effectively prioritized. Careful consideration is needed to confirm whether an alert needs to be raised or whether the signal should instead be included in dashboards or visualisations.
Visualisations need to be created for an operational dashboard to give insight into application health and enable quick diagnosis of issues that will affect the availability and health of the application. Deeper insights can be provided through workbooks. For each KPI, it should be noted in the Monitoring Pattern whether it is to be included in a central operations dashboard. Note also that many of the components have built-in visualisations, insights and workbooks that can be used for deeper insights and can also be pinned to operational dashboards.
For each component discovered in the service scoping, discuss whether diagnostic settings need to be enabled for resource logs and metrics. These can be for observability, audit, security or compliance purposes.
A discussion is needed on how action groups are set up. Do different people or teams need to be informed depending on the component or issue? Does the customer need alerts sent to a third-party system or ITSM tool? Action groups will only be set up for email notifications in this release. If other actions are required for alerts, these can be added subsequently to the action groups created. Action groups carry a short and a long name. These should adhere to the agreed naming convention and be in line with other naming conventions already set within the current Azure deployment.
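To make the short and long name concrete, here is a minimal sketch that builds an ARM-style resource for an email-only action group. It is not the template shipped in the repository; the function name, the sample names and addresses, and the `apiVersion` value are assumptions for illustration. The check reflects the 12-character limit Azure places on the short name.

```python
import json

MAX_SHORT_NAME = 12  # Azure limits the action group short name to 12 characters


def email_action_group(name, short_name, recipients):
    """Build an ARM-style resource for an email-only action group.

    `recipients` maps a receiver name to an email address.
    """
    if len(short_name) > MAX_SHORT_NAME:
        raise ValueError(
            f"short name '{short_name}' exceeds {MAX_SHORT_NAME} characters"
        )
    return {
        "type": "Microsoft.Insights/actionGroups",
        "apiVersion": "2021-09-01",  # assumption: use the version you deploy with
        "name": name,
        "location": "Global",
        "properties": {
            "groupShortName": short_name,
            "enabled": True,
            "emailReceivers": [
                {"name": n, "emailAddress": addr, "useCommonAlertSchema": True}
                for n, addr in recipients.items()
            ],
        },
    }


# Hypothetical names and address, following an assumed naming convention.
group = email_action_group(
    "ag-app1-platform-prod",
    "app1-plat",
    {"Platform Team": "platform-team@example.com"},
)
print(json.dumps(group, indent=2))
```

Validating the short name before deployment catches naming-convention clashes early, since a long name that fits the convention may not shorten cleanly to 12 characters.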
In this section we cover the various components discovered in the workshop, to discuss what the KPIs are for each component and what alert thresholds we can set up for them.
Review the current Service Health and Resource Health configuration and confirm the criteria for the service and resource health alerts: whether the customer wants alerts for all service health events, and which criteria they want for the resource health alerts. An example has been set in the Monitoring Pattern sheet. Careful consideration should be given to whether the customer wants to be alerted for User Initiated as well as Platform Initiated resource health events. An example of a user initiated event would be a virtual machine being gracefully shut down. To avoid false positives, it is advised to filter out unknown health events. The example template in the repository filters out these events.
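The filtering decisions above can be sketched as a small predicate. This is an illustration of the logic only, not the repository template: the event is assumed to be a simplified dictionary carrying the `currentHealthStatus`, `previousHealthStatus` and `cause` properties of a resource health event, and the function name and `include_user_initiated` flag are hypothetical.

```python
ALERT_STATUSES = {"Unavailable", "Degraded"}


def should_alert(event, include_user_initiated=False):
    """Decide whether a simplified resource health event should raise an alert."""
    current = event.get("currentHealthStatus")
    previous = event.get("previousHealthStatus")
    cause = event.get("cause")

    # Transitions into or out of Unknown are a common source of false positives.
    if "Unknown" in (current, previous):
        return False
    # Only alert when the resource is actually unhealthy.
    if current not in ALERT_STATUSES:
        return False
    # User initiated events (e.g. a graceful VM shutdown) are opt-in.
    if cause == "UserInitiated" and not include_user_initiated:
        return False
    return True


event = {
    "currentHealthStatus": "Unavailable",
    "previousHealthStatus": "Available",
    "cause": "PlatformInitiated",
}
print(should_alert(event))
```

Whether `include_user_initiated` should be switched on is exactly the question to settle with the customer and record in the Monitoring Pattern.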
Review the key components in the design to confirm whether any activity log alerts should be enabled. Examples could include route table or NSG updates, or deletion of resources in production. Other examples could be autoscale events for VMs or a failure to autoscale. An example ARM template can be found here. The activity log should also be collected into a Log Analytics workspace. The built-in policy definition named "Configure Azure Activity logs to stream to specified Log Analytics workspace" should be used to configure this.
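For orientation, an activity log alert for an NSG update could be shaped roughly as follows. This is a sketch and not the example template referenced above; the function name, the placeholder resource IDs and the `apiVersion` value are assumptions.

```python
import json


def activity_log_alert(name, scope, operation_name, action_group_id):
    """Build an ARM-style activity log alert for one administrative operation."""
    return {
        "type": "Microsoft.Insights/activityLogAlerts",
        "apiVersion": "2020-10-01",  # assumption: use the version you deploy with
        "name": name,
        "location": "Global",
        "properties": {
            "enabled": True,
            "scopes": [scope],
            "condition": {
                # Every entry in allOf must match for the alert to fire.
                "allOf": [
                    {"field": "category", "equals": "Administrative"},
                    {"field": "operationName", "equals": operation_name},
                ]
            },
            "actions": {"actionGroups": [{"actionGroupId": action_group_id}]},
        },
    }


# Placeholder IDs; substitute the real subscription and action group.
subscription = "/subscriptions/00000000-0000-0000-0000-000000000000"
alert = activity_log_alert(
    "ala-nsg-write",
    subscription,
    "Microsoft.Network/networkSecurityGroups/write",
    subscription + "/resourceGroups/rg-monitoring/providers/"
    "Microsoft.Insights/actionGroups/ag-app1-platform-prod",
)
print(json.dumps(alert, indent=2))
```

Keeping one operation per rule, as here, makes it straightforward to record each rule against its KPI in the Monitoring Pattern.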
Discuss which endpoints need to be monitored and whether they are publicly or internally available. These should be recorded in the Monitoring Pattern for later reference. It can then be determined whether they need to be monitored via Connection Monitor or Application Insights.
Discuss which Virtual Machines are in place. Do they all need the same thresholds, or do different thresholds need to be set for different VMs in scope? Some example thresholds have been set in the Monitoring Pattern sheet. Review current CPU and memory usage via metrics explorer to determine current levels and what thresholds should be set initially.
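One possible way to turn the observed usage into a starting threshold is to take a high percentile of recent samples and add some headroom. The sketch below shows that heuristic; the percentile, the headroom value and the function names are assumptions for illustration, not recommendations from the Monitoring Pattern.

```python
import math


def nearest_rank_percentile(samples, pct):
    """Return the nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def initial_threshold(samples, pct=95, headroom=10.0, ceiling=100.0):
    """Suggest a starting alert threshold from observed utilisation (percent)."""
    if not samples:
        raise ValueError("at least one sample is required")
    return min(ceiling, nearest_rank_percentile(samples, pct) + headroom)


# Hypothetical CPU readings (percent) exported from metrics explorer.
cpu_samples = [20, 35, 40, 42, 45, 50, 55, 60, 62, 70]
print(initial_threshold(cpu_samples))
```

Any value produced this way is only a first guess and should be agreed with the customer before it is written into the Monitoring Pattern.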
AKS clusters now support predefined alerts. Review these alerts to confirm whether the customer is happy to enable them.
Key Vault availability can be crucial for the operation of the application. Example thresholds have been set in the Monitoring Pattern sheet for some key metrics in the operation of Key Vault. These include Key Vault saturation and availability.
Review the storage metrics available for effective monitoring. Example thresholds for non-dimension-specific alerts have been set in the example Monitoring Pattern.
Storage metrics also expose a set of dimensions that allow more effective monitoring of known errors, depending on the storage type. Review these to see whether further alerts should be set.
< App Insights>
The customer should confirm whether a new workspace needs to be created or an existing one can be used.
Please review the Workspace Considerations and the Management and Monitoring of Enterprise-Scale documentation when deciding whether to create a new workspace or use an existing one, and who can access it.
The workspace will be used to collect any relevant event and log data from the various sources we need to monitor in Azure.
Please follow Create a Log Analytics Workspace if a new workspace needs to be created.
Review the diagnostic settings for all key components. Enable them if diagnostics are required for observability purposes.
Diagnostics should be set via Policy. Review each resource type to determine whether diagnostics are required and whether they need to be set. A policy initiative can be set for diagnostic settings and should be assigned at Management Group level if possible. Not all resource types have built-in policies. A custom initiative can be created using the Diagnostics Policy Scripts repository, which can set up a policy initiative for diagnostics.
Following the KPI review, the relevant action groups should be set up for the alerts. These will be the relevant mail teams for each component discussed in the monitoring discussion workshop. An example ARM template can be found in the templates section of the repository, with the relevant parameter file.
Where possible, and where the resources can be scoped together, multi-resource alerts should be set. This enables one alert rule to be configured for many resources, targeted via a scope at Resource Group or Subscription level. Check the currently supported resource types before scoping a rule.
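To show how a single rule can cover many resources, the sketch below builds an ARM-style metric alert scoped to a resource group and targeted at one resource type and region. It is an illustration rather than a repository template; the function name, rule name, IDs, threshold, evaluation frequency and window size are assumptions.

```python
import json


def multi_resource_metric_alert(name, scope, resource_type, region,
                                metric_name, threshold, action_group_id):
    """Build an ARM-style metric alert that targets every matching resource in `scope`."""
    return {
        "type": "Microsoft.Insights/metricAlerts",
        "apiVersion": "2018-03-01",
        "name": name,
        "location": "global",
        "properties": {
            "severity": 2,
            "enabled": True,
            "scopes": [scope],                    # resource group or subscription ID
            "targetResourceType": resource_type,  # needed when scoping above a resource
            "targetResourceRegion": region,
            "evaluationFrequency": "PT5M",
            "windowSize": "PT15M",
            "criteria": {
                "odata.type": "Microsoft.Azure.Monitor.MultipleResourceMultipleMetricCriteria",
                "allOf": [
                    {
                        "name": "criterion1",
                        "criterionType": "StaticThresholdCriterion",
                        "metricName": metric_name,
                        "operator": "GreaterThan",
                        "threshold": threshold,
                        "timeAggregation": "Average",
                    }
                ],
            },
            "actions": [{"actionGroupId": action_group_id}],
        },
    }


# Placeholder IDs and threshold; take the real values from the Monitoring Pattern.
rg = "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-app1-prod"
rule = multi_resource_metric_alert(
    "ma-vm-cpu-high", rg, "Microsoft.Compute/virtualMachines", "westeurope",
    "Percentage CPU", 90,
    rg + "/providers/Microsoft.Insights/actionGroups/ag-app1-platform-prod",
)
print(json.dumps(rule, indent=2))
```

Because the rule is scoped to the resource group, virtual machines added to it later in the same region are picked up without a new rule being created.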
Service health alerts should be set up in line with the KPI discussion session. Example templates for setting up a service health alert can be found here.
Resource health alerts should be set up in accordance with the KPI discussion. Example ARM templates for deployment can be found in the templates section. The example template only alerts on Unavailable or Degraded health events that have been Platform initiated.
To monitor Virtual Machines we will use VM Insights. This gives us a standardized and recommended set of data collection and built-in visualisations to help identify performance bottlenecks and dependencies.
Please follow Add VM Insights to add the solution to the workspace.
The virtual machines will require the agents to be installed. It is recommended to use Azure Policy to deploy the agents. Please use the Enable VM Insights initiative to deploy the Log Analytics and Dependency agents to the targeted VMs at the relevant scope.
Where possible, metric alerts should be set in accordance with the Monitoring Pattern. This enables real-time monitoring scenarios. If log alerts are necessary, please review the sample alert queries and raise the necessary alerts. It is recommended that the Log Analytics alerts are set to be stateful, which is currently in preview.
Recovery vault diagnostics should be set up using policy. Please refer to configure Vault Diagnostics at scale to set this up. If, following the KPI discussion, we are to use the Recovery Vault backup alerts, these need to be configured directly on the console, as there is no PS/CLI/REST API/Azure Resource Manager template support for these alerts. If other alerting needs to be set up please refer to the
For the AKS cluster, please enable container insights using one of the methods listed. For long-term supportability, using the customer policy would be the recommended method for deployment.
Container insights provides a recommended set of [metric alerts].
For storage, review Storage Insights, which is automatically enabled for any storage account.
Ensure diagnostic settings are enabled for the App Service and see https://docs.microsoft.com/en-us/azure/app-service/monitor-app-service
An operational dashboard should be created to give an overview of the application's health. It should provide quick insights into, and observability of, that health. For deeper insights, workbooks should be considered. Visualisations can be pinned from segments of workbooks, directly from metrics explorer, or from KQL queries that are scoped to the application's monitoring pattern. For each of the elements, review the Azure Monitor - Workbooks public templates. These contain pinnable and customisable workbooks for a variety of key components.