Update monitoring guide images & broken path (#168)
* Update > Guides > Monitoring > Images

* Update broken path in docs/guides/hosting-guardrails/index.md & cleanup

* Update image to be specific and cleanup

* Content cleanup

* Update > image and text for investigate event flood

* Cleanup and split the single Step 6 in docs/guides/hosting-guardrails/monitoring/investigate-event-flood/index.md into multiple steps for clarity

---------

Co-authored-by: raj <[email protected]>
RahulSrivastav14 and rajlearner17 authored Oct 18, 2024
1 parent b50af84 commit 7615a55
Showing 20 changed files with 35 additions and 24 deletions.
10 changes: 5 additions & 5 deletions docs/guides/hosting-guardrails/index.md
@@ -15,8 +15,8 @@ The Guardrails Enterprise installation is highly customizable, allowing you to d

| |
| - | -
| [Architecture Guide](architecture) | Detailed logical, physiscal, network and application architecture information on hosting guardrails.
| [Installation Guides](installation) | Guides to install Guardrails in your AWS account.
| [Montoring Guides](monitoring) | How to proactivly monitor your Guardrails infrastructure.
| [Recovery Guides](restore) | How to recover a Guardrails environment from backup.
| [Troubleshooting Guides](troubleshooting) | How to assess and fix common hosting issues.
| [Architecture Guide](guides/hosting-guardrails/architecture) | Detailed logical, physical, network and application architecture information on hosting guardrails.
| [Installation Guides](guides/hosting-guardrails/installation) | Guides to install Guardrails in your AWS account.
| [Monitoring Guides](guides/hosting-guardrails/monitoring) | How to proactively monitor your Guardrails infrastructure.
| [Recovery Guides](guides/hosting-guardrails/restore) | How to recover a Guardrails environment from backup.
| [Troubleshooting Guides](guides/hosting-guardrails/troubleshooting) | How to assess and fix common hosting issues.
@@ -53,19 +53,22 @@ Choose **Log Groups** from the left navigation menu.

## Step 6: Search Log Group

Search for log groups with the prefix **/aws/lambda/turbot_** followed by the workspace version.
Search for log groups using a keyword based on the workspace version obtained in [Step 3](#step-3-view-logs). This lists the matching log group names, which have the prefix `/aws/lambda/turbot_` followed by the workspace version.

![Search Log Group](/images/docs/guardrails/guides/hosting-guardrails/monitoring/diagnose-control-error/cloudwatch-log-groups-select.png)

## Step 7: Select Log Group

Select the **worker** log group as indicated in the **type** field from the error log in the Guardrails console. Choose **Search all log steams**.
Select the worker log group, as indicated by the type field of the error log in the Guardrails console, for example `/aws/lambda/turbot_5_47_2_rc_1_worker`. Choose **Search all log streams**.

![Worker Log Group](/images/docs/guardrails/guides/hosting-guardrails/monitoring/diagnose-control-error/cloudwatch-select-search-all-log-streams.png)
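
The naming pattern from Steps 6 and 7 can be sketched as a small helper. This is only an illustration: the function name `worker_log_group` is hypothetical, and the rule that every run of non-alphanumeric characters in the version becomes one underscore is an assumption inferred from the single example name above, not a documented Guardrails convention.

```python
import re

# Prefix taken from Step 6 of the guide.
LOG_GROUP_PREFIX = "/aws/lambda/turbot_"


def worker_log_group(version: str) -> str:
    """Build a worker log group name from a TE version string.

    Assumption: each run of non-alphanumeric characters in the version
    is replaced by a single underscore, as the example in Step 7 suggests.
    """
    slug = re.sub(r"[^0-9A-Za-z]+", "_", version.strip()).strip("_")
    if not slug:
        raise ValueError("version must contain at least one alphanumeric character")
    return f"{LOG_GROUP_PREFIX}{slug}_worker"


# The version format "5.47.2-rc.1" is assumed for illustration.
print(worker_log_group("5.47.2-rc.1"))  # /aws/lambda/turbot_5_47_2_rc_1_worker
```

Always confirm the generated name against the list shown in the CloudWatch console before relying on it.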

## Step 8: Search Error

Search using the **errorId** retrieved from the Guardrails console control error log.
Search using the `errorId` obtained in [Step 3](#step-3-view-logs) from the control error log in the Guardrails console.

> [!NOTE]
> Enclose the errorId in double quotes, e.g. "3423432-dfdsf-3e331-fgdfgd234234".
![Search with Error Id](/images/docs/guardrails/guides/hosting-guardrails/monitoring/diagnose-control-error/cloudwatch-loggroups-search-with-errorid.png)
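
If you prefer to run the same search from a script, the quoting rule in the note can be captured in a helper. The sketch below is not part of the guide: `quoted_filter_term` and `search_error` are hypothetical names, and the second function assumes the `boto3` SDK is installed and AWS credentials are configured. It uses the CloudWatch Logs `filter_log_events` call and reads only the first page of results.

```python
def quoted_filter_term(error_id: str) -> str:
    """Wrap an errorId in double quotes for use as a search term."""
    cleaned = error_id.strip().strip('"')
    if not cleaned or '"' in cleaned:
        raise ValueError("errorId must be non-empty and contain no double quotes")
    return f'"{cleaned}"'


def search_error(log_group: str, error_id: str, start_ms: int, end_ms: int):
    """Return matching events from one log group (first page only).

    start_ms and end_ms are epoch timestamps in milliseconds.
    """
    import boto3  # imported lazily so the quoting helper works without the SDK

    logs = boto3.client("logs")
    response = logs.filter_log_events(
        logGroupName=log_group,
        filterPattern=quoted_filter_term(error_id),
        startTime=start_ms,
        endTime=end_ms,
    )
    return response["events"]


print(quoted_filter_term("3423432-dfdsf-3e331-fgdfgd234234"))
```

The helper also accepts an ID that is already quoted, so pasting the value straight from the console does not produce doubled quotes.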

@@ -31,33 +31,38 @@ Choose **Dashboards** from the left navigation menu.

## Step 3: Select Dashboard

Select the Turbot Guardrails Enterprise (TE) CloudWatch dashboard, which is typically named after the TE version in use.
In **Custom dashboards**, select the Turbot Guardrails Enterprise (TE) CloudWatch dashboard, which is typically named after the TE version in use.

![TE Dashboard](/images/docs/guardrails/guides/hosting-guardrails/monitoring/investigate-event-flood/cloudwatch-select-te-dashboard.png)

## Step 4: View Events Queue

Select the duration and check the **Events Queue Backlog** graph in the TE CloudWatch dashboard that indicates the flood state.
Select the desired duration from the time range option in the top-right corner, and check the **Events Queue Backlog** graph in the TE CloudWatch dashboard for spikes indicating an event flood state.

![Events Queue Backlog](/images/docs/guardrails/guides/hosting-guardrails/monitoring/investigate-event-flood/cloudwatch-dashboard-events-queue-backlog.png)

## Step 5: Identify Noisy Tenant

In the **Activities** section of the TE Dashboard, use the **View All Messages By Workspace** widget to filter and identify the noisy tenant causing the issues.
Scroll down on the same dashboard page to the **Activities** section, then use the **View All Messages By Workspace** widget to filter and identify the noisy tenant causing the issues.
The number of messages received by the top tenant over a specified duration, along with the difference between the top three tenants, can be a strong indicator of an event flood.

![View All Messages By Workspace](/images/docs/guardrails/guides/hosting-guardrails/monitoring/investigate-event-flood/cloudwatch-view-messages-by-workspace.png)
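
The comparison described in this step can be sketched in a few lines. Everything here is illustrative: the function names, the workspace names, the message counts, and the `ratio_threshold` of 5 are made up for the example and are not Guardrails defaults. The sketch only compares the busiest workspace with the runner-up.

```python
def top_tenants(message_counts, n=3):
    """Return the n workspaces with the most messages, busiest first."""
    ranked = sorted(message_counts.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


def looks_like_flood(message_counts, ratio_threshold=5.0):
    """Flag a possible flood when the top workspace dwarfs the runner-up.

    ratio_threshold is an illustrative value, not a Guardrails default.
    """
    top = top_tenants(message_counts, n=2)
    if len(top) < 2:
        return False  # nothing to compare against
    (_, first), (_, second) = top
    if second == 0:
        return first > 0
    return first / second >= ratio_threshold


# Hypothetical counts read off the widget for the selected duration.
sample = {"workspace-a": 48000, "workspace-b": 2100, "workspace-c": 1900}
print(top_tenants(sample))
print(looks_like_flood(sample))
```

A ratio check like this is only a starting point; the right threshold depends on how evenly your workspaces normally generate events.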

## Step 6: Identify Cause
## Step 6: Analyze Log Insights

With the workspace identified, navigate to **CloudWatch > Logs Insights**, select the appropriate worker log group for the TE version and choose the desired query duration to proceed to investigate further by analyzing events, event sources, and account IDs for the workspace.
With the workspace identified in the previous step, navigate to **CloudWatch > Logs Insights**, select the appropriate worker log group for the TE version(s), and choose the desired query duration. This opens the query editor with the selected log group(s), where you can investigate further by analyzing events, event sources, and account IDs for the workspace.

> [!IMPORTANT]
> Longer durations increase the amount of log data scanned and the query time, which may result in higher billing costs for CloudWatch.
![View All Messages By Workspace](/images/docs/guardrails/guides/hosting-guardrails/monitoring/investigate-event-flood/cloudwatch-log-insights.png)

Use this query to identify **External Messages by Accounts in a Tenant**.
> [!NOTE]
> You can select multiple TE version log groups if required.
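
For scripted investigations, the same choices (log groups and duration) map onto the arguments of the CloudWatch Logs `start_query` API. The sketch below only builds those arguments; `insights_query_args` is a hypothetical helper, the log group name is the example from the control error guide, and the query string is a placeholder rather than one of the queries from this guide.

```python
from datetime import datetime, timedelta, timezone


def insights_query_args(log_groups, query, hours=1, now=None):
    """Build keyword arguments for a Logs Insights start_query call.

    startTime and endTime are epoch seconds, as the API expects.
    Keeping `hours` small limits the amount of log data scanned.
    """
    if not log_groups:
        raise ValueError("select at least one log group")
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    return {
        "logGroupNames": list(log_groups),  # more than one TE version is allowed
        "startTime": int(start.timestamp()),
        "endTime": int(end.timestamp()),
        "queryString": query,
    }


args = insights_query_args(
    ["/aws/lambda/turbot_5_47_2_rc_1_worker"],
    "fields @timestamp, @message | limit 20",  # placeholder query
    hours=2,
)
print(args["endTime"] - args["startTime"])  # 7200
```

With `boto3` installed and credentials configured, these arguments could be passed as `boto3.client("logs").start_query(**args)`, and the returned `queryId` polled with `get_query_results`.
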
## Step 7: External Messages by Accounts in a Tenant

In the query editor, use the query below to identify the AWS `AccountId`(s) contributing to the events.

```
fields @timestamp, @message
@@ -68,7 +73,9 @@ fields @timestamp, @message
```
![Accounts Generating Events](/images/docs/guardrails/guides/hosting-guardrails/monitoring/investigate-event-flood/cloudwatch-log-insights-events-by-account.png)
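
When the query is run through the API instead of the console, `get_query_results` returns each row as a list of `{"field": ..., "value": ...}` cells. The sketch below ranks such rows by count. The helper names are hypothetical, and the column names `AccountId` and `Count` are assumptions, since the full query is not shown in this diff. The account IDs and counts are made-up sample values.

```python
def rows_to_dicts(results):
    """Convert Logs Insights result rows into plain dictionaries."""
    return [{cell["field"]: cell["value"] for cell in row} for row in results]


def top_by_count(results, key_field, count_field="Count", limit=5):
    """Return (key, count) pairs sorted by count, highest first."""
    rows = rows_to_dicts(results)
    ranked = sorted(rows, key=lambda row: int(row[count_field]), reverse=True)
    return [(row[key_field], int(row[count_field])) for row in ranked[:limit]]


# Hypothetical rows in the shape returned by get_query_results.
sample_results = [
    [{"field": "AccountId", "value": "111111111111"}, {"field": "Count", "value": "120"}],
    [{"field": "AccountId", "value": "222222222222"}, {"field": "Count", "value": "9800"}],
]
print(top_by_count(sample_results, key_field="AccountId"))
```

Values arrive as strings, which is why the counts are converted with `int` before sorting.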

Next, use this query to identify **External Messages by Source for a Tenant**.
## Step 8: External Messages by Source for a Tenant

Use the query below to identify the specific event `Source` among the different services.

```
fields @timestamp, @message
@@ -80,7 +87,9 @@ fields @timestamp, @message

![Event Source](/images/docs/guardrails/guides/hosting-guardrails/monitoring/investigate-event-flood/cloudwatch-log-insights-event.source.png)

Use this query to further identify the specific event name for the source.
## Step 9: External Messages by Event Name

Use the query below to identify the specific `EventName` associated with the service.

```
fields @timestamp, @message
@@ -90,10 +99,9 @@ fields @timestamp, @message
| sort Count desc | limit 5
```

![Specific Event Name](/images/docs/guardrails/guides/hosting-guardrails/monitoring/investigate-event-flood/cloudwatch-log-insights-source-breakdown.png)

## Step 7: Measures To Fix Event Flood
## Step 10: Measures To Fix Event Flood

**Isolate the Noisy Workspace:** As an immediate fix, move the noisy workspace to a separate TE version to prevent performance issues or throttling for neighboring workspaces.

@@ -33,27 +33,27 @@ Under Controls, select **Alerts by Control Type**.

![Alerts by Control Type](/images/docs/guardrails/guides/hosting-guardrails/monitoring/workspace-health-check/guardrails-select-controls-alerts.png)

Filter for **Error** and **Invalid** states.
Select **Invalid** and **Error** from the **State** filter dropdown.

![Apply Filter](/images/docs/guardrails/guides/hosting-guardrails/monitoring/workspace-health-check/guardrails-filter-error-invalid.png)

## Step 3: View Policy Alerts

In **Reports**, under **Policies**, select **Policy Values by State**.
In **Reports**, scroll down to the `Policies` section and select the **Policy Values by State** option.

![Alerts by Policy Values](/images/docs/guardrails/guides/hosting-guardrails/monitoring/workspace-health-check/guardrails-policy-values-by-state.png)

Filter for **Error** and **Invalid** states.
Select **Invalid** and **Error** from the **State** filter dropdown.

![Apply Filter](/images/docs/guardrails/guides/hosting-guardrails/monitoring/workspace-health-check/filter-policy-error-invalid-state.png)

## Step 4: Resolving Errors and Optimizing Controls

Review the controls and errors currently in an error state and take the necessary actions.
*Review the controls and errors* currently in an error state and take the necessary actions.

If the error is due to policy misconfiguration, carefully adjust the settings and apply the changes as required. Ensure that all configurations align with the workspace's needs to resolve the issue effectively.
*If the error is due to policy misconfiguration*, carefully adjust the settings and apply the changes as required. Ensure that all configurations align with the workspace's needs to resolve the issue effectively.

For product-related issues, make sure to document and report them for further investigation.
*For product-related issues*, make sure to document and report them for further investigation.

Additionally, to maintain efficiency, resources or controls that are not a priority should be skipped to reduce noise and wastage.
