Metrics update with examples (#116)
This updates the documentation on metrics and adds examples on how to fetch metrics.

New code example is tested using pytest.
hoh authored Oct 2, 2024
1 parent a6194fb commit b2d515b
Showing 2 changed files with 105 additions and 17 deletions.
118 changes: 101 additions & 17 deletions docs/nodes/reliability/metrics.md
@@ -1,27 +1,56 @@
# Metrics

A program regularly measures the status and performance of the nodes, and publishes this data as POST messages on the network with the type `aleph-scoring-metrics`.
Metrics are measurements of the performance and reliability of the nodes.

Every hour, a program measures the status and performance of the nodes and publishes this data as messages on the aleph.im network.

This program sends multiple HTTP requests to each node in order to evaluate how well it behaves.

The measurement program is part of the open-source [aleph-scoring](https://github.com/aleph-im/aleph-scoring/) project.

## Method

The metrics program is deployed on a collection of servers on different continents in order to reduce geographical bias.

Every hour, the measurement program creates a random plan of when to connect to each node for measurements over the following hour. It then follows this plan, connecting to every node in the network over that hour.
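
The hourly plan described above can be pictured with a minimal sketch. This is not the aleph-scoring implementation: the function name `make_plan`, the example node URLs, and the choice of a uniform random offset per node are assumptions made purely for illustration.

```python
import random

PERIOD_SECONDS = 3600.0  # one hour, as described above


def make_plan(node_urls, period_seconds=PERIOD_SECONDS, rng=None):
    """Give every node one random offset within the period (hypothetical helper).

    Returns a list of (offset_in_seconds, node_url) sorted by offset,
    so that the plan can be followed in chronological order.
    """
    rng = rng or random.Random()
    plan = [(rng.uniform(0.0, period_seconds), url) for url in node_urls]
    return sorted(plan)


# Example node URLs are placeholders, not real nodes.
nodes = [
    "https://node-a.example",
    "https://node-b.example",
    "https://node-c.example",
]
plan = make_plan(nodes, rng=random.Random(42))
for offset, url in plan:
    print(f"{offset:8.1f}s  {url}")
```

Randomizing the offsets means a node cannot predict when it will be measured within the hour, while every node is still visited once per period.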

The program connects to each node using a few different methods and measures the time taken to obtain a response for each measurement (latency).

- HTTP or HTTPS
- IPv4 and IPv6
- Ping (ICMP) requests

All durations are expressed in seconds (floating-point numbers).

When a test fails, the corresponding field is not included in the results.
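
Because a failed test leaves its field out of the results, code that reads metrics should treat every latency field as optional. The sketch below illustrates this on a hand-written record; the helper name `latency_or_none` and the sample values are assumptions, not part of any aleph.im library.

```python
import json
from typing import Optional

# Hand-written sample record: the diagnostic VM test is assumed to have
# failed here, so "diagnostic_vm_latency" is simply absent.
raw = """
{
  "measured_at": 1680715253.669524,
  "base_latency": 0.96,
  "full_check_latency": 0.53
}
"""
metrics = json.loads(raw)


def latency_or_none(record: dict, field: str) -> Optional[float]:
    """Return the latency in seconds, or None if the test failed (hypothetical helper)."""
    value = record.get(field)
    return float(value) if value is not None else None


base = latency_or_none(metrics, "base_latency")
vm_latency = latency_or_none(metrics, "diagnostic_vm_latency")
print("base latency:", base)
print("diagnostic VM latency:", vm_latency)
```

Using `dict.get` rather than indexing avoids a `KeyError` on records where a measurement is missing.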

At the end of the hour, the program publishes the results in JSON in the form of a [POST](../../protocol/object-types/posts.md) message with the type `aleph-scoring-metrics`.

Production metrics are signed by the address `0x4D52380D3191274a04846c89c069E6C3F2Ed94e4`.

> 🔗 See [the metrics messages on the explorer](https://explorer.aleph.im/messages?showAdvancedFilters=1&channels=aleph-scoring&page=1&sender=0x4D52380D3191274a04846c89c069E6C3F2Ed94e4)
## Common metrics

Some metrics are common to all node types:

1. **Software version**: We compare the version of the node to the latest version available. Node operators have a grace period to update their node to the latest release.
2. **Autonomous System Number** (ASN): Gives a rough estimate of where the server is located. This helps us score the decentralization of the nodes.
1. **Software version** (`version`): We compare the version of the node to the latest version available. Node operators have a grace period to update their node to the latest release.
2. **Autonomous System Number** (`asn`): Gives a rough estimate of where the server is located. This helps us score the decentralization of the nodes. The `as_name` field contains the name of the autonomous system.

## Metrics for Core Channel Nodes

1. **Base latency**: The base latency to respond to a request, measured by calling `/api/v0/info/public.json` (no processing on that page).
2. **Metrics latency**: The latency to fetch public node metrics, measured by calling `/metrics.json`
1. **Base latency** (`base_latency`): The time to respond to a simple request, measured by calling `/api/v0/info/public.json` (no processing on that page).
2. **Metrics latency** (`metrics_latency`): The time to fetch public node metrics, measured by calling `/metrics.json`
[//]: # (3. The following variables from the metrics.json response:)
[//]: # ( a. `pyaleph_status_sync_pending_txs_total`)
[//]: # ( b. `pyaleph_status_sync_pending_messages_total`)
[//]: # ( c. `pyaleph_status_chain_eth_height_remaining_total`)
3. **Aggregate latency**: The latency to fetch a large aggregate, measured by calling `/api/v0/aggregates/0xa1B3bb7d2332383D96b7796B908fB7f7F3c2Be10.json?keys=corechannel&limit=50`.
4. **File download latency**: The latency to fetch a 6.7 kB file, measured by calling `/api/v0/storage/raw/50645d4ccfddb7540e7bb17ffa5609ec8a980e588e233f0e2c4451f6f9da6ebd`

3. **Aggregate latency** (`aggregate_latency`): The time to fetch a large aggregate, measured by calling `/api/v0/aggregates/0xa1B3bb7d2332383D96b7796B908fB7f7F3c2Be10.json?keys=corechannel&limit=50`. Accesses the database.
4. **File download latency** (`file_download_latency`): The time to fetch a 6.7 kB file, measured by calling `/api/v0/storage/raw/50645d4ccfddb7540e7bb17ffa5609ec8a980e588e233f0e2c4451f6f9da6ebd`. Accesses the storage.
5. **Pending messages** (`pending_messages`): The number of messages in queue to be processed. Should be low except on new nodes still syncing.
6. **Pending transactions** (`txs_total`): The number of archives to be fetched from IPFS and processed. Should be very low except on new nodes still syncing.
7. **Ethereum height remaining** (`eth_height_remaining`): Number of [blocks](https://ethereum.org/en/developers/docs/blocks/) available on Ethereum that are newer than the newest archive processed.

Metrics are only valid if the HTTP response code is a success.
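
The rule above (a latency only counts when the HTTP response is a success) can be sketched as follows. This is an illustration under assumptions, not the measurement program's code: `measure_latency` and `fetch_public_info` are hypothetical names, and "success" is interpreted here as a 2xx status code.

```python
import time
import urllib.request
from typing import Callable, Optional


def measure_latency(fetch: Callable[[], int]) -> Optional[float]:
    """Time one request; return seconds on success, None otherwise (hypothetical helper).

    `fetch` performs the request and returns the HTTP status code.
    """
    start = time.monotonic()
    try:
        status = fetch()
    except OSError:
        # Connection errors, timeouts and urllib's HTTPError are all OSError subclasses.
        return None
    elapsed = time.monotonic() - start
    # Assumption: "success" means a 2xx status code.
    return elapsed if 200 <= status < 300 else None


def fetch_public_info(base_url: str) -> int:
    """Example fetcher for the base latency endpoint (not called in this sketch)."""
    with urllib.request.urlopen(base_url + "/api/v0/info/public.json", timeout=10) as resp:
        return resp.status


def failing() -> int:
    """Stand-in for an unreachable node."""
    raise ConnectionError("node unreachable")


# Offline demonstration with stand-in fetchers instead of real nodes.
print(measure_latency(lambda: 200))  # a small float
print(measure_latency(lambda: 503))  # None: not a success
print(measure_latency(failing))      # None: request failed
```

Passing the request as a callable keeps the timing logic testable without network access.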

The metrics for a CCN have the following form:
@@ -45,11 +74,15 @@ The metrics for a CCN have the following form:

## Metrics for Compute Resource Nodes

1. **Base latency**: The base latency to respond to a request, measured by calling `/about/login`. Should return HTTP code `401 Unauthorized`.
2. **Diagnostic VM latency**: The latency to call a common user program, measured by calling `/vm/67705389842a0a1b95eaa408b009741027964edc805997475e95c505d642edd8`
3. **Full check latency**: The latency to run a collection of checks on the node, measured by calling `/status/check/fastapi`.
All measurements for Compute Resource Nodes are done in [IPv6](https://en.wikipedia.org/wiki/IPv6).

1. **Base latency** (`base_latency`): The time to respond to a simple request, measured by calling `/about/login` (no processing on that endpoint). Should return HTTP code `401 Unauthorized`.
2. **Diagnostic VM latency** (`diagnostic_vm_latency`): The time to call a common user program and get a response, measured by calling `/vm/67705389842a0a1b95eaa408b009741027964edc805997475e95c505d642edd8`
3. **Full check latency** (`full_check_latency`): The time to run a collection of checks on the node and get a response, measured by calling `/status/check/fastapi`.
4. **Diagnostic VM Ping latency** (`diagnostic_vm_ping_latency`): The time returned by an [ICMP Ping](https://en.wikipedia.org/wiki/Ping_(networking_utility)) to the diagnostic virtual machine running on the node. This metric is only present if the VM is available via IPv6 (VM Egress IPv6).
5. **Base latency IPv4** (`base_latency_ipv4`): Same as `base_latency` above, but measured over IPv4 instead of IPv6.

The metrics for a CRN have the following form
The metrics for a CRN have the following form:
```json
{
"measured_at":1680715253.669524,
@@ -59,13 +92,64 @@ The metrics for a CRN have the following form
"as_name":"INTERNET-SERVICE-PROVIDER, AD",
"base_latency":0.9623174667358398,
"diagnostic_vm_latency":0.06729602813720703,
"full_check_latency":0.5257446765899658
"full_check_latency":0.5257446765899658,
"diagnostic_vm_ping_latency": 0.148196
}
```

## Publishing
## Analyzing

The [scores](../../nodes/reliability/scores.md) are computed based on the metrics, in a reproducible manner.

Metrics messages can be found:

### On the Message Explorer

[https://explorer.aleph.im/messages?showAdvancedFilters=1&channels=aleph-scoring&sender=0x4D52380D3191274a04846c89c069E6C3F2Ed94e4](https://explorer.aleph.im/messages?showAdvancedFilters=1&channels=aleph-scoring&sender=0x4D52380D3191274a04846c89c069E6C3F2Ed94e4)

### Using the Python SDK

The [Python SDK](../../../libraries/python-sdk/posts/query/) provides helpers to fetch the relevant messages.
```python
import asyncio
from datetime import UTC, datetime, timedelta

from aleph_message.models import PostMessage

from aleph.sdk.client import AlephHttpClient
from aleph.sdk.query.filters import PostFilter


async def get_metrics():
    async with AlephHttpClient() as client:
        response = await client.get_posts(
            post_filter=PostFilter(
                types=["aleph-network-metrics"],
                addresses=["0x4D52380D3191274a04846c89c069E6C3F2Ed94e4"],
                channels=["aleph-scoring"],
                start_date=datetime.now(tz=UTC) - timedelta(hours=4),
                end_date=datetime.now(tz=UTC),
            )
        )
        return response.posts


messages = asyncio.run(get_metrics())
message: PostMessage
for message in messages:
    print(message.item_hash)
```

### Using the HTTP API

```shell
curl "https://official.aleph.cloud/api/v0/messages.json?"\
"addresses=0x4D52380D3191274a04846c89c069E6C3F2Ed94e4&"\
"channels=aleph-scoring&"\
"startDate=1727775567&"\
"endDate=1727861984"
```
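
The `startDate` and `endDate` parameters in the request above are Unix timestamps in seconds. As a hedged sketch, the same query URL can be built in Python for an arbitrary time window; the helper name `metrics_query_url` and the 24-hour default are assumptions, and the code only builds the URL without sending a request.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

API_URL = "https://official.aleph.cloud/api/v0/messages.json"
METRICS_SENDER = "0x4D52380D3191274a04846c89c069E6C3F2Ed94e4"


def metrics_query_url(end: datetime, hours: int = 24) -> str:
    """Build the query URL for metrics messages in the window ending at `end`.

    Hypothetical helper: the window length defaults to 24 hours.
    """
    start = end - timedelta(hours=hours)
    params = {
        "addresses": METRICS_SENDER,
        "channels": "aleph-scoring",
        "startDate": int(start.timestamp()),
        "endDate": int(end.timestamp()),
    }
    return API_URL + "?" + urlencode(params)


# Reuse the end timestamp from the curl example above.
end = datetime.fromtimestamp(1727861984, tz=timezone.utc)
url = metrics_query_url(end)
print(url)
```

Using timezone-aware `datetime` objects avoids an accidental shift by the local UTC offset when converting to timestamps.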

Metrics are published as a POST message on aleph.im, with the type `aleph-scoring-metrics`.
### Using the node metrics API

You can [find the metrics on the aleph.im Explorer](
https://explorer.aleph.im/address/ETH/0x4D52380D3191274a04846c89c069E6C3F2Ed94e4).
The [node metrics API](https://docs.aleph.im/nodes/reliability/monitoring/#node-metrics) provides a convenient way to obtain the last two weeks of metrics for a specific node instead of extracting the data from the metrics messages.
4 changes: 4 additions & 0 deletions test/python_test.py
@@ -9,3 +9,7 @@
@pytest.mark.parametrize("fpath", PYTHON_CODE_DIRECTORY.glob("**/*.md"), ids=str)
def test_run_python_code(fpath):
check_md_file(fpath=fpath)


def test_run_python_code_metrics():
check_md_file(fpath=Path(__file__).parent / "../docs/nodes/reliability/metrics.md")
