Meeting on public dashboard (2024-05-03)
Kenneth Hoste edited this page Aug 9, 2024
- attending:
- SURF: Casper, Satish, Maksim, Paul (excused: Caspar)
- UGent: Kenneth, Lara
- RUG: Bob
- UiB: Richard
MultiXscale Task 5.2 description:
Task 5.2 - Monitoring and testing of the central shared software stack (28PM)
Leader: SURF (12PM); Partners: UGent (3PM), UiB (4PM), RUG (5PM), UB (4PM)
We will actively monitor and test the central shared software stack to ensure that software applications installed in it are working as intended, in terms of functionality, correctness of produced results, and performance. This will be done across the range of supported platforms, guided by the support levels defined in Task 5.4. For this task, we will employ the test suite that is developed in Task 1.3, and set up performance monitoring of selected applications over time, so we can identify performance regressions early on. The underlying cause of problems that are identified through monitoring and testing will be identified and mitigated, where needed in collaboration with WP1. We will develop a public dashboard to present the current support status of the central software stack on current system architectures.
- (Maksim) what should be visualised?
- status of tests, time series (performance data)
- for which systems/architectures
- systems are SURF, AWS, Vega, Karolina
- (Paul) how are test runs triggered?
- currently via cron (daily)
- could also be event-based (change in software stack, fix being implemented, manual trigger, ...)
- (Maksim) how long does data need to be retained?
- ideally indefinitely
- amount of data being collected is pretty limited
- currently dozens of tests, on a handful of systems
- could grow to 1000s of tests on dozens of systems
- hosting of public dashboard
- ideally using open source software
- could be hosted on AWS, should be easy to move somewhere else
- ELK stack (https://www.elastic.co/elastic-stack)
- should work well with ReFrame (via JSON sent to ES)
- ~50 lines of Python to ingest data coming from ReFrame
- maybe Grafana could be interesting too
- can talk directly to ES
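The ~50-line ingest script mentioned above could start from a sketch like the one below. Note that the report layout assumed here (a `runs` list holding `testcases`) is a simplified stand-in, not ReFrame's exact run-report schema, and the index name `eessi-tests` is made up; the Elasticsearch call itself is shown only in comments, since it needs the `elasticsearch` package and a running cluster.

```python
import json

def report_to_docs(report: dict) -> list[dict]:
    """Flatten a ReFrame-style JSON run report into one document per test case.

    The 'runs' / 'testcases' structure is an assumed simplification of
    ReFrame's run report, for illustration only.
    """
    docs = []
    for run in report.get("runs", []):
        for tc in run.get("testcases", []):
            docs.append({
                "test": tc.get("name"),
                "system": tc.get("system"),
                "result": tc.get("result"),
                "time_total": tc.get("time_total"),
            })
    return docs

# Sending the documents to Elasticsearch could then look like
# (placeholder index name and endpoint):
#
#   from elasticsearch import Elasticsearch, helpers
#   es = Elasticsearch("https://localhost:9200")
#   helpers.bulk(es, ({"_index": "eessi-tests", "_source": d}
#                     for d in report_to_docs(report)))

if __name__ == "__main__":
    sample = {"runs": [{"testcases": [
        {"name": "GROMACS_check", "system": "snellius",
         "result": "pass", "time_total": 123.4}]}]}
    print(json.dumps(report_to_docs(sample)))
```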
- data coming in through REST API?
- with some form of authentication (token-based?)
- from personal accounts on a range of HPC systems
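Uploading from personal accounts on HPC systems could be as simple as an authenticated POST; a stdlib-only sketch follows. The endpoint URL and bearer-token scheme are placeholders, since the meeting only settled on "some form of authentication (token-based?)".

```python
import json
import urllib.request

def build_upload_request(url: str, token: str, doc: dict) -> urllib.request.Request:
    """Prepare an authenticated POST carrying one test result as JSON.

    URL and token scheme are hypothetical; adjust to whatever the
    dashboard's REST API ends up requiring.
    """
    data = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )

req = build_upload_request(
    "https://dashboard.example.org/api/results",  # hypothetical endpoint
    "s3cret-token",
    {"test": "GROMACS_check", "system": "snellius", "result": "pass"},
)
# urllib.request.urlopen(req) would perform the actual upload.
print(req.get_method(), req.get_header("Authorization"))
```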
- step-wise plan
- some initial setup with data coming in from Snellius?
- iterative process
- also upload raw data to an S3 bucket in AWS as a backup
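The S3 backup could use date-partitioned object keys so raw reports stay easy to locate; a minimal sketch, in which the bucket name and key layout are invented for illustration, and the actual upload (via `boto3`, not a stdlib dependency) is shown only in comments:

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

def backup_key(system, report_path, now=None):
    """Derive a date-partitioned S3 object key for a raw test report.

    The raw/<system>/<year>/<month>/<day>/ layout is an assumption,
    not an agreed convention.
    """
    now = now or datetime.now(timezone.utc)
    name = PurePosixPath(report_path).name
    return f"raw/{system}/{now:%Y/%m/%d}/{name}"

# The actual upload could then be done with boto3
# (hypothetical bucket name):
#
#   import boto3
#   s3 = boto3.client("s3")
#   s3.upload_file("report.json", "eessi-dashboard-backup",
#                  backup_key("snellius", "report.json"))

print(backup_key("snellius", "/tmp/report.json",
                 datetime(2024, 5, 3, tzinfo=timezone.utc)))
```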
- frequent meetings?
- first see how the first sprint goes
- another sync meeting mid June?
- channel in EESSI Slack
- in #testing channel?
- join via Slack link at https://www.eessi.io