Preparing for CUG 2024
alexlovelltroy committed Apr 25, 2024
1 parent fe0c26e commit 82e246f
Showing 18 changed files with 754 additions and 2,814 deletions.
9 changes: 0 additions & 9 deletions config/_default/menus/menus.en.toml
@@ -28,16 +28,7 @@
identifier = "lorem"
url = "/tutorial/lorem/"

[[main]]
name = "Docs"
url = "/docs/"
# url = "/docs/1.0/prologue/introduction/"
weight = 10

[[main]]
name = "Guides"
url = "/guides/"
weight = 30

[[main]]
name = "Blog"
10 changes: 10 additions & 0 deletions content/docs/_index.md
@@ -0,0 +1,10 @@
---
title: "Docs"
description: ""
summary: ""
date: 2023-09-07T16:12:03+02:00
lastmod: 2023-09-07T16:12:03+02:00
draft: false
toc: true
---

38 changes: 0 additions & 38 deletions content/docs/index.md

This file was deleted.

31 changes: 31 additions & 0 deletions content/docs/introduction/introduction.md
@@ -0,0 +1,31 @@
---
title: "Introduction to OpenCHAMI"
description: "What is OpenCHAMI?"
summary: ""
date: 2024-03-07T16:12:03+02:00
lastmod: 2024-03-07T16:12:03+02:00
draft: false
toc: true
---

In 2023, a group of some of the largest HPC sites came together to invest in a new open-source system manager that could bridge the worlds of cloud and HPC. They formed a consortium and established governance to guide its development. At early board meetings, the team settled on a set of core concepts and used them to name the project.

* **Open** - The community is more important than any included software, and all parts of the solution should be freely available for anyone to build an HPC system or customize it to meet the needs of their system or site.
* **Composable** - Rather than a fully integrated system, the goal of OpenCHAMI is to be modular. Not all sites or systems will need the same set of software, and it should be possible to replace software components as needed.
* **Heterogeneous** - Sites with multiple kinds of hardware should be able to manage them all with the same resilient and scalable infrastructure. OpenCHAMI makes no design assumptions that force a single system image, a single architecture, or even a single high-speed interconnect shared across all nodes.
* **Adaptable** - The community values constant evolution of system management software. This is true at the individual system scale where adaptability means resiliency and stability. It is also true as the industry evolves to embrace new technologies and the solution itself needs to adapt.
* **Management Infrastructure** - Describing the solution this way narrows the scope of OpenCHAMI to activities in the management plane of HPC systems. Sysadmins need a stable base upon which they can choose the best operating systems and user interaction methods for their users.



## First steps

A core team at Los Alamos National Laboratory took the lead on software development. Their first goal was to strip much of the Cray System Manager (CSM) back to the bare minimum needed to boot ten nodes. Within a few months, they had established the core repositories and were able to share their progress at SC23 in Denver. The lightweight solution involved just a few microservices and support services and was able to discover and boot ten nodes. [Learn more here...](https://github.com/OpenCHAMI/lanl-demo-sc23)

Following the initial demonstration, additional teams have added resources to the effort, each with their own goals. Work continues on using OpenCHAMI to manage clusters at the US National Laboratories. At the same time, sites are working to manage HPC systems from the public cloud by running OpenCHAMI on GKE (Google Kubernetes Engine). Other sites are collaborating on managing multiple HPC clusters from a single OpenCHAMI installation.

In May of 2024, the team will showcase progress at the International Supercomputing Conference in Germany. Installation times, boot times, and overall scale have improved considerably in the last six months.

Join us as we plan the next steps in our journey!


27 changes: 0 additions & 27 deletions content/docs/reference/open_source.md

This file was deleted.

20 changes: 10 additions & 10 deletions content/docs/software/_index.md
@@ -1,17 +1,17 @@
---
title: "Software"
description: ""
summary: ""
date: 2024-03-07T16:12:03+02:00
lastmod: 2024-03-07T16:12:03+02:00
draft: false
weight: 999
toc: true
seo:
title: "" # custom title (optional)
description: "" # custom description (recommended)
canonical: "" # custom canonical URL (optional)
noindex: false # false (default) or true
weight: 50
toc: false
pinned: false
homepage: false
---

# OpenCHAMI Documentation
{{< callout context="tip" title="Did you know?" icon="rocket" >}}
The [OpenCHAMI quickstart](https://github.com/openchami/deployment-recipes/) has been used to boot over 600 compute nodes in about five minutes, including POST!
{{< /callout >}}



46 changes: 46 additions & 0 deletions content/docs/software/architecture/index.md
@@ -0,0 +1,46 @@
---
title: "Architecture"
date: 2024-03-07T16:12:03+02:00
lastmod: 2024-03-07T16:12:03+02:00
draft: false
weight: 50
toc: false
pinned: false
homepage: false
---


## Philosophy

The OpenCHAMI architectural concepts share a lot with the original UNIX philosophy: tools should do one thing well and provide useful inputs and outputs so they can interoperate with other tools. For tools that interact on the same system, UNIX pipes and plain text remain the core of interoperability. For tools that interact across distributed systems, and potentially across the internet, we add a few more useful constraints:

* REST
* TLS
* JWT
* JSON/YAML

## Early design decisions

In establishing the governance and charter of OpenCHAMI, the board and technical steering committee made a few foundational decisions.

1. All development must be publicly available through the [OpenCHAMI GitHub Organization](https://github.com/OpenCHAMI), including meeting notes and design discussions.
1. The software development effort must start with MIT-licensed microservices from the Cray System Manager (CSM), which was developed for the first Exascale-class supercomputers.
1. Initial development must focus on containerized microservices with REST APIs and cloud-like authentication/authorization.
1. The system must operate well for traditional, highly parallel, shared memory workloads.
1. The system must support new types of workloads that are being developed to support Machine Learning, Model Training, and Inference.
1. The system must support the evolving concept of HPC multitenancy.

## Third Party Services

Most of the components found in a deployment of OpenCHAMI are not part of the OpenCHAMI project. Common open-source software like `dnsmasq` and `haproxy` has much larger developer and user communities and fulfills HPC needs without customization. There are also plenty of resources for teams that would prefer alternatives to each of the recommended third-party applications.

Where OpenCHAMI microservices do need to be specific, we favor, but do not require, a set of common technologies.

## OpenCHAMI services tend to:

* be HTTPS microservices
* run as containers
* be configured at runtime through flags and environment variables
* be based on Go 1.21 and the Wolfi base containers from Chainguard
* leverage bearer tokens for decentralized authentication and authorization
* use go-chi with its robust middleware support
10 changes: 4 additions & 6 deletions content/docs/software/bss/index.md
@@ -8,11 +8,9 @@ draft: false
weight: 800
url: "/docs/software/bss/"
toc: true
seo:
title: "" # custom title (optional)
description: "" # custom description (recommended)
canonical: "" # custom canonical URL (optional)
noindex: false # false (default) or true

---

# Boot Script Service
# Customized Boot Parameters for each Compute Node

Managing the distribution and configuration of operating systems in a heterogeneous HPC environment requires matching specific system images and boot configurations with the compute nodes that need them. BSS leverages OpenCHAMI's detailed inventory system to ensure each node receives the kernel, initrd, and kernel command-line flags it needs, so the whole system can be brought up efficiently. Changes in the inventory are reflected in real time through the boot service.
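To make the node-to-boot-configuration matching concrete, here is a hedged Go sketch that looks up a node's boot parameters by MAC address and renders an iPXE script using iPXE's `kernel`, `initrd`, and `boot` commands. The `BootParams` type, the `bootScript` function, the example URLs, and the in-memory map standing in for the inventory lookup are all hypothetical; this is not BSS's actual data model or API.

```go
package main

import (
	"fmt"
	"strings"
)

// BootParams is a hypothetical per-node boot record.
type BootParams struct {
	Kernel string // URL of the kernel image
	Initrd string // URL of the initrd
	Params string // kernel command-line flags
}

// bootScript renders an iPXE script for the node with the given MAC address.
// byMAC stands in for the inventory lookup a real boot service would perform.
func bootScript(mac string, byMAC map[string]BootParams) (string, error) {
	bp, ok := byMAC[strings.ToLower(mac)]
	if !ok {
		return "", fmt.Errorf("no boot parameters for %s", mac)
	}
	var b strings.Builder
	b.WriteString("#!ipxe\n")
	fmt.Fprintf(&b, "kernel %s %s\n", bp.Kernel, bp.Params)
	fmt.Fprintf(&b, "initrd %s\n", bp.Initrd)
	b.WriteString("boot\n")
	return b.String(), nil
}

func main() {
	inventory := map[string]BootParams{
		"de:ad:be:ef:00:01": {
			Kernel: "http://10.0.0.1/images/vmlinuz",
			Initrd: "http://10.0.0.1/images/initrd.img",
			Params: "console=ttyS0,115200 ip=dhcp",
		},
	}
	script, err := bootScript("DE:AD:BE:EF:00:01", inventory)
	if err != nil {
		panic(err)
	}
	fmt.Print(script)
}
```

Because the script is rendered per request from current inventory data, an inventory change takes effect the next time the node asks for its script.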
17 changes: 0 additions & 17 deletions content/docs/software/bss/introduction.md

This file was deleted.

13 changes: 13 additions & 0 deletions content/docs/software/cloud-init/index.md
@@ -0,0 +1,13 @@
---
title: "Cloud-Init: Standard node personalization for the cloud"
description: ""
summary: ""
date: 2024-03-21T00:00:00+00:00
lastmod: 2024-03-21T00:00:00+00:00
draft: false
weight: 800
toc: true
---


Cloud-Init is a de facto standard in the cloud world. When you launch a virtual machine through a cloud provider, you can specify simple identity information as well as scripts that are executed as part of the instance's initialization. In many cases, this is the only post-boot configuration required. OpenCHAMI uses it for the same reason: our cloud-init server provides identity and configuration information, derived from the contents of SMD, in a form suitable for the cloud-init client included with most Linux distributions.
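As an illustration of turning inventory data into cloud-init input, this Go sketch renders NoCloud-style `meta-data` (`instance-id`, `local-hostname`) and a minimal `#cloud-config` user-data document using the standard `hostname` and `packages` keys. The `Node` type, the idea of selecting packages by group, and the example IDs are assumptions for illustration, not the OpenCHAMI cloud-init server's actual schema.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Node is a hypothetical slice of inventory data a cloud-init server might use.
type Node struct {
	ID       string // e.g. an xname or UUID
	Hostname string
	Groups   []string
}

// metaData renders NoCloud-style meta-data as YAML.
func metaData(n Node) string {
	return fmt.Sprintf("instance-id: %s\nlocal-hostname: %s\n", n.ID, n.Hostname)
}

// userData renders a minimal #cloud-config document; packages come from the
// node's groups (a hypothetical mapping), deduplicated and sorted.
func userData(n Node, pkgsByGroup map[string][]string) string {
	seen := map[string]bool{}
	var pkgs []string
	for _, g := range n.Groups {
		for _, p := range pkgsByGroup[g] {
			if !seen[p] {
				seen[p] = true
				pkgs = append(pkgs, p)
			}
		}
	}
	sort.Strings(pkgs)
	var b strings.Builder
	b.WriteString("#cloud-config\n")
	fmt.Fprintf(&b, "hostname: %s\n", n.Hostname)
	if len(pkgs) > 0 {
		b.WriteString("packages:\n")
		for _, p := range pkgs {
			fmt.Fprintf(&b, "  - %s\n", p)
		}
	}
	return b.String()
}

func main() {
	n := Node{ID: "x1000c0s0b0n0", Hostname: "nid001", Groups: []string{"compute"}}
	fmt.Print(metaData(n))
	fmt.Print(userData(n, map[string][]string{"compute": {"munge", "slurm"}}))
}
```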
3 changes: 2 additions & 1 deletion content/docs/software/magellan/index.md
@@ -1,13 +1,14 @@
---
title: "Magellan: Redfish-based inventory discovery and management"
slug: "magellan"
description: ""
summary: ""
date: 2024-03-21T00:00:00+00:00
lastmod: 2024-03-21T00:00:00+00:00
draft: false
weight: 800
toc: true
url: "/docs/software/magellan/"

seo:
title: "" # custom title (optional)
description: "" # custom description (recommended)
10 changes: 3 additions & 7 deletions content/docs/software/smd/index.md
@@ -7,12 +7,8 @@ lastmod: 2024-03-21T00:00:00+00:00
draft: false
weight: 800
toc: true
url: "/docs/software/smd/"
seo:
title: "" # custom title (optional)
description: "" # custom description (recommended)
canonical: "" # custom canonical URL (optional)
noindex: false # false (default) or true
---

# State Management Daemon

The OpenCHAMI inventory database is a customized version of the State Management Database (SMD) from the Cray System Manager. It manages inventory information about the compute nodes and makes it accessible through an HTTP API that other microservices reference in the course of their work. While it generally serves data from memory, it uses Postgres for persistent storage.
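The split between serving from memory and persisting to Postgres can be sketched as a write-through cache. In this hypothetical Go example, `Store` abstracts persistent storage and `memStore` is an in-memory stand-in for Postgres so the sketch runs without a database; none of these types come from SMD's code.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Component is a hypothetical, heavily simplified inventory record.
type Component struct {
	ID    string
	State string
}

// Store abstracts persistent storage (Postgres, in SMD's case).
type Store interface {
	Save(c Component) error
	LoadAll() ([]Component, error)
}

// Inventory serves reads from memory and writes through to the Store.
type Inventory struct {
	mu    sync.RWMutex
	cache map[string]Component
	store Store
}

// NewInventory warms the in-memory cache from persistent storage at startup.
func NewInventory(s Store) (*Inventory, error) {
	all, err := s.LoadAll()
	if err != nil {
		return nil, err
	}
	inv := &Inventory{cache: make(map[string]Component, len(all)), store: s}
	for _, c := range all {
		inv.cache[c.ID] = c
	}
	return inv, nil
}

// Get answers from memory without touching the database.
func (inv *Inventory) Get(id string) (Component, bool) {
	inv.mu.RLock()
	defer inv.mu.RUnlock()
	c, ok := inv.cache[id]
	return c, ok
}

// Put persists first, then updates the cache, so memory never runs ahead of storage.
// Holding the lock across Save keeps this simple but serializes writes.
func (inv *Inventory) Put(c Component) error {
	if c.ID == "" {
		return errors.New("component ID required")
	}
	inv.mu.Lock()
	defer inv.mu.Unlock()
	if err := inv.store.Save(c); err != nil {
		return err
	}
	inv.cache[c.ID] = c
	return nil
}

// memStore is an in-memory stand-in for Postgres.
type memStore struct{ rows map[string]Component }

func (m *memStore) Save(c Component) error { m.rows[c.ID] = c; return nil }

func (m *memStore) LoadAll() ([]Component, error) {
	out := make([]Component, 0, len(m.rows))
	for _, c := range m.rows {
		out = append(out, c)
	}
	return out, nil
}

func main() {
	store := &memStore{rows: map[string]Component{"x1000c0s0b0n0": {ID: "x1000c0s0b0n0", State: "Ready"}}}
	inv, err := NewInventory(store)
	if err != nil {
		panic(err)
	}
	if err := inv.Put(Component{ID: "x1000c0s0b0n1", State: "On"}); err != nil {
		panic(err)
	}
	c, ok := inv.Get("x1000c0s0b0n1")
	fmt.Println(c.State, ok, len(store.rows)) // prints "On true 2"
}
```

Writing to storage before updating the cache means a failed database write never leaves memory holding data that a restart would not reload.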
14 changes: 14 additions & 0 deletions content/guides/deploy_openhpc.md
@@ -0,0 +1,14 @@
title: "Deploy OpenHPC"
description: ""
summary: "Deploying Alma Linux with OpenHPC on the compute nodes"
date: 2023-09-07T16:04:48+02:00
lastmod: 2023-09-07T16:04:48+02:00
draft: false
weight: 810
toc: true
seo:
title: "" # custom title (optional)
description: "" # custom description (recommended)
canonical: "" # custom canonical URL (optional)
noindex: false # false (default) or true
---