1 HPC Architecture

Zack Ramjan edited this page Oct 14, 2024 · 4 revisions

High-Level Overview of the Architecture

  • What's the point of a High Performance Computing cluster?

An analogy:

  • Let's compare it to a large hotel.
  • What if there were no reservation system and no front desk, and everyone just walked to a room? Chaos.

To continue with our analogy:

  • headnode/submit nodes == the hotel front desk or online reservation site
  • compute nodes == the hotel rooms
  • submitting a job == asking for a room
    • How many days? How many beds? == How long will the job run? How many cores does it need?
  • Showing up at the front desk and getting your room key == your job is ready to run
    • Maybe the hotel was busy and you had to wait for the room to be cleaned == maybe you had to wait for resources to free up
  • Checking out, turning in your keys, and leaving the hotel == your job has finished and you no longer have access to the compute node
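The analogy can be sketched as a toy first-come, first-served scheduler. This is only an illustration of the idea of waiting for resources, not how the real scheduler decides things; the `Job` class, the `simulate` function, and the time units are all hypothetical.

```python
from dataclasses import dataclass
import heapq


@dataclass
class Job:
    name: str
    cores: int    # "how many beds?"
    runtime: int  # "how many days?" (arbitrary time units)


def simulate(jobs, total_cores):
    """Toy first-come, first-served scheduler.

    Returns a dict mapping job name -> (start_time, end_time).
    """
    for job in jobs:
        if job.cores > total_cores:
            raise ValueError(f"{job.name} asks for more cores than exist")

    clock = 0
    free = total_cores
    running = []   # min-heap of (end_time, cores) for jobs holding cores
    schedule = {}

    for job in jobs:
        # Like waiting for a room to be cleaned: wait for running jobs
        # to "check out" until enough cores are free.
        while free < job.cores:
            end_time, cores = heapq.heappop(running)
            clock = max(clock, end_time)
            free += cores
        # The job gets its "room key" and starts running.
        schedule[job.name] = (clock, clock + job.runtime)
        heapq.heappush(running, (clock + job.runtime, job.cores))
        free -= job.cores

    return schedule


if __name__ == "__main__":
    jobs = [Job("a", 4, 10), Job("b", 4, 5), Job("c", 4, 3)]
    for name, (start, end) in simulate(jobs, total_cores=8).items():
        print(f"job {name}: starts at {start}, ends at {end}")
```

With 8 cores, jobs a and b start immediately, while job c has to wait until b finishes and frees its cores.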

Let's review the parts of an HPC cluster.

[Figure: cluster]

Headnodes / submit nodes

  • This is a node that you log into. In our hotel analogy above, it would be the lobby.
  • For shell or command-line access, this would be submit001, submit002, or submit003
  • For web-based access using HPC Ondemand, you would log into https://ondemand.hpc.vai.org

Shell access represents the traditional approach to HPC and has been the status quo for the last 30+ years.

Within the last 4 years or so, however, the heavily funded Ondemand Project has made web-based access very popular, and it now seems to be available at most HPC centers.

Web access is much easier and likely more comfortable for people without Linux or HPC experience. It can do most basic things. The shell, on the other hand, is more powerful, so if your needs go beyond Ondemand, you will need to use the shell.

Worker nodes

http://hpc-dashboard.aws.vai.org/nodes.html

The various types of compute nodes (partitions) and their limits:

  • quick - 8 cores per node, 48 hours max job time, 4 jobs per user, 2 nodes at a time total
  • short - 16 cores per node, 7 days max job time, 100 jobs per user, 4 nodes at a time total
  • long - 64 cores per node, 14 days max job time, 100 jobs per user, 4 nodes at a time total
  • bigmem - 128 cores per node, 14 days max job time, 4 jobs per user, 2 nodes at a time total
  • gpu - 14 days max job time, 1 job per user, 1 node per user

There are also private partitions, which may be used only by the groups that own them.
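As a rough illustration of how these limits narrow down your choices, the sketch below stores the public partition limits listed above in a dictionary and reports which partitions could hold a single-node job of a given size. The `PARTITIONS` table and the `partitions_for` helper are hypothetical names for this example; the real limits are enforced by the scheduler and may change, and private partitions are not included.

```python
HOURS_PER_DAY = 24

# Limits copied from the partition list above.
# cores_per_node is None for gpu because the list does not state it.
PARTITIONS = {
    "quick":  {"cores_per_node": 8,    "max_hours": 48,                 "max_jobs": 4,   "max_nodes": 2},
    "short":  {"cores_per_node": 16,   "max_hours": 7 * HOURS_PER_DAY,  "max_jobs": 100, "max_nodes": 4},
    "long":   {"cores_per_node": 64,   "max_hours": 14 * HOURS_PER_DAY, "max_jobs": 100, "max_nodes": 4},
    "bigmem": {"cores_per_node": 128,  "max_hours": 14 * HOURS_PER_DAY, "max_jobs": 4,   "max_nodes": 2},
    "gpu":    {"cores_per_node": None, "max_hours": 14 * HOURS_PER_DAY, "max_jobs": 1,   "max_nodes": 1},
}


def partitions_for(cores, hours):
    """Return the partitions where a single-node job of this size fits."""
    matches = []
    for name, limits in PARTITIONS.items():
        cores_per_node = limits["cores_per_node"]
        if cores_per_node is None:
            # Core count unknown (gpu), so we cannot judge the fit here.
            continue
        if cores <= cores_per_node and hours <= limits["max_hours"]:
            matches.append(name)
    return matches


if __name__ == "__main__":
    print(partitions_for(cores=8, hours=24))
    print(partitions_for(cores=32, hours=200))
```

For example, a job needing 32 cores for 200 hours is too large for quick and short, so only long and bigmem are returned.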

Job Scheduler

The system we use at VAI is the Slurm Job Scheduler, with HPC Ondemand as its web app interface.
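To make "asking for a room" concrete, here is a minimal sketch that builds the text of a Slurm batch script from a partition, a core count, and a time limit. The `render_batch_script` helper, the job name, and the command are hypothetical; the `#SBATCH` options used (`--job-name`, `--partition`, `--nodes`, `--cpus-per-task`, `--time`) are standard Slurm options, but check the local documentation for what your job actually needs.

```python
def render_batch_script(job_name, partition, cores, hours, command):
    """Return the text of a simple single-node Slurm batch script."""
    # Slurm accepts time limits in days-hours:minutes:seconds form.
    days, rem_hours = divmod(hours, 24)
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",      # which kind of "room"
        "#SBATCH --nodes=1",
        f"#SBATCH --cpus-per-task={cores}",      # "how many beds?"
        f"#SBATCH --time={days}-{rem_hours:02d}:00:00",  # "how many days?"
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical job: 4 cores for 36 hours on the short partition.
    print(render_batch_script("demo", "short", 4, 36, "echo hello"))
```

Saving the output to a file and passing it to `sbatch` from a submit node would place the job in the queue.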