Releases: aws/aws-parallelcluster
AWS ParallelCluster v3.7.1
We're excited to announce the release of AWS ParallelCluster 3.7.1
Upgrade
How to upgrade?
sudo pip install --upgrade aws-parallelcluster
CHANGES
- Upgrade Slurm to 23.02.5 (from 23.02.4).
- Upgrade Pmix to 4.2.6 (from 3.2.3).
- Upgrade libjwt to 1.15.3 (from 1.12.0).
- Upgrade EFA installer to 1.26.1, fixing RDMA writedata issue in P5.
  - Efa-driver: efa-2.5.0-1
  - Efa-config: efa-config-1.15-1
  - Efa-profile: efa-profile-1.5-1
  - Libfabric-aws: libfabric-aws-1.18.2-1
  - Rdma-core: rdma-core-46.0-1
  - Open MPI: openmpi40-aws-4.1.5-4
AWS ParallelCluster v3.7.0
We're excited to announce the release of AWS ParallelCluster 3.7.0
Upgrade
How to upgrade?
sudo pip install --upgrade aws-parallelcluster
ENHANCEMENTS
- Add support for Ubuntu 22. RSA keys are not supported by default. See this page.
- Add support for login nodes.
- Add support to mount existing Amazon File Cache as shared storage.
- Allow configuration of static and dynamic node priorities in Slurm compute resources via the ParallelCluster configuration YAML file.
- Add a queue-level parameter (JobExclusiveAllocation) to ensure nodes in the partition are exclusively allocated to a single job at any given time.
- Allow overriding the aws-parallelcluster-node package at cluster creation and update time (only on the head node during update). Useful for development purposes only.
- Allow memory-based scheduling when multiple instance types are specified for a Slurm Compute Resource.
- Avoid starting the NFS server on compute nodes.
CHANGES
- Deprecate Ubuntu 18.
- Upgrade Slurm to version 23.02.4.
- Update the default root volume size to 40 GB to account for limits on CentOS 7.
- Upgrade NVIDIA driver to version 535.54.03.
- Upgrade CUDA library to version 12.2.0.
- Upgrade NVIDIA Fabric Manager to nvidia-fabricmanager-535.
- Upgrade NICE DCV to version 2023.0-15487.
  - server: 2023.0.15487-1
  - xdcv: 2023.0.551-1
  - gl: 2023.0.1039-1
  - web_viewer: 2023.0.15487-1
- Upgrade EFA installer to 1.25.1.
  - Efa-driver: efa-2.5.0-1
  - Efa-config: efa-config-1.15-1
  - Efa-profile: efa-profile-1.5-1
  - Libfabric-aws: libfabric-aws-1.18.1-1
  - Rdma-core: rdma-core-46.0-1
  - Open MPI: openmpi40-aws-4.1.5-4
- Upgrade ARM PL to version 23.04.1 for Ubuntu 22.04 only.
- Assign Slurm dynamic nodes a priority (weight) of 1000 by default. This allows Slurm to prioritize idle static nodes over idle dynamic ones.
- Change the default value of Imds/ImdsSupport from v1.0 to v2.0.
- Make aws-parallelcluster-node daemons handle only ParallelCluster-managed Slurm partitions.
- Create a Slurm partition-nodelist mapping JSON file to be used by the node package daemons to recognize PC-managed Slurm partitions and nodelists.
- Increase EFS-utils watchdog poll interval to 10 seconds. Note: This change is meaningful only if EncryptionInTransit is set to true, because watchdog does not run otherwise.
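The default weight of 1000 for dynamic nodes matters because Slurm, all else being equal, allocates the nodes with the lowest weight first. The Python sketch below only illustrates that ordering; it is not ParallelCluster or Slurm code. The node names and the static-node weight of 1 are assumptions made for the example.

```python
# Illustrative only: models Slurm's "lowest weight first" preference.
# Node names and STATIC_WEIGHT are hypothetical; DYNAMIC_WEIGHT is the
# default stated in the 3.7.0 notes.
STATIC_WEIGHT = 1
DYNAMIC_WEIGHT = 1000


def allocation_order(idle_nodes):
    """Return idle node names ordered by ascending weight, ties broken by name."""
    ordered = sorted(idle_nodes.items(), key=lambda item: (item[1], item[0]))
    return [name for name, _weight in ordered]


idle = {
    "queue1-dy-cr1-1": DYNAMIC_WEIGHT,
    "queue1-st-cr1-1": STATIC_WEIGHT,
    "queue1-dy-cr1-2": DYNAMIC_WEIGHT,
}

# The static node is listed first because it has the lowest weight.
print(allocation_order(idle))
```

With these weights, an idle static node is always picked before any idle dynamic node, which is the behavior the change describes.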
BUG FIXES
- Add validation to ScaledownIdletime value, to prevent setting a value lower than -1.
- Fix issue causing dangling IAM policies to be created when creating ParallelCluster CloudFormation custom resource provider with CustomLambdaRole.
- Fix an issue that was causing misalignment of compute nodes DNS name on instances with multiple network interfaces, when SlurmSettings/Dns/UseEc2Hostnames is set to True.
- Fix cluster creation failure with Ubuntu Deep Learning AMI on GPU instances and DCV enabled.
AWS ParallelCluster v3.6.1
We're excited to announce the release of AWS ParallelCluster 3.6.1
Upgrade
How to upgrade?
sudo pip install --upgrade aws-parallelcluster
ENHANCEMENTS
- Add support for Slurm accounting in US isolated regions.
CHANGES
- Avoid duplication of nodes seen by clustermgtd if compute nodes are added to multiple Slurm partitions.
- ParallelCluster AMIs for US isolated regions are now vended with preconfigured CA certificates to speed up node bootstrap.
- Replace nvidia-persistenced service with parallelcluster_nvidia service to avoid conflicts with DLAMI.
BUG FIXES
- Remove hardcoding of root volume device name (/dev/sda1 and /dev/xvda) and retrieve it from the AMI(s) used during create-cluster.
- Fix cluster creation failure when using CloudFormation custom resource with ElasticIp set to True.
- Fix cluster creation/update failure when using CloudFormation custom resource with large configuration files.
- Fix an issue that was preventing ptrace protection from being disabled on Ubuntu and was not allowing Cross Memory Attach (CMA) in libfabric.
- Fix fast insufficient capacity fail-over logic when using multiple instance types and no instances are returned.
AWS ParallelCluster v3.6.0
We're excited to announce the release of AWS ParallelCluster 3.6.0
Upgrade
How to upgrade?
sudo pip install --upgrade aws-parallelcluster
ENHANCEMENTS
- Add support for RHEL8.7.
- Add a CloudFormation custom resource for creating and managing clusters from CloudFormation.
- Add support for customizing the cluster Slurm configuration via the ParallelCluster configuration YAML file.
- Build Slurm with support for Lua.
- Increase the limit on the maximum number of queues per cluster from 10 to 50. Compute resources can be distributed flexibly across the various queues as long as the cluster contains a maximum of 50 compute resources.
- Allow specifying a sequence of multiple custom action scripts per event for the OnNodeStart, OnNodeConfigured and OnNodeUpdated parameters.
- Add new configuration section HealthChecks/Gpu for enabling the GPU Health Check in the compute node before job execution.
- Add support for Tags in the SlurmQueues and SlurmQueues/ComputeResources sections.
- Add support for DetailedMonitoring in the Monitoring section.
- Add mem_used_percent and disk_used_percent metrics for head node memory and root volume disk utilization tracking on the ParallelCluster CloudWatch dashboard, and set up alarms for monitoring these metrics.
- Add log rotation support for ParallelCluster managed logs.
- Track common errors of compute nodes and longest dynamic node idle time on the CloudWatch dashboard.
- Enforce the DCV Authenticator Server to use at least TLS-1.2 protocol when creating the SSL Socket.
- Install NVIDIA Data Center GPU Manager (DCGM) package on all supported OSes except for aarch64 centos7 and alinux2.
- Load kernel module nvidia-uvm by default to provide Unified Virtual Memory (UVM) functionality to the CUDA driver.
- Install NVIDIA Persistence Daemon as a system service.
CHANGES
- Note: 3.6 will be the last release to include support for Ubuntu 18. Subsequent releases will only support Ubuntu from version 20.
- Upgrade Slurm to version 23.02.2.
- Upgrade munge to version 0.5.15.
- Set Slurm default TreeWidth to 30.
- Set Slurm prolog and epilog configurations to target a directory, /opt/slurm/etc/scripts/prolog.d/ and /opt/slurm/etc/scripts/epilog.d/ respectively.
- Set Slurm BatchStartTimeout to 3 minutes to allow a maximum of 3 minutes for Prolog execution during compute node registration.
- Increase the default RetentionInDays of CloudWatch logs from 14 to 180 days.
- Upgrade EFA installer to 1.22.1.
  - Dkms: 2.8.3-2
  - Efa-driver: efa-2.1.1g
  - Efa-config: efa-config-1.13-1
  - Efa-profile: efa-profile-1.5-1
  - Libfabric-aws: libfabric-aws-1.17.1-1
  - Rdma-core: rdma-core-43.0-1
  - Open MPI: openmpi40-aws-4.1.5-1
- Upgrade Lustre client version to 2.12 on Amazon Linux 2 (same version available on Ubuntu 20.04, 18.04 and CentOS >= 7.7).
- Upgrade Lustre client version to 2.10.8 on CentOS 7.6.
- Upgrade NVIDIA driver to version 470.182.03.
- Upgrade NVIDIA Fabric Manager to version 470.182.03.
- Upgrade NVIDIA CUDA Toolkit to version 11.8.0.
- Upgrade NVIDIA CUDA sample to version 11.8.0.
- Upgrade Intel MPI Library to 2021.9.0.43482.
- Upgrade NICE DCV to version 2023.0-15022.
  - server: 2023.0.15022-1
  - xdcv: 2023.0.547-1
  - gl: 2023.0.1027-1
  - web_viewer: 2023.0.15022-1
- Upgrade aws-cfn-bootstrap to version 2.0-24.
- Upgrade image used by CodeBuild environment when building container images for AWS Batch clusters, from aws/codebuild/amazonlinux2-x86_64-standard:3.0 to aws/codebuild/amazonlinux2-x86_64-standard:4.0 and from aws/codebuild/amazonlinux2-aarch64-standard:1.0 to aws/codebuild/amazonlinux2-aarch64-standard:2.0.
- OpenSSL version 1.1.1 or later is required for ParallelCluster CLI due to a change in urllib3 2.0. Using an older OpenSSL will trigger an ImportError when executing a pcluster command.
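To check the OpenSSL requirement before upgrading, one option is to inspect the OpenSSL build that the Python interpreter is linked against. The sketch below uses only the standard library ssl module; the helper name openssl_meets_minimum is hypothetical and not part of the ParallelCluster CLI.

```python
import ssl


def openssl_meets_minimum(version_info=None, minimum=(1, 1, 1)):
    """Return True if the (major, minor, fix) OpenSSL version is at least `minimum`.

    When version_info is omitted, the version linked into this interpreter
    (ssl.OPENSSL_VERSION_INFO) is checked.
    """
    info = ssl.OPENSSL_VERSION_INFO if version_info is None else version_info
    return tuple(info[:3]) >= tuple(minimum)


# Report the linked OpenSSL build and whether it satisfies the 1.1.1 minimum.
print(ssl.OPENSSL_VERSION)
print(openssl_meets_minimum())
```

Run this with the same Python interpreter that is used to install and run pcluster, since different interpreters on one machine can be linked against different OpenSSL builds.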
BUG FIXES
- Fix EFS, FSx network security groups validators to avoid reporting false errors.
- Fix missing tagging of resources created by ImageBuilder during the build-image operation.
- Fix Update policy for MaxCount to always perform numerical comparisons on MaxCount property.
- Fix an issue that was causing misalignment of compute nodes IP on instances with multiple network interfaces.
- Fix replacement of StoragePass in slurm_parallelcluster_slurmdbd.conf when a queue parameter update is performed and the Slurm accounting configurations are not updated.
- Fix issue causing cfn-hup daemon to fail when it gets restarted.
- Fix issue causing dangling security groups to be created when creating a cluster with an existing EFS.
- Fix issue causing NVIDIA GPU compute nodes not to resume correctly after executing an scontrol reboot command.
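The MaxCount fix above is about comparing values numerically. The release notes do not say how the original bug arose, so the following Python sketch is only a generic illustration of why numeric comparison matters when values may arrive as strings; it is not the ParallelCluster update-policy code, and max_count_increased is a hypothetical helper.

```python
def max_count_increased(old_value, new_value):
    """Compare two MaxCount values numerically, even if they are given as strings."""
    return int(new_value) > int(old_value)


# As strings, "10" sorts before "9", so a lexicographic check gets this wrong.
print("10" > "9")
# Converting to integers first gives the intended answer.
print(max_count_increased("9", "10"))
```

The first line prints False and the second prints True, which shows how a string comparison could misreport an increase from 9 to 10 as a decrease.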
AWS ParallelCluster v3.5.1
We're excited to announce the release of AWS ParallelCluster 3.5.1
Upgrade
How to upgrade?
sudo pip install --upgrade aws-parallelcluster
ENHANCEMENTS
- Add a new way to distribute ParallelCluster as a self-contained executable shipped with a dedicated installer.
- Add support for US isolated region us-isob-east-1.
CHANGES
- Upgrade EFA installer to 1.22.0.
  - Efa-driver: efa-2.1.1g
  - Efa-config: efa-config-1.13-1
  - Efa-profile: efa-profile-1.5-1
  - Libfabric-aws: libfabric-aws-1.17.0-1
  - Rdma-core: rdma-core-43.0-1
  - Open MPI: openmpi40-aws-4.1.5-1
- Upgrade NICE DCV to version 2022.2-14521.
  - server: 2022.2.14521-1
  - xdcv: 2022.2.519-1
  - gl: 2022.2.1012-1
  - web_viewer: 2022.2.14521-1
BUG FIXES
- Fix an issue where updating a cluster to remove shared EBS volumes could cause node launching failures if MountDir matched the same pattern in /etc/exports.
- Fix compute_console_output log file being truncated at every clustermgtd iteration.
AWS ParallelCluster v3.5.0
We're excited to announce the release of AWS ParallelCluster 3.5.0
Upgrade
How to upgrade?
sudo pip install --upgrade aws-parallelcluster
ENHANCEMENTS
- Add official versioned ParallelCluster policies in a CloudFormation template to allow customers to easily reference them in their workloads.
- Add a Python library to allow customers to use ParallelCluster functionalities in their own code.
- Add logging of compute node console output to CloudWatch on compute node bootstrap failure.
- Add failures field containing failure code and reason to describe-cluster output when cluster creation fails.
CHANGES
- Upgrade Slurm to version 22.05.8.
- Make Slurm controller logs more verbose and enable additional logging for the Slurm power save plugin.
- Upgrade EFA installer to 1.21.0.
  - Efa-driver: efa-2.1.1-1
  - Efa-config: efa-config-1.12-1
  - Efa-profile: efa-profile-1.5-1
  - Libfabric-aws: libfabric-aws-1.16.1amzn3.0-1
  - Rdma-core: rdma-core-43.0-1
  - Open MPI: openmpi40-aws-4.1.4-3
BUG FIXES
- Fix cluster DB creation by verifying the cluster name is no longer than 40 characters when Slurm accounting is enabled.
- Fix an issue in clustermgtd that caused compute nodes rebooted via Slurm to be replaced if the EC2 instance status checks fail.
- Fix an issue where compute nodes could not launch with capacity reservations shared by other accounts because of a wrong IAM policy on head node.
- Fix an issue where custom AMI creation failed in Ubuntu 20.04 on MySQL packages installation.
- Fix an issue where pcluster configure command failed when the account had no IPv4 CIDR subnet.
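The cluster-name fix above introduces a length check that applies only when Slurm accounting is enabled. A minimal Python sketch of such a rule follows. It is an illustration under stated assumptions, not the validator shipped in ParallelCluster: the function and constant names are hypothetical, and only the 40-character limit comes from the notes.

```python
# Limit stated in the 3.5.0 notes; it applies only with Slurm accounting enabled.
MAX_NAME_LENGTH_WITH_ACCOUNTING = 40


def cluster_name_fits_accounting(cluster_name, slurm_accounting_enabled):
    """Return True if the name satisfies the length rule described in the notes.

    Other cluster naming rules are out of scope for this sketch.
    """
    if not slurm_accounting_enabled:
        return True
    return len(cluster_name) <= MAX_NAME_LENGTH_WITH_ACCOUNTING


print(cluster_name_fits_accounting("hpc-prod", True))
print(cluster_name_fits_accounting("x" * 41, True))
```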
AWS ParallelCluster v3.4.1
We're excited to announce the release of AWS ParallelCluster 3.4.1
Upgrade
How to upgrade?
sudo pip install --upgrade aws-parallelcluster
BUG FIXES
- Fix an issue with the Slurm scheduler that might incorrectly apply updates to its internal registry of compute nodes. This might result in EC2 instances becoming inaccessible or being backed by an incorrect instance type.
AWS ParallelCluster v3.4.0
We're excited to announce the release of AWS ParallelCluster 3.4.0
Upgrade
How to upgrade?
sudo pip install --upgrade aws-parallelcluster
ENHANCEMENTS
- Add support for launching nodes across multiple availability zones to increase capacity availability.
- Add support for specifying multiple subnets for each queue to increase capacity availability.
- Add new configuration parameter Iam/ResourcePrefix to specify a prefix for path and name of IAM resources created by ParallelCluster.
- Add new configuration section DeploySettings/LambdaFunctionsVpcConfig for specifying the VPC config used by ParallelCluster Lambda Functions.
- Add possibility to specify a custom script to be executed in the head node during the update of the cluster. The script can be specified with the OnNodeUpdated parameter when using Slurm as scheduler.
CHANGES
- Remove creation of EFS mount targets for existing FS.
- Mount EFS file systems using amazon-efs-utils. EFS file systems can be mounted using in-transit encryption and an IAM authorized user.
- Install stunnel 5.67 on CentOS7 and Ubuntu to support EFS in-transit encryption.
- Upgrade EFA installer to 1.20.0.
  - Efa-driver: efa-2.1
  - Efa-config: efa-config-1.11-1
  - Efa-profile: efa-profile-1.5-1
  - Libfabric-aws: libfabric-aws-1.16.1
  - Rdma-core: rdma-core-43.0-2
  - Open MPI: openmpi40-aws-4.1.4-3
- Upgrade Slurm to version 22.05.7.
AWS ParallelCluster v2.11.9
We're excited to announce the release of AWS ParallelCluster 2.11.9
Upgrade
How to upgrade?
sudo pip install aws-parallelcluster==2.11.9
BUG FIXES
- Prevent updating vpc_security_group_id when a managed FSx for Lustre file system is configured in the cluster. Doing so would result in file system deletion and potential data loss.
AWS ParallelCluster v3.3.1
We're excited to announce the release of AWS ParallelCluster 3.3.1
Upgrade
How to upgrade?
sudo pip install --upgrade aws-parallelcluster
CHANGES
- Allow the use of official product AMIs even after the two-year EC2 deprecation time.
- Increase memory size of ParallelCluster API Lambda to 2048 MB in order to reduce cold start penalty and avoid timeouts.
BUG FIXES
- Prevent managed FSx for Lustre file systems from being replaced during a cluster update by no longer supporting changes to the compute fleet subnet ID.
- Apply the DeletionPolicy defined on shared storages also during the cluster update operations.