Releases: GoogleCloudPlatform/cluster-toolkit

v1.46.1: Fix cloud rdma ofi tunables always being set

11 Feb 01:59
c491a4a

What's Changed

Bug fixes 🐞

  • Switch ofi startup script to not run automatically by @abbas1902 in #3659

Full Changelog: v1.46.0...v1.46.1

Release v1.46.0

07 Feb 00:49
bb1ddad

Highlights:

  • Kueue becomes the officially supported workload scheduler for A3U.
  • New blueprints added for A3U (GKE/GCS) as well as H4d VMs and Slurm examples.
  • SlurmGCP module enhanced with advanced machine features; plugins deprecated.

What's Changed

Key New Features 🎉

Module Improvements 🔨

Improvements 🛠

Deprecations 💤

Version Updates ⏫

New Contributors

Full Changelog: v1.45.1...v1.46.0

v1.45.1: Fix XPK issues running gcluster in container environments

22 Jan 18:13
a9dd634

What's Changed

Bug fixes 🐞

Full Changelog: v1.45.0...v1.45.1

Release v1.45.0

15 Jan 23:54
79299a1

Highlights:

  • A3 Ultra GKE blueprints updated to use Kueue 0.10.0 and Jobset 0.7.2, both of which are now supported.
  • Module improvements to support GKE cluster deletion protection, default node pools with shielded instances, the latest GKE version in the Rapid channel for A3 Ultra clusters, configurable upgrade settings for node pools, and managed hyperdisk support.
  • Example for running NVIDIA NeMo on a3-ultragpu-8g Slurm clusters.

What's Changed

Key New Features 🎉

Module Improvements 🔨

Improvements 🛠

Deprecations 💤

Bug fixes 🐞

New Contributors

Full Changelog: v1.44.2...v1.45.0

v1.44.2: Fix for Slurm autoscaler support for future reservations

09 Jan 00:21
484da6e

What's Changed

Bug fixes 🐞

  • Hotfix: Slurm autoscaler support for future reservations by @tpdownes in #3508

Full Changelog: v1.44.1...v1.44.2

Release v1.44.1: Support for a3-ultragpu-8g VMs and GKE, Slurm clusters

30 Dec 23:36
346d015

Release notes v1.44.1

This release announces Toolkit support for the new A3 Ultra machine type from Google Cloud. This machine type includes 8 NVIDIA H200 GPUs, each with dedicated CX-7 networking with RDMA support via RoCE.

The release includes 4 blueprints that maximize performance for the machine type:

  1. A simple Slurm blueprint provisioning A3 Ultra compute nodes with a shared Filestore /home
  2. A GKE blueprint that provisions an A3 Ultra compute node pool
  3. An advanced Slurm blueprint that additionally mounts a GCS bucket with performance-optimized caching settings for I/O and checkpointing
  4. A blueprint that provisions A3 Ultra compute nodes as VM instances (no scheduler) with RDMA networking

Example solutions using NCCL are provided for blueprints running under a scheduler.

v1.44.0: Future Reservations in Slurm, Topology Aware GKE, Expanded GPU RDMA Support

19 Dec 22:55
6a19416

What's Changed

Key New Features 🎉

Module Improvements 🔨

Improvements 🛠

Version Updates ⏫

Bug fixes 🐞

Full Changelog: v1.43.1...v1.44.0

v1.43.1: Patch version bump in OFE

12 Dec 20:02
0a8385b

What's Changed

Version Updates ⏫

  • Bump django from 4.2.16 to 4.2.17 in /community/front-end/ofe by @dependabot in #3358

Full Changelog: v1.43.0...v1.43.1

v1.43.0: GKE and networking enhancements

05 Dec 06:57
7ca11fc

What's Changed

Key New Features 🎉

Module Improvements 🔨

Improvements 🛠

Bug fixes 🐞

  • Revert "update a3 machines local ssd to use nvme instead of scsi for better performance" by @chengcongdu in #3272
  • remove GKE reservation validation for local ssd NVMe/SCSI interface by @chengcongdu in #3281

New Contributors

Full Changelog: v1.42.0...v1.43.0

v1.42.0: Filestore deletion protection, GCP maintenance as Slurm job, Docker daemon configuration

20 Nov 19:27
1a1e22a

What's Changed

Key New Features 🎉

Module Improvements 🔨

Improvements 🛠

Deprecations 💤

Version Updates ⏫

Bug fixes 🐞

  • Refactor mount/mode setting for local SSD RAID by @tpdownes in #3214
  • Fix a bug where try was hiding extraction of gpu driver version by @ankitkinra in #3257
  • Fix the gpu_installation_config default for case where no customer input by @ankitkinra in #3259
  • SlurmGCP. Fix bug that prevents resourcePolicies clean up. by @mr0re1 in #3266

New Contributors

Full Changelog: v1.41.0...v1.42.0