Diamond: Harnessing GPU Resources for Scientific Deep Learning

Abstract

Modern research computing cyberinfrastructure, such as ACCESS-CI and the NAIRR Pilot, offers GPU resources across geographically distributed clusters to meet the growing needs of scientific deep learning (DL) workloads. Even for high-performance computing (HPC) experts, configuring environments and managing DL workloads across supercomputers remain significant barriers. To address these obstacles, we present Diamond, an open-source platform that simplifies and streamlines the DL lifecycle on HPC systems. Diamond provides an intuitive graphical interface that abstracts system-level complexity, enabling users to develop, debug, and deploy DL models with minimal overhead. We identify several challenges in building such a platform, including portability, security, and usability, and propose an effective architectural solution to each. Notably, Diamond enables users to share and reuse DL workload environments across systems and collaborators, reducing redundant setup effort. Experimental results demonstrate that Diamond reduces the time to first successful deployment by an average of 68% compared with manual command-line configuration. The Diamond service is available at https://diamondhpc.ai.

Publication
2025 IEEE International Conference on eScience (eScience)