By working with Altair, ALCF researchers can schedule workloads more efficiently. As a result, researchers collectively can save hundreds of hours and make more scientific discoveries.
The U.S. Department of Energy funds Argonne National Laboratory to operate some of the fastest computers in the world. For decades, its systems have placed in the top 20 of the Top500 list of top-performing supercomputers, which is published twice a year.
Within the lab, the Argonne Leadership Computing Facility (ALCF) aims to use simulation, data analytics, and artificial intelligence to provide breakthroughs in science and engineering. To that end, the ALCF supports projects that cover many scientific disciplines, ranging from chemistry and biology to physics and materials science. Examples of the work conducted on the facility’s compute systems include modeling and simulation efforts to:
- Discover new materials for batteries
- Predict the impacts of global climate change
- Unravel the origins of the universe
- Develop renewable energy technologies
One of the systems available to the scientific community is Polaris, which ranked 17th on the most recent Top500 list. The system provides researchers and developers with a powerful new platform to prepare applications and workloads for science in the exascale era. The Aurora exascale supercomputer is coming soon. Once powered on, Aurora is expected to deliver a theoretical peak of more than two exaflops of computing power, or more than 2 billion billion (2 × 10^18) calculations per second.
To make such supercomputing capabilities available to researchers in a broad range of disciplines, the facility must accommodate a wide range of workloads and datasets. To put the volume of work into perspective, consider that last year the ALCF supported 385 active projects, delivered 35 million node-hours of compute time, and supported 1,538 facility users, according to the lab.
Demand for the facility is clearly high, so the ALCF strives to optimize the use of its installed systems. The challenges the ALCF and other scientific supercomputing centers face in this area include the following:
- Workload management issues
- The need to schedule and support simultaneous and concurrent workloads
- The need to optimize and manage thousands of node-hours
Working with a technology partner
Supercomputing facilities have traditionally pioneered the use of newer technologies such as higher-performance processors, high-speed networking, high-performance distributed storage file systems, and more. The facilities have the expertise and resources to implement new technologies as they come along.
One area where the facility, like many HPC and supercomputing centers, can use help is in optimizing operations. Such centers need to balance the many critical dimensions of modern infrastructure, from advanced scheduling for CPUs and GPUs to optimizing software licenses, I/O, storage, and more.
Open-source tools can help in these areas, but the facility opted to partner with Altair and use its Altair PBS Professional solution, which addresses the facility's critical workload management and job scheduling issues. Compared to open-source solutions, the Altair solution provides enhanced capabilities needed in the lab's fast-paced, constantly changing production HPC environment.
PBS Professional is a workload manager designed to improve productivity, optimize utilization and efficiency, and simplify administration for clusters, clouds, and supercomputers. It accomplishes this by automating job scheduling, management, monitoring, and reporting.
PBS Professional accelerates job execution and selects optimal job placement across diverse, broadly distributed resources. Administrators can create policies that manage distributed, mixed-vendor computing assets as a single, unified system. Tested at scales of more than 50,000 nodes, PBS Professional scales to support millions of cores with fast job dispatch and minimal latency.
PBS Professional is also designed to handle workloads ranging from the biggest HPC jobs to millions of small, high-throughput jobs.
Argonne’s Polaris supercomputer uses PBS Professional to help scientists find ways to slash greenhouse gas emissions through research into fusion energy, better biofuels, and safer, more reliable next-generation nuclear reactors. Specifically, PBS Professional is helping to optimize and manage thousands of node-hours simultaneously. (The Aurora exascale system will also use PBS Professional.)
By using PBS Professional on the HPC systems at ALCF, researchers can schedule simultaneous and concurrent workloads, which increases research throughput without interruption. As a result, researchers collectively can save hundreds of hours and make scientific discoveries more effectively.
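To see why running workloads concurrently raises throughput, consider a toy Python sketch. It is not PBS Professional's scheduling algorithm, and the `Job` class, the job names, the node counts, and the 8-node system are all hypothetical values chosen for illustration. The sketch simply compares the time to finish a batch when jobs run one at a time with the time when jobs start in submission order and share the machine whenever enough nodes are free.

```python
"""Toy comparison of one-at-a-time versus concurrent job scheduling.

Illustrative only: this is NOT PBS Professional's scheduling algorithm, and
the job names, node counts, and run times are made-up assumptions.
"""
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    nodes: int    # nodes the job requests
    hours: float  # estimated run time in hours


def serial_makespan(jobs):
    """Hours to finish every job when only one job runs at a time."""
    return sum(job.hours for job in jobs)


def concurrent_makespan(jobs, total_nodes):
    """Hours to finish every job when jobs start in submission order
    and run side by side whenever enough nodes are free."""
    free_nodes = total_nodes
    clock = 0.0
    running = []  # min-heap of (finish_time, nodes_held)
    for job in jobs:
        if job.nodes > total_nodes:
            raise ValueError(f"{job.name} requests more nodes than the system has")
        # Advance the clock to the next job completion until this job fits.
        while free_nodes < job.nodes:
            finish_time, released = heapq.heappop(running)
            clock = max(clock, finish_time)
            free_nodes += released
        heapq.heappush(running, (clock + job.hours, job.nodes))
        free_nodes -= job.nodes
    return max((finish for finish, _ in running), default=clock)


# Hypothetical workload on a hypothetical 8-node system.
demo_jobs = [
    Job("battery-materials", nodes=4, hours=3.0),
    Job("climate-model", nodes=4, hours=2.0),
    Job("cosmology-sim", nodes=8, hours=1.0),
    Job("biofuel-screen", nodes=2, hours=4.0),
]

print("one job at a time:", serial_makespan(demo_jobs), "hours")
print("concurrent:", concurrent_makespan(demo_jobs, total_nodes=8), "hours")
```

In this made-up batch, running jobs side by side finishes the work in 8 hours instead of 10. A production workload manager weighs far more than this sketch does, such as priorities, policies, and mixed resource types, but the basic gain from filling idle nodes is the same.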
A final word
These days, leading scientists advance their work by using HPC to run increasingly realistic and detailed models and simulations of real-world systems. Facilities like the ALCF allow them to take their research to new levels.
As demand for compute capacity grows, ALCF is working with Altair to ensure that the facility’s systems are highly utilized and that workflows are sped up by running on appropriate cores. The result is that more projects can be supported, jobs run faster, and scientists get results sooner. “PBS Professional allows researchers to drive scientific advancement at a significantly faster rate,” said Bill Allcock, ALCF director of operations, in a press release discussing the collaboration between Altair and ALCF.
Learn more about the benefits of PBS Professional at ALCF and in HPC environments by attending Altair’s Future.Industry 2023 conference, available now on demand.