In this interview, we discuss next-gen HPC challenges, including software license management and the complexity of modern HPC workflows, and how to deal with them to ensure high utilization and a high ROI.
Making efficient use of high-performance computing (HPC) capabilities, both on-premises and in the cloud, is a complex endeavor. Issues like workload scheduling, license management, cost control, and more come into play. In advance of Altair’s Future.Industry 2023, RTInsights sat down with Dr. Rosemary Francis, chief scientist, HPC at Altair, to discuss the challenges of using HPC for semiconductor design and other applications, where cloud fits in, and how Altair can help.
Here is a summary of our conversation.
RTInsights: What does exascale computing bring to semiconductor design and other industrial segments?
Francis: Exascale computing means different things to different industries and to different people. Certainly, exascale within the research computing sector means single machines sitting in the top 500 or top 10 based on published benchmark results. We’re talking large numbers of cores delivering exaflops of performance, all configured as one machine. Those kinds of machines are essential to many different segments, the most notable being weather forecasting. As we need to simulate climate destruction in increasing levels of detail to predict natural disasters, having exascale machines is absolutely essential.
That’s an example where you really do use one cluster for one simulation. There are other related areas where you would use a whole exascale machine for one compute, for example, when doing very large-scale astronomy simulations or geothermal simulations. In the energy sector, exascale is very important as well.
Outside of those applications, exascale is not important for running a single application. It’s more about having a large amount of resources available to run the very many large-scale applications that are needed by an organization. It is more efficient to have those configured as a single cluster rather than having lots of separate compute systems that need to be managed separately. That can be a real headache for system administrators. And it can lead to low utilization and, ultimately, a poor return on investment.
When you look at semiconductor design, it is different from the other HPC sectors. Semiconductor design has quite different workloads. They tend to be higher-throughput workloads rather than very wide MPI jobs. It’s not the only sector that runs high-throughput workloads, but it is a sector where high-throughput workloads dominate. Instead of having one single application that runs across many compute nodes for many hours, days, or even weeks, the semiconductor industry tends to have very short workloads that run for minutes or even seconds on a single core. These smaller applications are run in parallel on an enormous scale. So the need for compute capacity is increasing. However, it doesn’t have to be in the form of exascale machines; it could be through smaller clusters with smart resource sharing.
Now, where that becomes interesting is in license management. License management is the main challenge in semiconductor design. Rather than being limited by compute or memory access, license access seems to be the limiting factor when it comes to running semiconductor workloads. So, managing that pool of licenses across multiple clusters in an exascale environment is a challenge.
We do have products that solve that problem. One of the things that we’ve done in Altair Accelerator is to put licensing first. License-first scheduling, in combination with Rapid Scaling cloud bursting, allows you to deliver cloud infrastructure for the semiconductor industry while making sure that you’re utilizing your licenses effectively and you’re not spinning up compute resources that you don’t need.
RTInsights: What are the challenges when trying to efficiently manage and schedule workloads in EDA?
Francis: In EDA, there are, maybe, three big challenges. As I already mentioned, making the best use of semiconductor licenses is the first challenge. The next challenge is the very large number of very high-throughput jobs. It’s not uncommon for semiconductor customers to be running millions of jobs per day, whereas other large HPC centers might be running millions of jobs per year. It really is a different scheduling challenge.
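To put the gap between millions of jobs per day and millions of jobs per year in scheduler terms, the short calculation below converts both into an average dispatch rate. The figure of exactly one million jobs is an illustrative assumption, not a number from the interview, and real job arrivals are bursty rather than evenly spread.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def jobs_per_second_daily(jobs_per_day: float) -> float:
    """Average dispatch rate needed to keep up with a daily job count."""
    return jobs_per_day / SECONDS_PER_DAY


def jobs_per_second_yearly(jobs_per_year: float, days_per_year: int = 365) -> float:
    """Average dispatch rate needed to keep up with a yearly job count."""
    return jobs_per_year / (days_per_year * SECONDS_PER_DAY)


# Assumed round numbers: one million jobs in each case.
eda_rate = jobs_per_second_daily(1_000_000)       # roughly 11.6 jobs per second
general_rate = jobs_per_second_yearly(1_000_000)  # roughly 0.03 jobs per second

print(f"{eda_rate:.1f} jobs/s versus {general_rate:.3f} jobs/s")
```

At the same job count, the daily case needs a sustained rate 365 times higher, which is why throughput rather than job width dominates the scheduling problem here.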
The third challenge, which is quite unique to the semiconductor industry, is the enormous complexity of the workflows. Semiconductor workflows have many, many stages involving many tools, and those stages are iterated upon, run over and over again, trying to get the recipe for constructing that chip design.
Traditional non-EDA workflows can be complex, but they tend to run from start to end, and you often have make-clean steps within that that are quite costly, where you delete a lot of the intermediate results and then remake things from the beginning. In the semiconductor design industry, that’s just not practical. It can take hours to generate an intermediate result, and then you might want to iterate on one stage of that workflow for many days or even weeks to tune the results.
We have a product called Altair FlowTracer, which makes it practical to run those very complex semiconductor workflows. Before customers come to us and before they adopt FlowTracer, they often have heavyweight scripted environments that have been built up over many years. It can take even a senior engineer months or even years to get a grip on that scripted workflow when they’re new to an organization. FlowTracer makes it easy to construct the workflow with all of its dependencies. It then makes it easy to run and rerun those workflows, diagnosing any failures along the way and iterating on results so that even junior engineers can learn to run workflows extremely quickly.
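The idea of rerunning one stage without remaking everything from the beginning can be sketched as a dependency walk. This is a generic illustration and not FlowTracer's API or implementation; the stage names and the `DEPENDS_ON` table are hypothetical.

```python
from collections import deque

# Hypothetical workflow: each stage lists the stages it depends on.
DEPENDS_ON = {
    "synthesis": [],
    "floorplan": ["synthesis"],
    "placement": ["floorplan"],
    "routing": ["placement"],
    "timing": ["routing"],
    "power": ["routing"],
}


def topological_order(depends_on):
    """Return all stages so that every stage follows its dependencies."""
    remaining = {stage: set(deps) for stage, deps in depends_on.items()}
    order = []
    while remaining:
        ready = sorted(s for s, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("workflow contains a dependency cycle")
        for stage in ready:
            order.append(stage)
            del remaining[stage]
        for deps in remaining.values():
            deps.difference_update(ready)
    return order


def stages_to_rerun(depends_on, changed):
    """Return the changed stage plus everything downstream, in run order."""
    # Invert the edges so we can walk from a stage to its dependents.
    dependents = {stage: [] for stage in depends_on}
    for stage, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(stage)

    # Breadth-first walk to collect every stage made stale by the change.
    stale = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for nxt in dependents[current]:
            if nxt not in stale:
                stale.add(nxt)
                queue.append(nxt)

    return [stage for stage in topological_order(depends_on) if stage in stale]


print(stages_to_rerun(DEPENDS_ON, "placement"))
```

Tuning "placement" here leaves the "synthesis" and "floorplan" results untouched, so the hours spent generating them are not repeated on every iteration.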
Obviously, scheduling those workflows is a challenge. You have to make sure that, for example, if you’re cloud bursting, you don’t burst to the cloud if you’re going to run out of licenses. It seems obvious that you wouldn’t want to spin up machines in the cloud if you’re not going to have the licenses to run the tools. However, automating that efficiently so that you spin up exactly the right resources that you need and nothing more is a very complicated task across thousands of workflows and millions of tasks. That’s something that Altair Accelerator has been designed to make straightforward.
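A minimal sketch of the license-first idea follows, under simplifying assumptions that are not from the interview: every job needs exactly one license and one compute slot, and all names (`BurstInputs`, `cloud_nodes_to_start`) are hypothetical. It is not how Altair Accelerator is implemented; it only shows why the license count, not the queue length, should cap how many cloud machines are started.

```python
import math
from dataclasses import dataclass


@dataclass
class BurstInputs:
    queued_jobs: int           # jobs waiting to run
    free_licenses: int         # licenses not currently checked out
    idle_onprem_slots: int     # free on-premises slots
    slots_per_cloud_node: int  # slots provided by one cloud machine


def cloud_nodes_to_start(inputs: BurstInputs) -> int:
    """Size a cloud burst so that no node is started without a license to use it."""
    # Licenses cap how many queued jobs can actually start.
    runnable = min(inputs.queued_jobs, inputs.free_licenses)
    # On-premises capacity absorbs runnable jobs before any cloud spend.
    overflow = max(0, runnable - inputs.idle_onprem_slots)
    return math.ceil(overflow / inputs.slots_per_cloud_node)


# 500 jobs queued but only 120 licenses free: size the burst for 120 jobs, not 500.
plenty = BurstInputs(queued_jobs=500, free_licenses=120,
                     idle_onprem_slots=40, slots_per_cloud_node=16)
# Only 30 licenses free: on-premises slots cover them, so no burst at all.
scarce = BurstInputs(queued_jobs=500, free_licenses=30,
                     idle_onprem_slots=40, slots_per_cloud_node=16)

print(cloud_nodes_to_start(plenty), cloud_nodes_to_start(scarce))
```

A scheduler that sized the burst from the queue alone would start 29 nodes in the first case (460 overflow jobs at 16 slots each), most of which would sit idle waiting for licenses.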
Customers typically see enormous cost savings when they adopt Accelerator because of this license-first approach. We’ve got a case study that’s been published recently from Annapurna Labs. They have adopted Accelerator for their chip design. They’re owned by Amazon, so they use the cloud exclusively for their chip design, and they’ve seen cost savings of around 50%, which is just mind-blowing when you think of the scale that they’re running at.
RTInsights: What are the challenges when trying to optimize other HPC environments?
Francis: I’m the product manager for both Altair PBS Professional and Altair Grid Engine workload managers. Although they are sometimes used for semiconductor design, these workload managers have a much broader HPC focus. PBS has been owned by Altair for more than 20 years and is deeply integrated with our industrial simulation and manufacturing product range, our Altair HyperWorks suite. We acquired the Grid Engine product just over two years ago. It has a broad range of customers, particularly a large number of customers in the life sciences sector and also in banking and fintech.
Both of those products offer enormous flexibility and configurability in what they can do to make sure that an HPC center can see the best return on its investment. It’s often easy in a very complex environment to see utilization drop in HPC due to poorly configured infrastructure. When you consider the cost of ownership of an HPC system, that is enormously expensive. So, the very high utilization that you’re able to achieve with Altair workload managers is where the value really lies. No two HPC environments are the same, so flexibility in the solution and the expertise of our support team are vital in delivering a solution that aligns with our customers’ business needs.
For example, in weather forecasting, there are some workloads that really must run on a schedule, no matter what. They also mix very wide jobs with very short, high-throughput jobs, which can also be a challenge for scheduling. Many of our customers with this pattern of jobs layer our solutions and employ PBS Professional for the wide jobs, with Accelerator Plus sitting on top for the high-throughput workloads. This capability was why NCAR chose Altair over their older open-source solution.
RTInsights: Where does the cloud come in with these modern environments?
Francis: There’s been a lot written in HPC about on-premises versus the cloud. And there will always be room for large-scale HPC centers; the cost of ownership makes sense for customers who know they’re going to have the workloads to run in it. But equally, very few centers know exactly what they’re going to run over the next year, three years, or five years. There will be some unforeseen projects and short-lived workloads where you just need a bit of extra compute. That is where the cloud can be a cost-effective option.
We’re seeing increasing numbers of our customers turn to the cloud. A lot of that is being driven by the adoption of machine learning across all of HPC. Machine learning often results in very bursty workloads. These are workloads that need to be run for a short amount of time and are embarrassingly parallel. They are ideal for cloud bursting, where the cost of owning that hardware outright would outweigh the cost of bursting into the cloud. So, I don’t think there are many organizations in HPC that are not considering cloud, at least for some of their workloads.
Where Altair comes in is making the cloud easily accessible and a low-risk investment. When people started using the cloud in HPC some years ago, it was very hard work. There was a lot of automation that had to be done by the center, and it was highly risky because it was hard to control costs in the cloud. The cloud vendors make it difficult to work out what you’re spending and difficult to limit what you’re spending, so it is easy to overspend in the cloud.
Altair delivers that complete solution to smooth the journey to the cloud. We have solutions that help you discover and profile your on-premises workloads so you can identify workloads for running in the cloud and help you package them for cloud migration. We also have solutions that allow you to burst your cluster into the cloud and make policy decisions about when to do that. Built into our products is the core capability to make sure that you’ve got the budget available, that the users or projects that have permission to run in the cloud are bursting into the cloud, that the applications that are ready for the cloud are bursting, and that you’re not spinning up any resources that then cannot be used.
Again, licensing is important. It is not quite as important as in EDA, but still important. There’s no point in spinning up a machine if you don’t have the license to run the tool once it’s there. Also, spinning that resource down once the workload has been completed is really important.
Again, it sounds easy when I describe it like that. But putting this in place and automating everything can be very complicated. That’s where our tools really do the heavy lifting. You can set budgets and spending limits in our cloud-bursting tools. You can allocate budgets to different individuals, projects, or teams. And you can set other restrictions, like making sure that workloads are running on hardware that has been benchmarked and that you’re not spending a lot of money on an expensive machine that is unsuitable for your workload. We also make sure that you’re not overspending by checking out the budget for a job before that job is run. That’s much better than billing you later on to let you know that you went over budget, which is what a lot of organizations are stuck with if they don’t use our solutions.
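Checking out budget before a job runs, rather than discovering an overspend on a later bill, can be sketched as a reserve-then-settle pattern. The class and method names below are hypothetical and the amounts are made up; this is an illustration of the pattern, not the interface of any Altair product.

```python
class BudgetExceeded(Exception):
    """Raised when a job's estimated cost does not fit in the remaining budget."""


class ProjectBudget:
    def __init__(self, limit: float):
        self.limit = limit     # total spend allowed for the project
        self.spent = 0.0       # cost of jobs that have finished
        self.reserved = 0.0    # estimated cost of jobs still running

    @property
    def available(self) -> float:
        return self.limit - self.spent - self.reserved

    def check_out(self, estimated_cost: float) -> None:
        """Reserve budget before the job starts, or refuse the job."""
        if estimated_cost > self.available:
            raise BudgetExceeded(
                f"need {estimated_cost}, only {self.available} available"
            )
        self.reserved += estimated_cost

    def settle(self, estimated_cost: float, actual_cost: float) -> None:
        """Release the reservation and record what the job really cost."""
        self.reserved -= estimated_cost
        self.spent += actual_cost


budget = ProjectBudget(limit=1000.0)
budget.check_out(600.0)          # first job is admitted
try:
    budget.check_out(500.0)      # second job would exceed the limit
except BudgetExceeded as err:
    print("rejected before any cloud spend:", err)
budget.settle(600.0, 450.0)      # first job finished under its estimate
budget.check_out(500.0)          # now the second job fits
```

Counting reservations against the limit is the design choice that matters: two jobs admitted at the same time cannot jointly overshoot the budget, because the second is refused before any machine is started.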
And importantly, we take care of a lot of the automation when it comes to the cloud bursting itself. We have multi-cloud connectors, which means you can easily burst into all of the major cloud vendors, including AWS, Oracle, Azure, and Google Cloud. So, if you really want to, you can have some workloads bursting to one cloud and other workloads bursting to another cloud. Our customers do that based on different capabilities. They might have a specific capability they want in the cloud, or it might be purely because of capacity. Some of our customers are running workloads that are so enormous that even the public cloud can’t deliver the number of machines that they want, and then they will want to burst to multiple public clouds at the same time.
There is a case study we’ve just published from Punch Torino about a customer doing that. This is a company that was founded by General Motors and then transferred to the Punch Group. Due to the transfer, they had to set up their own HPC infrastructure, so they had the really exciting challenge of doing that from scratch. They decided to go a hundred percent cloud, and they wanted to be multi-cloud to give them flexibility when choosing different capabilities in their infrastructure. They also chose the multi-cloud approach to reduce their reliance on one vendor and reduce their exposure to risks such as price inflation.
That’s an example where they really couldn’t have done it without our cloud-bursting solutions. They were able to put a multi-cloud HPC environment together within a matter of weeks using our tools. In contrast, it’s not uncommon for organizations who are building their own infrastructure using open-source software for HPC to spend upwards of six months on a single-vendor cloud-bursting solution that is then locked into one vendor and hard to change.
RTInsights: Do you need to be an expert to use HPC?
Francis: No, we’ve got a solution called Altair Access, which allows users to access HPC compute without needing to be experts in HPC. It’s obviously a hugely valuable tool, not just for giving users access to the compute but also allowing them to leverage cloud bursting with complete transparency.
The user does not need to know anything about HPC or anything about cloud in order to leverage those compute platforms for their workloads. From their point of view, the experience is as if they’re running the tool on their own laptop at home. The configuration and control are in the hands of the system administrator, which means that it can be very tightly controlled in terms of efficiency and cost.
So, users get a great experience, admins get a great experience, and they can leverage all of the cloud-bursting and multi-cloud capabilities within the wider HPC infrastructure. Punch Torino leveraged Access as well for that. It’s not just a tool for accessing HPC; it’s also a tool for collaborating, and it allows you to share and preview results within the platform, particularly when integrated with our HyperWorks simulation products. It’s really groundbreaking when taking HPC out to a wider audience.
RTInsights: Once you have all this in place, then what are the benefits?
Francis: The main benefit of the Altair HPC product suite is the return on investment. It can be tempting when comparing our products to open-source solutions to think, “You’re getting commercial support, but really if we leverage the community, can’t we do that same thing with open-source tools?”
We prove time and time again that we deliver a much higher return on investment with our commercial solutions. It is not just by delivering commercial support, although that is important. We’re the only company that delivers first, second, and third-level support for our products in HPC. But the key factor is the quality and flexibility of our tools.
We had a customer recently who had been using an open-source solution for their cluster. They then increased the number of tasks running on it beyond 50,000 concurrent workloads. They saw the cluster’s utilization absolutely plummet to below 50%, with thousands of tasks queuing. They engaged consultants, and they worked with some of the open-source developers, but they just couldn’t get the performance to improve.
If you think 50% utilization of an HPC cluster is an enormous waste when you’ve spent millions of dollars on that HPC cluster in the first place, you are right. Once they’d switched to Altair Grid Engine, they started seeing around 95% utilization or above. They’re very short workloads, and obviously, there is some machine downtime and things like that, so it’s never possible to go all the way up to a hundred percent, but that’s about as near to a hundred percent utilization as you can possibly get.
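The effect of moving from 50% to 95% utilization can be worked through with simple arithmetic. The sketch below assumes the cluster's total cost and capacity stay fixed, which is a simplification; the function names are illustrative. Since utilization was reported as below 50% before the switch, these figures are a lower bound on the improvement.

```python
def useful_work_gain(util_before: float, util_after: float) -> float:
    """How many times more useful compute the same cluster delivers."""
    return util_after / util_before


def cost_per_useful_hour_reduction(util_before: float, util_after: float) -> float:
    """Fractional drop in cost per useful core-hour, for a fixed total cost."""
    # Cost per useful hour is proportional to 1 / utilization.
    return 1 - util_before / util_after


gain = useful_work_gain(0.50, 0.95)                     # 1.9 times the useful work
saving = cost_per_useful_hour_reduction(0.50, 0.95)     # about 0.47

print(f"{gain:.1f}x useful work, {saving:.0%} lower cost per useful core-hour")
```

In other words, under these assumptions the same hardware delivers nearly twice the work, and each useful core-hour costs a little under half as much as before.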
The ability to queue that number of jobs concurrently and make sure that the cluster was filled even with these short, high-throughput jobs was really where the value lay. And that’s what we deliver across our whole solution suite. Whether you’re doing HPC, EDA, cloud bursting, or on-premises computing, our tools can help you realize a significant return on your investment.