Most Electronic Design Automation (EDA) applications are not considered traditional High Performance Computing (HPC) applications that run with Message Passing Interface (MPI), but it can be argued that EDA is indeed HPC, as it demands similar compute and IO resources. IO is primarily through a POSIX filesystem, usually NFS-based because of its simplicity. [Future blog post: how EDA can escape POSIX] It is well known that higher IO latency is detrimental to JobEfficiency (CPUtime/Runtime), and in EDA you want efficiency over 95% for batch jobs. Measuring this is very simple from your scheduler's job-finish records, but in a large grid with many NAS filers, it becomes challenging to detect under-performing batch jobs and associate them with a particular IO resource deficiency. Furthermore, without automated toolsets and Storage-Aware Scheduling, the simplest way to address this is with Scale-out Storage. Measuring the useful IO capacity of the old and new systems is also a daunting effort, but recently the members of the SPEC consortium have made the first attempt at creating an EDA-specific workload for use in benchmarking.
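The efficiency check described above can be sketched in a few lines. This is a toy illustration, assuming you have already parsed job-finish records (e.g. from your scheduler's accounting log) into tuples; the field layout and job names here are made up, not any scheduler's actual schema.

```python
# Sketch: flag low-efficiency batch jobs from scheduler finish records.
# Input is assumed to be (job_id, cpu_seconds, runtime_seconds) tuples
# parsed from your scheduler's accounting log; names are illustrative.

def job_efficiency(cpu_seconds, runtime_seconds):
    """JobEfficiency = CPUtime / Runtime, as a percentage."""
    if runtime_seconds <= 0:
        return 0.0
    return 100.0 * cpu_seconds / runtime_seconds

def flag_inefficient(jobs, threshold=95.0):
    """Return (job_id, efficiency) for jobs below the target (95% for EDA batch)."""
    return [(jid, job_efficiency(cpu, run))
            for jid, cpu, run in jobs
            if job_efficiency(cpu, run) < threshold]

jobs = [
    ("dv_001", 3500, 3600),  # 97.2% -- healthy
    ("dv_002", 2900, 3600),  # 80.6% -- likely stalled waiting on IO
]
print(flag_inefficient(jobs))  # only dv_002 is flagged
```

In practice the interesting part is the next step: joining the flagged jobs against which filer and export each one was hitting, which is exactly the association problem described above.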
There are a few different IO profiles depending on the type of work. SPECsfs 2014 SP2 (released in December 2017) has a benchmark profile called “EDA”. Within it are a couple of synthetic representations of Frontend (Design Verification, or DV) and Backend (Physical Design (PD), including parasitic extraction). DV jobs are traditionally small-memory, approximately 4GB per job, with very chatty smallfile writes, while Backend jobs are traditionally large-memory (20GB+) with large-file streaming (the extraction part). Because of this, it is sometimes desirable to dedicate smaller-memory machines to DV (maybe 128GB @ 16 cores, or a 256GB @ 40-core dual-processor machine), and then dedicate a separate cluster of large-memory machines, say 512GB to 1.5TB, to Physical Design (PD). Another strategy, which is becoming more popular, is to combine the workloads on the same machines: run the large-memory PD jobs, then fill the left-over memory slots (roughly 20GB each) with DV jobs. SPECsfs attempts to simulate all three of these scenarios, running just Frontend, then Backend, followed by the mixed workload. [The above is open to discussion; there are many architectural strategies and optimization opportunities, which are constantly evolving. Maintaining an EDA grid is akin to painting the Golden Gate Bridge: starting at one end, taking a year to reach the other, but using a different color every couple of months.]
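The mixed-workload packing idea can be made concrete with a little arithmetic. The sketch below uses the illustrative figures from the text (a hypothetical 200GB/8-core PD job size is my assumption, not anything from the post) and is not a real scheduler policy:

```python
# Sketch of the mixed-workload packing idea: after placing large-memory
# PD jobs on a host, how many ~20GB DV slots remain? The PD job size
# (200GB, 8 cores) is an assumed, illustrative figure.

def leftover_dv_slots(host_mem_gb, host_cores,
                      pd_jobs, pd_mem_gb=200, pd_cores=8,
                      dv_mem_gb=20, dv_cores=1):
    """Return how many DV jobs fit in the memory/cores left by PD jobs."""
    mem_left = host_mem_gb - pd_jobs * pd_mem_gb
    cores_left = host_cores - pd_jobs * pd_cores
    if mem_left < 0 or cores_left < 0:
        return 0
    return min(mem_left // dv_mem_gb, cores_left // dv_cores)

# A 512GB / 40-core host running two of these PD jobs:
print(leftover_dv_slots(512, 40, pd_jobs=2))  # 5 DV slots (memory-bound)
```

Note that on this host the fill-in is memory-bound, not core-bound, which is why the leftover-memory strategy pairs naturally with the small-footprint DV jobs.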
The DV profile briefly mentioned above, with its smallfile IO, has a few very interesting variants. Usually it is well-formed smallfile writes (4K to 128K files, with a 32K IO size), but some workflows additionally redirect the STDOUT of the tool to the same scratch storage. This stream is in addition to the well-formed IO output from the tool, and it is considered mal-formed IO, meaning it is all lseek-append with an average 1K IO size. The job workload is now a bi-modal IO profile. Linux sometimes does a good job of buffering this and blocking it into well-formed IO, but sometimes not. The problem compounds when DV jobs run at large scale (up to 2000 or more concurrent from one user). The NAS storage is then presented with random smallfile writes, the most challenging of all. At the same time, there can be a large number of deletes going on in scratch storage, which impacts performance and creates latency for all of the contending IOs.
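To see why buffering matters for that STDOUT stream, here is a toy illustration: a counting wrapper stands in for the filesystem, and we compare unbuffered 1K appends against the same stream behind a 128K buffer. This is purely a demonstration of write coalescing, not a measurement of any real NFS client behavior:

```python
# Toy illustration: unbuffered 1K appends each hit the "filesystem"
# separately, while a 128K buffer coalesces them into a handful of
# larger, well-formed writes.
import io

class CountingRaw(io.RawIOBase):
    """Raw sink that counts how many write() calls reach it."""
    def __init__(self):
        self.calls = 0
    def writable(self):
        return True
    def write(self, b):
        self.calls += 1
        return len(b)

def simulate(buffer_size, n_lines=1000, line_bytes=1024):
    raw = CountingRaw()
    stream = io.BufferedWriter(raw, buffer_size=buffer_size) if buffer_size else raw
    for _ in range(n_lines):
        stream.write(b"x" * line_bytes)  # one 1K "log line" append
    if buffer_size:
        stream.flush()
    return raw.calls

print("unbuffered write calls:", simulate(0))             # one per 1K line
print("128K-buffered write calls:", simulate(128 * 1024))  # coalesced, far fewer
```

When the kernel (or the tool's own stdio buffering) does this coalescing for you, the NAS sees well-formed IO; when it doesn't, you get the random 1K append storm described above.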
The SPECsfs benchmark becomes useful when you have exclusive (and well-understood) Compute, Network, and Storage resources, working at a scale that matches the expected IO capacity of your NAS.
Currently, with Scale-Out NFS (Isilon, Pure Storage, and NetApp FlexGroups) it is possible to achieve the scale-out performance which was once only attainable with parallel filesystems such as Spectrum Scale (GPFS), Lustre, Panasas, and more recently BeeGFS. Parallel filesystems are a provocative topic of their own, likely another blog post in the future.
The SPECsfs benchmark can be run on 120 or so nodes, and can create a small (four jobs per host) workload of 5GB/sec, or a large (16 or more jobs per host) workload close to 20GB/sec. The key metrics out of it are throughput and latency. SPEC prohibits publication of benchmark results outside of their rigid audit and review, but in general, look for the highest throughput at a desired level of latency. Latency is the key indicator of storage (or network) subsystem saturation, and testing at lower job densities (four jobs per machine) can help predict which system will be most efficient at a full load of 16 or even 40 jobs per machine. Keep testing with increasing load factors until latency exceeds some pre-determined threshold, at which point you have the most illuminating metric: the throughput capacity of the storage Device Under Test (DUT).
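The load-ramp logic above reduces to a simple rule: capacity is the last throughput point measured before latency crosses your threshold. Here is a minimal sketch; the ramp numbers are made up for illustration (SPEC rules prohibit publishing real results), and the 3ms threshold is an arbitrary example:

```python
# Sketch of the load-ramp rule: increase job density until latency
# crosses a threshold; the last point under it is the DUT's usable
# throughput capacity. All numbers below are illustrative, not results.

def find_capacity(samples, latency_limit_ms=3.0):
    """samples: (jobs_per_host, throughput_gbps, latency_ms) tuples,
    ordered by increasing load. Returns (load, throughput) at the last
    point under the latency limit, or None if even the lightest load
    is already saturated."""
    capacity = None
    for load, gbps, lat_ms in samples:
        if lat_ms > latency_limit_ms:
            break
        capacity = (load, gbps)
    return capacity

# Made-up ramp on a 120-node client pool:
ramp = [(4, 5.1, 0.8), (8, 9.8, 1.4), (16, 17.5, 2.6), (24, 19.2, 6.0)]
print(find_capacity(ramp))  # (16, 17.5): latency blows up at 24 jobs/host
```

Note the shape of the made-up data: throughput still creeps up at 24 jobs per host, but latency has more than doubled past the threshold, which is exactly the saturation signature the text describes.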
Be aware that synthetic benchmarks can highlight vendor-specific characteristics, for better or worse, and they do not necessarily reflect what real-world results will achieve. There is no substitute for real-world application testing. In the future, I plan to write a blog post about the endless quest to calibrate synthetic tests against real-world application testing.
Great topic, Rob.
Thanks Jerome! BTW, I’m heading to DAC the 25th-27th, and of course the HPC Pros event Monday night at the Thirsty Bear Brewing. Looking forward to talking with more enthusiasts over some good beer!
Good introduction. I’ll be curious to see how you’ve tackled this in your experience, and whether it aligns with work I’ve done in the recent past. Testing at scale, at least in my environment, has always been challenging due to the never-ending thirst for resources the customer base imposes. Having a test cluster that’s just for testing, and not pressed into service for a product deliverable, has been a major challenge. Once they see more slots appear they tend to start frothing at the mouth a bit 🙂
Thanks Jason! Indeed, reserving 4 or even 8 racks of compute at almost $1M/rack (of 80 servers) is quite an investment. That point was always a great justification for storage vendors who loan out $1M+ systems for a couple/few months. These are high-stakes evaluations, and indeed there are only 10 or 15 EDA companies which can afford dedicated testing at scale. The way I got allocations was to work with the GRID team to configure one of the production LSF clusters to make a subset of machines exclusive to me from 7AM until 10AM every weekday. It worked out most of the time because the load was down during those hours (EDA designers sleep in and work late). On days when I had no testing lined up, or we were debugging issues with vendors, I could release those dedicated slots with a call/email to a friend. All in all, there is _no_substitute_ for testing in the real world, meaning I never had a pristine/clean-room environment; it was a subset of our production cluster, albeit a subset with some very fine-grained telemetry (10-second interval updates) on the OS and network which was special to it. Only at those resolutions can you see bottlenecks clearly.
I’m working with some prospective customers right now; we are designing and integrating an EDA foundation in the public cloud. I’ll let you know how it goes when I ask the vendors for 10,000 cores for a couple hours of benchmark testing. 😉