Benchmarking EDA Storage – Not As Straightforward As You’d Like
Most Electronic Design Automation (EDA) apps are not considered traditional High Performance Computing (HPC) apps, which run with Message Passing Interface (MPI), but it can be argued that EDA is indeed HPC, as it demands similar compute and IO resources. IO is primarily with a POSIX filesystem, usually NFS based because of its simplicity. [Future blog post: how EDA can escape POSIX] It is well known that higher IO latency is detrimental to JobEfficiency (CPUtime/Runtime), and in EDA you want efficiency over 95% for batch jobs. Measuring this is very simple from your scheduler's job-finish records, but in a large grid with many NAS filers, it becomes challenging to detect under-performing batch jobs and associate them with a particular IO resource deficiency. Furthermore, without automated toolsets and Storage-Aware-Scheduling, the simplest way to address this is with Scale-out-Storage. Measuring the useful IO-capacity of the old and new systems is also a daunting effort, but recently the members of the SPEC consortium made the first attempt at creating an EDA-specific workload for use in benchmarking.
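As a rough illustration of the JobEfficiency check, here is a minimal Python sketch. The JobRecord fields and the filer label are assumptions made for this example; real job-finish records differ by scheduler, and tying a job to a filer is exactly the hard part described above. It also assumes single-slot jobs, since multi-threaded jobs can report more CPU time than runtime.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """Hypothetical, simplified job-finish record (field names are assumptions)."""
    job_id: str
    cpu_seconds: float   # user + system CPU time consumed by the job
    run_seconds: float   # wall-clock runtime of the job
    filer: str           # NAS filer the job mostly used (assumed to be known)


def job_efficiency(job: JobRecord) -> float:
    """JobEfficiency = CPUtime / Runtime, as defined in the text."""
    if job.run_seconds <= 0:
        return 0.0
    return job.cpu_seconds / job.run_seconds


def low_efficiency_by_filer(jobs, threshold=0.95):
    """Group the IDs of jobs below the efficiency threshold by filer."""
    flagged = {}
    for job in jobs:
        if job_efficiency(job) < threshold:
            flagged.setdefault(job.filer, []).append(job.job_id)
    return flagged
```

A filer that collects many flagged jobs is a candidate for a closer look, though low efficiency can have causes other than IO latency.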
There are a few different IO profiles depending on the type of work. SPECsfs 2014 SP2 (released in December 2017) has a benchmark profile called “EDA”. Within it are a couple of synthetic representations of Frontend (Design Verification, or DV) and Backend (Physical Design (PD), including parasitic extraction). DV is traditionally small-memory per job, approximately 4GB, with very chatty smallfile writes, while backend jobs are traditionally large-memory (20GB+) with large-file streaming (the extract part). Because of this, it is sometimes desirable to dedicate smaller-memory machines to DV (maybe 128GB @ 16 cores, or 256GB @ 40 cores on a dual-processor machine) and to dedicate a separate cluster of large-memory machines, say 512GB to 1.5TB, to Physical Design (PD). Another strategy, which is becoming more popular, is to combine the workloads on the same machines, running large-memory PD jobs and then filling the left-over memory slots (20GB ±) with DV jobs. SPECsfs attempts to simulate all three of these scenarios, running just Frontend, then Backend, followed by the mixed workload. [The above is open to discussion, there are many architectural strategies and optimization opportunities, which are constantly evolving. Maintaining an EDA grid is akin to painting the Golden Gate Bridge, starting at one end, and taking a year to reach the other, but using a different color every couple of months.]
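The mixed strategy can be sketched as simple arithmetic: place the large-memory PD jobs first, then fill whatever memory and cores remain with DV jobs. The function below is a hypothetical illustration, not how any particular scheduler works. It assumes one core per job, and the PD job size in the usage note is a made-up value.

```python
def fill_with_dv(host_mem_gb, host_cores, pd_job_mem_gb, dv_job_mem_gb=4):
    """Return (pd_jobs, dv_jobs) for one host.

    Assumptions for this sketch: every job takes one core, PD jobs are
    placed first, and DV jobs only use what is left over.
    """
    # PD jobs are limited by memory first, and by cores as a fallback.
    pd_jobs = min(host_mem_gb // pd_job_mem_gb, host_cores)
    free_mem_gb = host_mem_gb - pd_jobs * pd_job_mem_gb
    free_cores = host_cores - pd_jobs
    # DV jobs fill the left-over memory, bounded by the remaining cores.
    dv_jobs = min(free_mem_gb // dv_job_mem_gb, free_cores)
    return pd_jobs, dv_jobs
```

For example, fill_with_dv(512, 40, 120) places four 120GB PD jobs and leaves 32GB, enough for eight 4GB DV jobs on the same host.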
The DV smallfile IO profile mentioned above has a few very interesting variants. The tool output is usually smallfile (4K to 128K, with 32K IO size) well-formed IO writes, but some workflows additionally redirect the tool's STDOUT to the same scratch storage. This STDOUT stream is in addition to the well-formed IO output from the tool, and it is considered mal-formed IO, meaning it is all lseek-append with an average IO size of 1K. The job workload now has a bi-modal IO profile. Linux sometimes does a good job of buffering this stream and blocking it into well-formed IO, but sometimes not. This problem compounds when DV jobs run at a large scale (up to 2000 or more concurrent jobs from one user). The NAS storage is then presented with random smallfile writes, the most challenging workload of all. At the same time, there can be a large number of deletes going on in scratch storage, which impacts performance and creates latency for all of the contending IOs.
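To make "blocking it to be well-formed IO" concrete, here is a toy Python sketch that collects small appends and emits them in 32K blocks. It only illustrates the idea; it is not a model of what the Linux page cache or NFS client actually does, and the class name and counter are invented for this example.

```python
class CoalescingWriter:
    """Toy writer that turns many small appends into fixed-size block writes."""

    def __init__(self, raw, block_size=32 * 1024):
        self.raw = raw                # any object with a write(bytes) method
        self.block_size = block_size  # 32K, the well-formed IO size in the text
        self.buf = bytearray()
        self.block_writes = 0         # writes actually issued to the sink

    def write(self, data: bytes) -> None:
        """Accept a small append and emit only full blocks."""
        self.buf += data
        while len(self.buf) >= self.block_size:
            self._emit(self.block_size)

    def _emit(self, n: int) -> None:
        self.raw.write(bytes(self.buf[:n]))
        del self.buf[:n]
        self.block_writes += 1

    def close(self) -> None:
        """Flush whatever partial block remains."""
        if self.buf:
            self._emit(len(self.buf))
```

With this writer, sixty-four 1K appends reach the sink as two 32K writes instead of 64 small ones, which is the difference between mal-formed and well-formed IO from the storage side.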
The SPECsfs benchmark becomes useful when you have exclusive (and well understood) resources of Compute, Network, and Storage, working on a scale which matches the expected IO capacity of your NAS.
Currently, with Scale Out NFS (Isilon, Pure Storage, and NetApp FlexGroup) it is possible to achieve the scale-out performance that was once attainable only with parallel filesystems such as Spectrum Scale GPFS, Lustre, Panasas, and recently BeeGFS. Parallel filesystems are a provocative topic of their own, likely another blog post in the future.
The SPECsfs benchmark can be run on 120 or so nodes, and can create a small (four jobs per host) workload of 5GB/sec, or a large (16 or more jobs per host) workload close to 20GB/sec. The key metrics out of it are throughput and latency. SPEC prohibits any publication of benchmark results outside of their rigid audit and review. But in general, look for the highest throughput at a desired level of latency. Latency is the key indicator of storage (or network) subsystem saturation, and testing at lower job densities (four jobs per machine) can help predict which system will be most efficient at a full load of 16 or even 40 jobs per machine. Keep testing with increasing load factors until latency exceeds some pre-determined threshold, at which point you have the most illuminating metric: the throughput capacity of the Storage Device Under Test (DUT).
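The ramp-until-saturation procedure can be sketched in a few lines of Python. The function name, the tuple layout, and the numbers in the usage example are all placeholders made up for illustration; they are not SPECsfs output or measurements of any system.

```python
def throughput_capacity(samples, latency_limit_ms):
    """Return (load, throughput) of the last run within the latency limit.

    samples: iterable of (jobs_per_host, throughput, latency_ms) tuples,
    one per benchmark run. Returns None if even the lightest load is
    already over the limit.
    """
    best = None
    for load, throughput, latency_ms in sorted(samples):
        if latency_ms > latency_limit_ms:
            break  # saturation reached; heavier loads are not considered
        best = (load, throughput)
    return best


# Placeholder values only, in arbitrary units, to show the shape of the data.
example_runs = [(4, 5.0, 1.0), (8, 9.0, 1.5), (16, 14.0, 4.0), (32, 15.0, 12.0)]
capacity = throughput_capacity(example_runs, latency_limit_ms=5.0)
```

In this made-up series the 32-job run adds almost no throughput while latency triples, so the 16-job run is reported as the capacity point.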
Be aware that synthetic benchmarks can highlight vendor-specific characteristics, for better or worse, and they do not necessarily reflect the results you will see in the real world. There is no substitute for real-world application testing. In the future, I plan to write a blog post about the endless quest to calibrate synthetic tests to real-world application testing.