Storage as just one COG in the Cloud

Part One

Prelude

We define a Cloud Oriented Grid (COG) as a Compute Grid that is designed to operate independently, includes a well-integrated data replication mechanism, and can easily expand or contract in size. This is a two-part blog: the first part provides context and background to outline the problems of the first-generation attempts, and the second part will give solutions.

Background

Compute Grids contain the fundamental elements of Compute, Network, Storage, Scheduler, and Applications. Common Schedulers are LSF, SGE, Slurm, and Moab. For Electronic Design Automation (EDA), Compute platforms include custom compute (>768GB – 2TB) and general compute (4 – 8GB/core, high GHz). To explain a bit more, general compute is used for the Design Verification (DV) phase of EDA, where the memory requirements per job are normally under 4GB, and the large-memory or custom compute machines are used for Physical Design (PD) apps, which consume 256GB to 2TB of RAM.

Problem

A very important component in EDA has been Storage IO Bandwidth. Long ago, you could see 40 to 50 engineers sharing a single NFS filer. Since about 2005, CPU cores have become more plentiful and powerful, and it is now common to see a single user run 2000 concurrent jobs against a filer and drive it into IO saturation: the point where filer latency induces CPU/runtime inefficiencies and makes jobs run longer. Longer-running jobs also mean more license time, which is an easily understood cost increase.
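To make the cost of IO saturation concrete, here is a minimal sketch of the arithmetic. It assumes single-threaded jobs, one license checked out per job for its whole wall-clock runtime, and that all time not spent on CPU is IO wait. The function names and the example numbers are hypothetical, not measurements from any real grid.

```python
def cpu_efficiency(cpu_seconds, wall_seconds):
    """Fraction of wall-clock time the job spent on CPU (1.0 = never waited)."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return cpu_seconds / wall_seconds


def wasted_license_hours(cpu_seconds, wall_seconds, jobs):
    """License-hours spent waiting rather than computing, across identical jobs."""
    waiting = max(wall_seconds - cpu_seconds, 0.0)
    return jobs * waiting / 3600.0


# Hypothetical example: each job needs 1800 s of CPU but takes 2400 s of
# wall-clock time because the filer is saturated.
eff = cpu_efficiency(1800, 2400)
wasted = wasted_license_hours(1800, 2400, jobs=2000)
print(f"CPU efficiency: {eff:.0%}, wasted license-hours: {wasted:.1f}")
```

Under these assumed numbers, a quarter of every license checkout is spent waiting on storage, which is the kind of figure that tends to get attention when justifying a faster filer.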

Experience

A few years ago, I was the technical lead for a project to survey and evaluate Parallel filesystems as replacements for our traditional Netapp filer, which was not keeping up with IO demand. This journey is long and expensive, and Parallel filesystems are more complex than NFS. In the interim of this long-running evaluation, we selected a higher-performance and less costly NFS server (Oracle ZFS Storage Appliance), which had the differentiating benefits of DTrace Analytics, real-time compression, and Hybrid Storage with very effective read and write SSD caches. The compression actually made the disk IO smaller, and coupled with the excellent cache, the 7200RPM disks (at 1/6 the cost/GB) performed as well as the 10K RPM or even 15K RPM disks which were in common use with Netapp to get higher performance! This meant that we could use larger and cheaper Nearline SAS disks with the ZFS ZIL and L2ARC caches, and with compression reducing the IO size, it was 30% faster than traditional fileservers. If you multiply 30% performance, 30% cost reduction, and 300% storage savings (from compression), you come up with a competitive platform which was much more effective.

Netapp has recently caught up in the performance domain with FlexGroups, which can be called an interim step towards Parallel NFS (pNFS). It effectively makes an all-active cluster of NFS heads serve (via round-robin DNS) the same filesystem to a large number of clients in the Compute Grid. Isilon had been doing this for a number of years, and Pure Storage started doing the same with FlashBlade in 2017. They call it Scale-out NFS. In a large compute grid, if the network is properly balanced, it is not necessary to have a single client striping files across many different heads (one of the features of pNFS). You just need many clients to be distributed across many servers. Because of this, there is little attraction to pNFS for EDA, and Scale-out NFS will likely be the next step up.
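The "many clients across many heads" idea can be illustrated with a toy model. This is only a sketch: the node and head names are made up, and real round-robin DNS is less even than a strict rotation because of resolver caching and mount timing.

```python
from collections import Counter
from itertools import cycle


def assign_round_robin(clients, heads):
    """Map each client to a head in strict rotation (an idealized round-robin)."""
    rotation = cycle(heads)
    return {client: next(rotation) for client in clients}


# Hypothetical grid: 2000 compute nodes mounting from an 8-head cluster.
clients = [f"node{i:04d}" for i in range(2000)]
heads = [f"nfs-head{i}" for i in range(8)]

mounts = assign_round_robin(clients, heads)
load = Counter(mounts.values())
for head, count in sorted(load.items()):
    print(head, count)  # each head serves 250 clients in this idealized case
```

No single client stripes across heads here, yet the aggregate load is spread evenly, which is the property that makes pNFS-style striping less compelling for this workload.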

Earlier this year, I was the lead benchmark and performance evaluator for in-house testing of these Scale-out NAS products. If you have the need, budget, and resources to implement and support them, these solutions work extremely well. Again, they are more complex than the traditional two-headed NFS server of the past, but the vendors try their best to make them simple to admin. (A plug for Pure Storage, which wins the most-simple-to-admin award.) These Scale-out NFS servers are still not true parallel filesystems like Spectrum Scale (formerly GPFS) and Lustre. There are other parallel filesystems such as BeeGFS, WekaIO, Ceph, and Gluster, but for maturity and performance for EDA, GPFS and Lustre are the preferred options. Both have a very rich lineage going back many years, with a talented developer and support base. In a previous blog, “Benchmarking EDA Storage – Not as Straightforward as you’d like”, I covered some of the unexpected quirks of the Black Art of benchmarking.

Context

Back to the COGs… Cloud Oriented Grids are not designed to be long-standing monolithic behemoths, which is what most in-house private cloud EDA Grids are becoming. COGs are designed to be self-sufficient, portable, and scalable. Modular design is a good term to characterize them, with a headstrong goal of K.I.S.S. The EDA lifecycle goes something like this: design, synthesis, and verification/simulation are all part of “frontend design”. After that comes “backend design”, which includes physical design, layout, place-and-route, and timing. Once all that is done, you can get to emulators, which run on massive quantities of FPGAs that make lots of heat, and emulate the chip, running actual firmware and software, albeit at a painfully slow clock speed. Emulators are not a massive IO bandwidth hog: they require a lot of data, but that data is pre-staged, compiled into the emulation target, and uploaded to the FPGAs to run. Modern emulators of 2018 require more storage bandwidth, but it is still small compared to the tens of GB/sec of bandwidth needed by the other apps running across thousands of compute nodes.

Emulators are very expensive and specialized (and a highly guarded competitive advantage), and are not likely to make it into a Public cloud. There are also a bunch of “interactive” jobs, like writing RTL/VHDL or running GUI-based programs such as ICFB, which are not well suited to long-distance latency and remote desktop from the Cloud. That leaves batch jobs such as Verification and PD as the remaining COG possibilities.

As previously mentioned, the two are very different beasts. Design Verification is small, short-running, and single-threaded, wants a fast CPU clock speed, and needs to run thousands of jobs at a time into the same filesystem. About 95% of the scratch data written can be purged within a few days after the analysis of the batch job has been run. PD is large-memory, sometimes multi-threaded, and usually long-running (6 hours, 2 days, or even 2 weeks is not uncommon). Those boxes are very expensive because of the memory requirements. The peak PD phase is only heavy for 4 to 8 weeks. StarRC (parasitic extraction) deals with the final products (netlists and databases) during the tape-out phase. Its IO profile is more of a large-block, well-formed IO.
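Since most DV scratch data is disposable after a few days, an age-based purge is a natural fit. The following is a minimal sketch, assuming file modification time is a good enough proxy for "no longer needed". The function names and the retention period are hypothetical, and on a busy filer a full tree walk like this adds its own metadata load.

```python
import os
import time


def find_stale_files(root, max_age_days, now=None):
    """Yield paths under `root` last modified more than `max_age_days` ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_mtime < cutoff:
                    yield path
            except FileNotFoundError:
                continue  # file vanished while we were walking the tree


def purge(root, max_age_days, dry_run=True):
    """Return stale paths; delete them only when dry_run is False."""
    stale = list(find_stale_files(root, max_age_days))
    if not dry_run:
        for path in stale:
            os.remove(path)
    return stale
```

Calling `purge(scratch_root, 3)` with the default `dry_run=True` only reports what would be removed, which is a sensible first step before pointing it at real project data.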

Complexities

StarRC has had file-locking issues when running on unfamiliar storage or OS, which are challenging to debug and resolve. These problems occur only when multiple jobs running on a machine try to access the same file lock. StarRC uses locks in a commonly shared NFS path to signal and control subsequent job start and stop. There can be four or so StarRC jobs running on the same machine, and another 200 machines with four jobs each. At this scale, under NFSv3, we have seen coherency issues across the jobs. Sometimes it was client-side NFS bugs in Linux, and sometimes it was on the NFS server side. In any case, debugging these across Scale-out NFS is very time-consuming. NFSv4 has a different and more stable locking mechanism: locking is built into the NFS protocol itself rather than handled by the separate NLM protocol. Because of this, the locking coherency problems are not seen there, although NFSv4 has performance implications in many cases.
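For readers unfamiliar with lock files on a shared path, here is a generic sketch of the pattern using POSIX advisory locks from Python. This is not StarRC's actual mechanism, which is not described here; the `shared_lock` helper and the lock file name are hypothetical, and the example uses a local temp directory where a real flow would use a path on the shared filesystem.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def shared_lock(path, blocking=True):
    """Hold an exclusive POSIX advisory lock on `path` for the with-block."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        flags = fcntl.LOCK_EX if blocking else fcntl.LOCK_EX | fcntl.LOCK_NB
        # In non-blocking mode this raises OSError if another process holds it.
        fcntl.lockf(fd, flags)
        try:
            yield fd
        finally:
            fcntl.lockf(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)


# Hypothetical lock file; in a real flow this would live on the shared NFS path.
lock_path = os.path.join(tempfile.gettempdir(), "stage1.lock")
with shared_lock(lock_path) as fd:
    os.write(fd, f"held by pid {os.getpid()}\n".encode())
```

POSIX locks like this are per-process, so they coordinate separate jobs rather than threads within one job. Over NFSv3 each lock request travels through the separate lock manager, which is where the coherency problems described above tend to surface.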

There are also some other advanced (and actively developed) features of NFSv4 which can make it tough to deal with, such as file delegations and GSS-auth. Both are very attractive, but not easily manageable at large scale. NFS over RDMA is also a promising feature (and does work well in certain cases), but I think it will be a couple of years before we see it in common use for EDA. The point of all of this is that when attempting new or unfamiliar storage, and/or possibly moving to the cloud, you need to pick the right vendors and be aware that debugging issues may not be trivial.

To be continued…

I hope this piques your interest, and maybe resonates with some of your experiences. In Part Two, I will talk about the new-generation solutions and integration points which will address the challenges with first-generation cloud. This second part will provide the foundation to enable EDA in the cloud, with minimal change and disruption to users. Feel free to respond in the comment section, or in direct-mail, or via LinkedIn.

rmallory

Rob Mallory has 20+ years in the tech industry, including EDA, DOD/HPC, and tech, specializing in advanced Storage and Filesystems, Networks, and Grid Telemetry. He is currently branching out as CEO and Lead Performance Architect for the consulting group RPM Systems, after 15 consecutive years at a large EDA company as the lead hardware-engineering-compute architect. He has a passion for fast cars, fast computers, BBQ, and other things which go fast.
