The Origin of the Species -- How Today's Semiconductor Environments Evolved
The current paradigm for Semiconductor computing dates back to the 1990s. It was a heady time for the industry: the Internet was a nascent technology and PCs were becoming commonplace in both businesses and homes. There was growth everywhere for chip makers and EDA software developers.
The Origins of Grid Computing
During the 1990s, the industry embraced UNIX as its platform of choice for scientific computing. Sun Microsystems’ Solaris was the most popular choice, but there was a good showing for HP-UX from Hewlett-Packard as well as AIX from IBM. By the early 2000s, Linux had matured enough that many considered it a “good enough” alternative to enterprise-grade UNIX, and servers running an Intel or AMD chip were dramatically cheaper and faster than their “big iron” counterparts. Around the same time, IBM famously announced it would spend $1 billion on Linux over the following year. The psychological effect of seeing a “big name” like IBM make such a public endorsement of the upstart operating system helped convince IT upper management to give Linux at least a look. EDA software suppliers began porting their code to it. And there were finally “throats to choke” when it came to commercial OS support: Red Hat and SUSE. The confluence of these events led to the eventual industry migration from proprietary UNIX to Linux. While the industry now runs on Linux, UNIX was the bedrock upon which the current computing model was built.
Another key component to today’s architecture is the rise of the Distributed Resource Manager. As chip design became more complex, engineers needed to run hundreds or even thousands of jobs against their designs. Having scripts which scheduled where a job ran was not feasible, and so IT administrators cast their nets in search of tools to manage this workload. Several tools existed, including Platform LSF, Network Computer, PBS, MOAB, Sun Grid Engine (SGE) and Condor. Most companies selected Platform (now IBM Spectrum) LSF, though SGE (now Oracle Grid Engine) and Network Computer have a presence in the industry. DRM tools gave engineers access to 24×7 computing. A common use case was for a Design Verification Engineer to queue up thousands of jobs at the end of the day, let the jobs run overnight and come in the next morning to review the results. Of course, all of the data that these jobs generated had to be stored SOMEWHERE…
UNIX had the ability to see the same file system across many different servers using the Network File System (NFS) protocol. This was an ideal use model for many different types of Semiconductor workloads, as it allowed for increased parallelism and rudimentary “message passing” between processes running on scale-out servers. Companies such as Auspex, Network Appliance and EMC began offering “appliance” servers which were tuned for massive NFS operations, relieving system administrators from building their own file servers out of UNIX systems with haphazard SCSI disks hanging off of them. Auspex introduced the first network file server devices to the market, running a version of the Sun Solaris operating system. This was a nice interface for EDA HPC admins who had already embraced UNIX as their operating system of choice. Auspex flourished for a time, until competitor Network Appliance brought a new, focused OS called Data ONTAP to market in servers affectionately known as “toasters” (embracing the “appliance” aspect of their name). The OS was custom-built and vastly simpler to use. NetApp eventually supplanted Auspex for dominance in the Semiconductor industry.
A fundamental assumption most companies made about data security was that UNIX servers could be trusted to present only valid user credentials to an NFS server. For a time this was valid, as only IT departments could procure the expensive equipment and configure it to use network resources. This convenient assumption worked well for the NFSv3 protocol, which inherently trusts that the credentials presented by a machine which can mount a volume are correct (per the AUTH_SYS mechanism). With the rise of Virtual Machines and easy-to-install Linux distributions such as Ubuntu, getting a machine to access files that normally would be “off limits” due to UNIX permissions is trivial unless precautions are taken with export rules and a judicious use of netgroups. NFSv4 is supposed to fix this, but the protocol has been under debate for a long time and industry uptake has been lackluster at best.
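The trust problem can be illustrated with a toy model. The sketch below is not NFS server code; it only mimics the logic described above, in which the server applies ordinary UNIX permission bits to whatever uid and gids the client claims. All names (AuthSysCred, FileMeta, serve_read, the host names and ids) are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthSysCred:
    """Identity the client asserts; in this model the server never verifies it."""
    uid: int
    gids: tuple


@dataclass(frozen=True)
class FileMeta:
    owner_uid: int
    group_gid: int
    mode: int  # UNIX permission bits, e.g. 0o640


def may_read(cred: AuthSysCred, meta: FileMeta) -> bool:
    """Classic owner/group/other read check against the claimed identity."""
    if cred.uid == meta.owner_uid:
        return bool(meta.mode & 0o400)
    if meta.group_gid in cred.gids:
        return bool(meta.mode & 0o040)
    return bool(meta.mode & 0o004)


def serve_read(client_host, cred, meta, allowed_hosts) -> bool:
    """Export rule first (which hosts may mount at all), then the trusted credential."""
    if client_host not in allowed_hosts:
        return False
    return may_read(cred, meta)


# A design file readable only by its owner (uid 1001) and group 500.
design = FileMeta(owner_uid=1001, group_gid=500, mode=0o640)

honest = AuthSysCred(uid=2002, gids=(100,))
# Anyone with root on their own VM can create a local account with uid 1001.
spoofed = AuthSysCred(uid=1001, gids=(100,))

open_export = {"compute01", "rogue-vm"}   # any machine on the network may mount
tight_export = {"compute01"}              # only IT-managed hosts may mount

print(serve_read("rogue-vm", honest, design, open_export))
print(serve_read("rogue-vm", spoofed, design, open_export))
print(serve_read("rogue-vm", spoofed, design, tight_export))
```

In this model the permission bits stop the honest user but not the one who simply claims uid 1001; only the export rule, which limits the hosts allowed to mount, closes the gap.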
Electronic Design Automation (EDA) Software, Licensing, and Flows
Many older semiconductor companies once built their own Electronic Design Automation (EDA) software. The 1980s saw the ascendance of companies specializing in the creation of EDA software. Companies such as Daisy Systems, Mentor Graphics and Valid Logic Systems (collectively referred to as DMV) pioneered the software that automated the tasks of drafting and laying out designs, which had been done by hand up to then. Other companies such as Cadence Design Systems, Magma Design Automation and Synopsys came to dominate the field by the 1990s. Semiconductor companies quickly decided it was better to buy than to build (with the notable exception of IBM) and have never looked back.
Semiconductor companies no longer design EDA software in-house, though almost all have teams developing “flows” which take EDA tools from different suppliers, or tools with different functions, and stitch them together with scripts (usually Perl, to the dismay of anyone under 30). A flow encodes a logical workflow: it takes inputs from engineers on where the data lives and then sets up the environment so that the engineer doesn’t have to think about setting all of the correct command line switches or environment variables. Most flows have evolved over many years and are intimately tied to the specific IT environments in which they developed. The complexity of some flows can rival that of traditional compiled software, and small deviations in the IT environment, or even in the outputs from tools in the chain, can cause a cascade of failures throughout a flow.
One early problem that needed to be solved was the licensing model for EDA software. There were several third-party license management tools available which kept track of how many seats were available for use at any given time: FLEXlm, Elan, NetLS and probably others. Matt Christiano, the original author of FLEXlm and later the Reprise License Manager, has an excellent blog series detailing the history of license managers. The EDA industry informally “standardized” on FLEXlm, which ran on all the popular UNIX distributions as well as Windows. FLEXlm not only gave companies a rigorous way of enforcing licensing terms on a technical level (unlike the Byzantine and “honor-system” based licensing from Microsoft, for instance), but it also provided back-end systems which EDA companies could plug into their financial systems as license capacity was purchased. Through its SAMSuite reporting application, FLEXlm also gave companies detailed checkin/checkout information on who used software. Many large companies developed systems to automatically extract this information and report on it, further embedding the technology into the fabric of their infrastructure. Most IT personnel who managed licenses were pleased to see the industry consolidate down to a single supplier, as the multitude of incompatible solutions had inconsistent quality and added operational complexity.
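The seat-counting idea behind floating licenses can be sketched in a few lines. This is a toy model only and does not represent the FLEXlm API or its log format; the class SeatPool, the feature name hypothetical_sim and the event labels are all assumptions made for illustration.

```python
class SeatPool:
    """Toy floating-license pool: a fixed number of seats for one feature."""

    def __init__(self, feature, seats):
        self.feature = feature
        self.seats = seats
        self.in_use = []  # one entry per checked-out seat
        self.log = []     # (event, user) records, the raw material for usage reports

    def checkout(self, user):
        """Grant a seat if one is free; otherwise record a denial."""
        if len(self.in_use) >= self.seats:
            self.log.append(("DENIED", user))
            return False
        self.in_use.append(user)
        self.log.append(("OUT", user))
        return True

    def checkin(self, user):
        """Return a seat previously checked out by this user."""
        if user not in self.in_use:
            return False
        self.in_use.remove(user)
        self.log.append(("IN", user))
        return True


pool = SeatPool("hypothetical_sim", seats=2)
results = [pool.checkout("alice"), pool.checkout("bob"), pool.checkout("carol")]
pool.checkin("alice")
retry = pool.checkout("carol")

for event, user in pool.log:
    print(event, user)
```

The third checkout is denied because both seats are taken, and it succeeds once a seat is checked back in. The log of checkouts, checkins and denials is the kind of record that usage reporting is built on.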
Connecting the Dots — The Network Is the Computer
John Gage from Sun Microsystems coined the phrase “the network is the computer”, and that certainly described the important role that the network played in the evolution of Semiconductor HPC architecture. Networking evolved rapidly in the 1990s. Old corporate networks ran on protocols like Token Ring, which is the basis of one of my favorite Dilbert strips. But newer technologies like 100 Mb Ethernet finally took small collections of engineering computers out of “data closets” and put them into bona fide datacenters. HPC professionals told engineers that they could see dramatically faster performance if they ran their jobs inside the datacenter, with its speedy 100 Mb connections between compute servers and file servers. Co-locating the compute and storage to reduce data access latency over NFS made a tremendous amount of sense. These were all well known and understood protocols and concepts. But it was truly the fast, scalable LAN environment which allowed the model to flourish.
During this time, Semiconductor companies such as Texas Instruments began hiring engineers in Bangalore, India. This meant that the datacenter building blocks created in the US had to be replicated on the other side of the world, along with the data. Now, instead of having a single large datacenter, companies had two or more. They began hiring talent wherever they could find it. And of course, if you are going to all of the trouble of hiring someone in another timezone, they need to be productive during their work hours. The push was on to enable engineering productivity wherever engineers sat. Design schedules got increasingly tight and were tied to critical time periods such as the Christmas season or Golden Week in China. Engineers needed access to any data they required at any time, day or night, in order to meet schedules.
Most engineering managers viewed the perimeter firewall as the only place where IT had any business slowing access to data. Anything that hindered an engineer on the internal network from gaining access to data was a productivity killer and had to be remedied immediately. There were two underlying assumptions to this mentality: that anyone who was inside the firewall was trustworthy, and that the company owned all of the intellectual property it used to create products. Enlightened readers will recognize the fallacy of the former. Regarding the latter, the rise of third-party IP created legal obligations that in many cases were not enforceable by the IT constructs in place. Because there has never been an engineering director who was praised for getting a chip out late while following the proper security protocols, most IT administrators did the best they could to balance data security and productivity. Usually, though, the resulting controls are a series of hacks and kludges which are not robust enough to survive modifications to the environment.
Stitching the Pieces Together
Now we have all of the essential components that make up the architecture that virtually all Semiconductor companies use in house today: compute, storage, EDA software, licensing, and networking. We also have the economic pressure of the offshoring movement. All of these components came together to create the unique structure of Semiconductor HPC workloads: scale-out Linux servers connected via Ethernet to NFS NAS appliances, in typically open environments with few security controls around data. In Part II, we will analyze how all of these pieces fit together and what some of the consequences are.