A Brief History Of Semiconductor Computing (Part II)

Origin of the Species

Now that we have seen, in Part I, the technical and economic forces that shaped the modern EDA/HPC environment, let’s take a closer look at what happens where all of those forces intersect.

Grid Computing

In the early days of engineering grid computing, it was not unheard of for individual engineering groups to purchase their “own” servers for their departmental use.  A typical scenario was for these engineering organizations to have people who were essentially “shadow IT” managing these resources.  Things worked well, so long as the infrastructure spend wasn’t large enough to draw the attention of finance.  However, as the work got more complex, executives paid more attention to the cost of these IT resources in their departments.  Eventually, these shadow departments got moved into “IT proper” so that the budget could be centralized (and moved out of engineering) and managed accordingly.

However, even after moving the people to IT, the engineering departments usually claimed ownership of the computing and storage resources they had originally funded.  This was understandable enough, but eventually the IT department noticed compute “hotspots” in this manual allocation of resources.  This usually led to a manager talking to different engineering groups and negotiating one group “loaning” another group “their” resources temporarily.  This persisted even after the original assets had been fully depreciated and retired, as the refresh nodes were by lineage still considered to belong to the original funding business group.

Unfortunately, this setup breaks down as you begin to scale.  IT managers began to consolidate their disparate compute clusters into large, single clusters.  This helped smooth out the peaks, as one team would hit a demand valley while another hit a peak.  It also shifted the work of finding the appropriate machine onto the Distributed Resource Management (DRM) tool instead of relying on handshake agreements.

This pleased IT, Engineering Management and Finance.  IT relied on a tool they were already paying for to automate tasks it was designed to automate.  Engineering management didn’t have to spend as much time negotiating resources and generally had better access to a larger pool of resources.  Finance got graphs which showed high compute utilization and a reduction in compute growth rate.  Compute Professionals discovered the “right” graph format to show overall CPU utilization and got used to presenting this information to finance on a regular cadence.  Most shops found that if CPU utilization was below 75-80%, they could not justify new compute.  And if they were above 80%, the screaming from engineering would get so loud (due to the effects of Queuing Theory) that they could convince finance to spend more money on computers.
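
To see why the screaming starts so predictably, here is a minimal sketch using the textbook M/M/c (Erlang-C) queueing model.  The pool size and utilization sweep are invented for illustration, and the model is idealized; it is not anyone’s actual scheduler.

```python
def erlang_b(servers: int, offered_load: float) -> float:
    """Erlang-B blocking probability via the numerically stable recurrence."""
    b = 1.0
    for k in range(1, servers + 1):
        b = offered_load * b / (k + offered_load * b)
    return b


def erlang_c(servers: int, utilization: float) -> float:
    """Probability that an arriving job has to wait in queue (M/M/c model)."""
    a = servers * utilization  # offered load in "server-equivalents"
    b = erlang_b(servers, a)
    return b / (1.0 - utilization * (1.0 - b))


def mean_wait_in_runtimes(servers: int, utilization: float) -> float:
    """Average queueing delay expressed as a multiple of the average job runtime."""
    return erlang_c(servers, utilization) / (servers * (1.0 - utilization))


if __name__ == "__main__":
    FARM_SLOTS = 200  # illustrative pool size, not taken from any real shop
    for rho in (0.70, 0.80, 0.90, 0.95, 0.99):
        wait = mean_wait_in_runtimes(FARM_SLOTS, rho)
        print(f"{rho:.0%} utilized: average queue wait ~ {wait:.4f}x the job runtime")
```

Even this idealized model shows the hockey stick as utilization approaches 100%, and real EDA job streams are far burstier than Poisson arrivals, so in practice the pain starts well before the math says it should.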

And that is why today almost all Engineering HPC shops closely monitor CPU utilization and “think” of compute in terms of cores and utilization.  Managers simply can’t buy more computers until utilization has held above some agreed-upon threshold for a sustained period.  Years of presentations to finance and Executive Management have conditioned Semiconductor Compute Professionals to visualize computing according to these assumptions.

License Management

In the early 1990s, GLOBEtrotter Software came to dominate the license management of EDA on the strength of their FLEXlm product.  It appealed to EDA companies and Semiconductor companies alike.

EDA Converges On An Informal Standard

Every software company on the planet must monetize their software in some way.  In the early 1990s, EDA companies were looking for a way to do that, and their customers told them they wanted FLEXlm.  Specifically, FLEXlm did a great job of enabling the concurrent-use model, which is a perfect fit for the “pleasingly parallel” nature of Semiconductor workloads.

So, EDA companies embedded the FLEXlm client libraries into their tools and began building out their back-end systems around fulfillment of FLEXlm-based licenses.  Their tools won’t run (legally) without being able to contact a FLEXlm-based license server and all the major EDA vendors have teams who manage the back-end fulfillment system.  Most of the people in those departments have never known anything but FLEXlm (now FlexNet Publisher).

EDA companies now generally all have the same revenue model with their largest clients — the 3-5 year contract.  It is textbook MBA finance stuff.  Through these agreements, they have extremely stable, predictable revenue streams for multiple years.  This makes it much simpler to forecast and report out to Wall Street.  Certainly their model allows for smaller, one-off purchases, but the bulk of their revenue is tied up in multiple-year commitments.

Semiconductor Companies and Licensing

There are other license managers out there — RLM, LUM, DSLS, and a smattering of others.  But it is FLEXlm that rules the roost at Semiconductor companies.  Most license teams groan at the thought of supporting one-off license managers (like, say, SlickEdit).  They view these as “non-standard” and some refuse to support them.  Why?  Two major reasons:  operational expediency and report data generation.

Semiconductor companies watch their EDA license spend like hawks.  They are always under pressure to neither under-buy and impact schedules nor over-buy and impact the P&L.  License administration teams straddle both sides of this equation.  Because licenses represent such a significant cost, it’s important for the licenses to be available 24×7, 365 days a year.  Any interruption in service means impaired assets, lost revenue and possible schedule slips.  As such, they have developed very mature systems based around how FLEXlm works.  They know it inside and out and they’ve learned how to keep it up and what its quirks are.  Other license managers typically come from smaller, niche tools.  And thus, non-Flex license managers are viewed as a distraction from the primary goal of keeping the “real” software investment up and running.

Now the flip side of this equation is that most License Management teams have a duty to provide usage data and reporting to Contracts and Finance to monitor utilization.  Once again, license teams have embedded FLEXlm deep into their data collection, ETL (Extract, Transform and Load) and reporting functions.  Most large companies established their reporting flows years ago and have incrementally improved upon them over time.  Others use the Flexera-provided FlexNet Manager for Engineering Applications, which purports to automate the collection, extraction and reporting functions needed to manage EDA software.  Lock-in with Flexera is ingrained in most Semiconductor operations, from both an operational and a reporting standpoint.  However, in recent years other companies have made headway in dislodging Flexera on the reporting side, with RTDA, OpenIT and TeamEDA being the most prominent.
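
As a simplified illustration of what those collection flows actually do, here is a minimal Python sketch that parses the per-feature usage lines printed by lmutil lmstat -a.  The feature name and counts in the sample are made up, and a production flow would add timestamps, error handling and a database load step.

```python
import re

# The per-feature usage line that "lmutil lmstat -a" prints looks like:
#   Users of <feature>:  (Total of N licenses issued;  Total of M licenses in use)
USAGE_RE = re.compile(
    r"Users of (?P<feature>\S+):\s+\(Total of (?P<issued>\d+) licenses? issued;"
    r"\s+Total of (?P<in_use>\d+) licenses? in use\)"
)


def parse_lmstat(text):
    """Return {feature: (in_use, issued)} parsed from raw lmstat output."""
    usage = {}
    for m in USAGE_RE.finditer(text):
        usage[m.group("feature")] = (int(m.group("in_use")), int(m.group("issued")))
    return usage


if __name__ == "__main__":
    # Made-up sample line; a real flow would capture lmutil output on a cadence.
    sample = "Users of SIM_TOOL:  (Total of 50 licenses issued;  Total of 42 licenses in use)"
    for feature, (in_use, issued) in parse_lmstat(sample).items():
        print(f"{feature}: {in_use}/{issued} in use ({in_use / issued:.0%})")
```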

Storage Management

Semiconductor storage teams face several challenges due to how the industry has evolved.  Most of these challenges are in the categories of Provisioning, Performance and Protection.

Provisioning

Engineering teams keep asking for larger contiguous storage for their projects.  They have more corner cases to run, or they are moving to a new technology node, or there is some other new requirement.  The issue is that each NFS mount point is its own island, and that island has only so many disks and so many spindles (unless it’s SSD-based).  In an ideal world, each project would have its own top-level path and everything about the design would reside in that hierarchy.  In reality, however, projects get split up across multiple file servers and multiple project paths.  You don’t want the Design Verification (DV) guys, who run tons of simulations with lots of bursty, short reads and writes, nailing the file server where your Physical Design (PD) guys are doing final signoff to get the design off to the fab.  The workloads are vastly different, and it’s impossible to tune an NFS file server to handle both of them running on it at the same time.

Similarly, storage admins also face the very real problem of global fragmentation.  Most everyone is familiar with defragmenting the hard drive on their computer.  The defragger pulls all the pieces of files scattered throughout the drive and consolidates them into contiguous sectors to improve performance.  The same concept applies to a fleet of file servers.  In aggregate, you may have 1 petabyte of free storage left across all your file servers, but the largest contiguous block on an individual file server may be no more than a few terabytes.  If you’ve got an engineering group that needs 500 terabytes of contiguous space, they either need to figure out how to break things up or you have to buy more storage.  Most finance organizations would look at the reports and conclude that there’s enough free storage to handle the request; IT will just have to move things around to accommodate it, because there’s no room in the budget.
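
A toy sketch makes the disconnect obvious.  The filer names and free-space figures below are invented; the point is only that the aggregate number on the finance report and the largest single chunk of usable space are very different things.

```python
# Invented numbers for illustration: five filers with ~1 PB free in aggregate.
free_tb_per_filer = {
    "filer01": 180,
    "filer02": 95,
    "filer03": 240,
    "filer04": 310,
    "filer05": 175,
}

request_tb = 500  # the new project wants one contiguous 500 TB tree

total_free = sum(free_tb_per_filer.values())
largest_single = max(free_tb_per_filer.values())

print(f"Aggregate free space: {total_free} TB")      # what the report shows
print(f"Largest single filer: {largest_single} TB")  # what the project can actually get
print(f"Fits without splitting or buying more: {largest_single >= request_tb}")
```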

To combat these problems, several different filesystems have been tried and tested:  Lustre, Gluster, GPFS, pNFS.  All of them promise to virtualize your storage back-end and let you migrate data live without downtime or loss of performance.  The promise has been horizontal scaling and ease of management.  And yet, these filesystems have not managed to take hold in this industry.  Why?  Cost is the number one factor.  Storage costs are always under pressure because engineers never have time to delete data and IT staff don’t know what they can safely delete without participation from engineering.

Performance

Storage professionals are constantly poring over throughput and performance metrics, looking to tweak some magic setting that will unlock hidden bandwidth from their servers.  Read sizes, write sizes, cache settings, SSD-based hybrid solutions: anything to let engineers squeeze more performance out of the file systems.  And, failing that, anything that will prevent the engineers from turning the servers into molten piles of slag.  Performance and provisioning go hand in hand.  Engineers will unleash thousands of jobs against a single directory structure which exists on a single file server.  It’s not a malicious act; it’s simply a function of the growing complexity of chip design.  But that doesn’t change the fact that one engineer can bring a filer serving multiple groups to its knees.

Engineering workloads are not homogeneous.  Different phases have different characteristics.  Simulations generate lots of sequential reads and writes plus a steady stream of file-system stat calls against the file server.  Back-end PD jobs have long run times and read and write very large files (tens to hundreds of gigabytes in size).  Other functions have variations on those themes.  The point is that variability is the enemy of optimization.  If you have one kind of workload profile, you can really optimize to that workload’s characteristics.  So you have to segment the population and tune where you can.  Unfortunately, this costs more, so most storage admins have to make do with as much general-purpose infrastructure as they can afford.

Protection

Storage groups today live and die by NFS.  Specifically, NFSv3.  For the Windows users among us, NFS is roughly the UNIX counterpart to CIFS shares.  All modern filers offer NFS and CIFS licensing, and your typical UNIX-focused storage admin will grumble that CIFS is a “chattier” protocol than NFS.  The reason is that CIFS shares carry all of that extra Kerberos authentication traffic along with them, which NFS lacks.  NFS, much like NIS (a network service built in the same era), depends on the assumption that the administrators control both the horizontal and the vertical when it comes to system authentication and authorization.  But the world has changed.  Now it is a trivial matter to set up a Linux virtual machine, issue a few mount commands and, BAM, you can impersonate anyone, because you have root access to the VM.

So why, do you ask, have we not moved to a more secure protocol?  Speeds and feeds.  All that extra encryption overhead slows down access, and anything that slows down results is bad.  But things are changing, and I don’t think it will be much longer until the business decides that the extra overhead is the cost of doing business.  It will probably also be up to IT to figure out how to recoup that lost performance.

One interesting side effect is the collision of legacy storage protocols and modern security requirements.  As we will see shortly, there is pressure on the business to enhance internal data security.  In Linux, this means assigning people to multiple groups and ensuring that permissions on the data set grant access to those groups, but not to the world.  But there’s a catch.  The default authentication mode for NFS, auth_sys, imposes a limit of 16 supplementary groups per user.  This outstanding post from 2011 outlines the challenges and solutions.  Bottom line: developers and administrators are having to prop up an aging technology to keep up with modern requirements (again).
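
For the curious, here is a minimal sketch of the basic sanity check: count how many groups an account belongs to and compare that against the auth_sys ceiling.  It is an illustration only, and it assumes a UNIX/Linux host.

```python
import os

# AUTH_SYS, the default NFSv3 security flavor, carries at most 16 supplementary
# groups in the RPC credential; membership beyond that never reaches the filer.
AUTH_SYS_GROUP_LIMIT = 16

groups = os.getgroups()  # UNIX/Linux only
print(f"This account belongs to {len(groups)} groups.")
if len(groups) > AUTH_SYS_GROUP_LIMIT:
    print(f"Only {AUTH_SYS_GROUP_LIMIT} of them are sent over an auth_sys mount, "
          "so group-based permissions beyond that will appear to fail at random.")
```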

Data Security

As mentioned in Part I, most Semiconductor companies “grew up” in a collegial atmosphere, with the emphasis on engineering productivity through sharing.  Offshoring extended that model across borders and led to large, flat global networks.  Some companies still have problems with mounting NFS shares over the WAN.

Security Professionals at Semiconductor companies have long had to put more emphasis on usability than security, but that trend has been changing in recent years.  Several factors are contributing to the need to lock down data in ways that were unimaginable a few years ago.

The rise of third-party IP imposed obligations on companies that they never had to worry about before.  Every IP contract has its own specific terms about who can use it, when it can be used and how much it costs based on the way it is used.  Suddenly those flat, open networks, with project directories sitting at 777 permissions, were a huge liability, when just a few short years earlier they were an absolute necessity to meet schedules.  Many companies play whack-a-mole, turning top-level project directories to 775 or 770 in order to maintain some kind of control over who can access what.  Unfortunately, in many cases those permissions get opened back up shortly after they get locked down because “something broke”.  Flows are often to blame for this, as they have typically been around for a long time and are complex enough that only one or two people on the flow team actually know all of the ins and outs of the dependencies anymore.  And yet, the need to more rigorously police access to IP remains.  It is a journey many companies are on today.
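
For readers outside of UNIX land, the lockdown step itself is trivial; a minimal sketch, using a hypothetical project path, looks like this.  The hard part, as described above, is keeping it locked down once a flow breaks.

```python
import os
import stat

# Hypothetical project path; rwxrwx--- (0o770) keeps group access, drops "world".
project_dir = "/proj/chip_x"

os.chmod(project_dir, stat.S_IRWXU | stat.S_IRWXG)

current = stat.S_IMODE(os.stat(project_dir).st_mode)
print(f"{project_dir} is now {oct(current)}")  # expect 0o770
```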

Most Semiconductor companies employ contractors (long- and short-term temporary employees) and consultants (defined scope/time experts) to help get designs out the door.  In many cases, these temporary employees come from one of the Big 3 EDA companies (Cadence, Mentor, Synopsys).  And once they are in the door and onboarded, they pretty much have access to anything they want.  Including a look at license utilization for their competitors.  And probably competitors’ IP.  I am not attempting to impugn the integrity of these professionals, but the fact of the matter is that in almost all cases they should not even have that access.  But due to the other historical pressures of openness and legacy flow needs, locking things down is difficult.

More recently, Linux vulnerabilities have become more publicized and have their own marketing campaigns.  Shellshock, Heartbleed, Dirty Cow, and a host of others have made the tech press.  CIOs and Boards of Directors now have security on their minds and are demanding tighter controls from their engineering teams.  Schedules haven’t changed, but the requirement for tightening security and access has.  This leaves it to IT and Information Security Professionals to bridge the gap between security and usability.

The end result, if you are a supplier to a Semiconductor company, is that you will hear an awful lot about the need for security.  It’s an absolute requirement.  Oh, but we need the flexibility to grant exceptions at either an extremely broad or an extremely granular level.  The contradictions in requirements must be enough to make a product manager’s head explode.  But those opposing forces are buried deep inside the use model of a Semiconductor company.  They are wrapped in politics, legacy flows and scripts, and a heritage of openness.

Putting It All Together

The 1980s and 1990s saw the rise of the modern EDA company.  Early versions of tools ran on UNIX and transitioned to Linux because it was an easy jump which increased performance and reduced cost.  As chips got more complex and more compute nodes were needed, shared NFS storage became a necessity, and with the advent of the NFS “appliance”, Engineering IT shops began buying filers at a great clip to keep up with the explosion in data as process nodes kept shrinking.  Internal EDA/CAD teams took on the role of operationalizing EDA software by stitching together flows, typically using Perl.  Customers and suppliers gravitated towards FLEXlm license management for day-to-day operations as well as reporting.  Open networks enabled engineering productivity, and offshoring extended the open-network concept globally.  Third-party IP imposed new duties upon Semiconductor firms, and name-brand Linux vulnerabilities are changing corporate attitudes about the balance between usability and security.

The net effect of all of these forces is an industry that, paradoxically, is building the future based on the past.  The industry has calcified around a core set of technologies and every year the interest payments on technical debt keep accruing.  From a technology perspective, we are witnessing The Innovator’s Dilemma played out on an industry rather than a single firm.  We are iterating for very little incremental gain.

But what forces do we face ahead?  Now that we have analyzed how we got where we are, what is shaping the future?  In the next post, we’ll analyze the market forces shaping the industry itself and their impact on technology adoption.

dmagill

Derek Magill has been in Engineering IT for over 20 years, starting out on the UNIX Help Desk at Texas Instruments. While at TI, he held several technical and leadership roles, mainly focused on the areas of license management and HPC. While at Qualcomm, he led the global EDA License Infrastructure team, the Grid Administration Team, and was the primary Engineering Cloud Architect. He currently is an HPC Solutions Architect at Flux7 Labs, a DevOps Cloud Consultancy. Derek has also served as the Chairman of CELUG since 2015 and is the Executive Director of the Association of High Performance Computing Professionals. He also is a member of the Executive Committee of the 2020 Design Automation Conference.
