Choke Points -- Interest Payments on Technical Debt
Most of us are familiar, at least instinctually, with the concept of technical debt. As an industry, we are all collectively making interest payments on close to 40 years of it, and it’s taking its toll as we try to scale out in order to deal with the ever-increasing amounts of compute and storage required to design to smaller geometries. Design decisions that made sense when you could count on internal LAN environments that scaled to hundreds of nodes and a few storage appliances break down as you move into a world where you need to scale out to thousands of compute servers with dozens of storage heads.
All of us at one point or another live with what I call Choke Points.
Let’s start by going through a litany of issues that IT Professionals and Engineers alike dread seeing.
NFS server zeus not responding still trying
Engineers know that when they see this, it’s time to get some coffee because they’re not going to get any work done for a while. Storage Professionals know that when they see this, it’s time to call their spouse to tell them that they’re going to be home late. Recall from earlier that most companies are using storage from a handful of vendors. They all have their selling points, but they all also have single “heads” which are usually in some clustered mode for high availability. Those single heads can only handle so many transactions, IOPS, or megabytes per second before they can’t keep up anymore. And when that happens, every node in the network that’s trying to talk to that file server sits there spitting this message out continually. Restoring service usually means hunting down the set of jobs that are hammering the file server and killing them, assuming that the problem is not due to a hardware failure or some glitch in a takeover event. Point is, your multimillion-dollar investment in a compute grid is at least partially idle until this problem is addressed.
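Hunting down the jobs that are hammering a file server usually starts with ranking clients by how many operations they are issuing. Here is a minimal sketch of that ranking step in Python. It assumes you already have per-host operation counts from somewhere (your filer's statistics or your own sampling). The (host, ops) input format, the function name top_talkers, and the host names are all hypothetical and not tied to any vendor tool.

```python
from collections import Counter


def top_talkers(samples, limit=3):
    """Sum operation counts per host and return the heaviest hosts first.

    samples: iterable of (host, ops) pairs. This format is an assumption
    made for the sketch, not the output of any particular tool.
    """
    totals = Counter()
    for host, ops in samples:
        totals[host] += ops
    return totals.most_common(limit)


if __name__ == "__main__":
    # Hypothetical samples collected over a few polling intervals.
    samples = [
        ("compute0412", 18000),
        ("compute0007", 350),
        ("compute0412", 22000),
        ("compute1130", 9000),
    ]
    for host, ops in top_talkers(samples, limit=2):
        print(host, ops)
```

Once you know which hosts are generating the load, you can map them back to jobs through your batch scheduler and decide what to kill.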
batch system daemon not responding … still trying
This particular message is very LSF-specific, but surely Network Computer and Univa Grid Engine have similar messages. This is yet another error message that strikes fear into the hearts of engineers and Compute Professionals alike. When this message is showing up, no new work is getting scheduled to compute nodes in the cluster. It typically doesn’t cause running jobs to die, but throughput goes down substantially when your batch scheduler is under so much load that it can’t respond to requests anymore. Batch schedulers live on a single node with an optional “hot spare” waiting to take over. But if there’s more network traffic coming in than a single system can handle, failing over usually doesn’t help, because the spare gets hit with the same load. Until you get this fixed, your entire compute infrastructure is sitting there, waiting for more work to hit.
Can’t connect to license server (-15,12:61) Connection refused
License servers are definitely single points of failure. While Flexera offers its “triad” solution, there is at least one major EDA supplier who recommends that you keep all three servers on a single router so that they don’t lose contact with each other and lose their quorum. Most shops use singleton mode because it’s easier and generally reliable enough. Until it isn’t. And once a license server starts spitting out this message, you’re on a timer until jobs start dying because they can’t connect to the server to get their heartbeat. A license server going down keeps huge swaths of compute servers from being productive. Licensed tools rely on stateful TCP connections and simply can’t tolerate a host being non-responsive. And because checkouts consume a file descriptor on the server, there’s only a finite number of concurrent checkouts you can realistically handle.
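Both limits above come down to simple arithmetic, which a short Python sketch can make concrete. It assumes one file descriptor per checkout, as described above, and a three-server triad that stays up only while a majority of its servers can reach each other. The function names and the reserved_fds default of 64 are hypothetical, chosen for illustration; check your own license daemon's real limits before relying on numbers like these.

```python
def checkout_ceiling(fd_soft_limit, reserved_fds=64):
    """Rough upper bound on concurrent checkouts for one server process.

    Assumes one descriptor per checkout. reserved_fds is a hypothetical
    allowance for log files, listening sockets, and similar overhead.
    """
    return max(0, fd_soft_limit - reserved_fds)


def has_quorum(reachable_servers, total_servers=3):
    """True while a strict majority of the redundant servers are reachable."""
    return reachable_servers > total_servers // 2


def current_fd_soft_limit():
    """Read this process's soft descriptor limit (Unix-only helper)."""
    import resource  # imported here so the arithmetic above works anywhere

    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


if __name__ == "__main__":
    print(checkout_ceiling(1024))   # ceiling with a 1024-descriptor limit
    print(has_quorum(2))            # two of three servers still talking
    print(has_quorum(1))            # one server left on its own
```

With a soft limit of 1024 the ceiling is under a thousand checkouts, which is not much headroom for a grid with thousands of cores.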
mvfs: ERROR: view=joe_view vob=/vobs/production/release_vob – Communications timeout in ClearCase server
For those of you lucky enough to still use ClearCase in your environment, this type of error message instills immediate terror in administrators and engineers alike. ClearCase is the CM system that would be an Operating System. It uses the MultiVersion File System (MVFS), which is a kernel module. The history of MVFS goes way back to the old Apollo file system from back in the 1980s. Some developers from that company liked the multi-version capabilities and went on to form Atria Software, which merged with Pure Software, which then merged with Rational Software, which then was bought by IBM in 2003.
ClearCase utilizes centralized VOB (Versioned Object Base) and View servers to present a file system that uses a config spec to determine which objects are visible. Changes to the config spec alter what an engineer sees. Unfortunately, since this is now 30-year-old technology, it relies on a lot of the same assumptions that plague us in other services: central VOB server, central view server, NFSv3. If you put too many VOBs on a single server, you’ll have resource contention. The same goes for views.
So what do we see in common with all of the above paradigms? Let’s break them down.
Reliance on Big, Centralized Servers
Back in The Day, IT administrators could count on having total control of the environment. They controlled the horizontal and the vertical. Users had accounts at their whim and workstations did not appear until an admin deployed them. Servers were centralized for ease of administration. This paradigm works well when you’ve got a manageable number of clients. But inevitably, Semiconductor Computing scaled out, and when it scaled out, just having one or two servers to manage a service wasn’t enough. Admins deployed business-unit-specific servers. Or IT striped BUs across several different servers per service. But eventually, one group, in a fit of tapeout CPU consumption, brought one or more of these servers to their knees.
Result: a broad swath of compute sitting idle, waiting for the centralized resource to respond to connections.
Stateful TCP Connections
Since software developers and system administrators could count on large, monolithic servers being available all the time, they coded software applications with that fundamental assumption in mind. A process opened a socket, connected to a port on a server, and held that connection open for as long as it ran. If the service on the other end stopped responding, well, that socket just sat there in a fit of futex, waiting. On a small scale, this is not really a problem. Scale an infrastructure out to thousands of nodes, all demanding a socket on a specific server, and you’ve put a significant number of eggs into one basket.
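To make the hanging socket concrete, here is a minimal Python sketch. It stands up a local server that accepts connections and then never answers, which is roughly what an overloaded central service looks like from the client side. The function names are hypothetical, and this is only an illustration of the behavior, not how any particular EDA tool or scheduler client is written. Without the timeout, the recv call would block indefinitely, which is the behavior described above.

```python
import socket
import threading


def start_silent_server():
    """Listen on an ephemeral localhost port, accept clients, never reply."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(5)
    held = []  # keep accepted sockets open so clients see silence, not EOF

    def accept_loop():
        while True:
            try:
                conn, _addr = srv.accept()
            except OSError:
                return  # listening socket was closed
            held.append(conn)

    threading.Thread(target=accept_loop, daemon=True).start()
    return srv, srv.getsockname()[1]


def request_with_timeout(port, timeout_s):
    """Send a request; return the reply, or None if the server stays silent."""
    with socket.create_connection(("127.0.0.1", port), timeout=timeout_s) as sock:
        sock.sendall(b"ping")
        try:
            return sock.recv(1024)
        except socket.timeout:
            # A client without a timeout would still be blocked here.
            return None


if __name__ == "__main__":
    server, port = start_silent_server()
    print(request_with_timeout(port, 0.5))
    server.close()
```

A timeout only lets the client notice the problem and move on. It does nothing for the server that thousands of clients are still trying to reach.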
What to do
So what can we possibly do? All of these assumptions are baked in at the application software level, and we don’t have control of that layer because we purchase our software from third parties. Let’s face it. If we’re going to build the future of silicon, we need to update our computing paradigms. In the next post in this series, I will compare the Semiconductor computing model to newer paradigms and talk about what we will need to do in order to modernize.