How system disaggregation would reorganize IT, and how Arm may benefit

The phrase system disaggregation or server disaggregation refers to a design pattern for both manufacturing and deployment of a computing system whose component parts are constructed as individual units, and connected using network fabric instead of a system bus. Today, a computer’s motherboard incorporates a central processor, main memory, a chipset that manages the motherboard’s internal functions, interfaces to the other external components a computer may use (e.g., display, storage, networking), and a bus that acts as a circuit tying them all together.

Segment of “Broadway Boogie-Woogie” by Piet Mondrian, in the public domain.

A future computer might be built very differently. So long as network fabric (short-distance, high-speed interconnections over high-grade cables such as fiber) is at least as fast as a bus, if not faster, the components that were once rooted to a motherboard may become more loosely coupled, and perhaps manufactured separately. In such a market, these components would be interchangeable.

If you think we’re only talking about computers for the enterprise or large organizations, consider a connected home office (a task made much easier by the unforeseen circumstances of 2020) where a centralized hub serves as the streamer of video, the gateway for cloud-based applications, the server for stored documents, and the temperature gauge for your refrigerator. In a universe of disaggregated parts, devices perform simple tasks, and combinations of interoperable devices perform complex applications.

So why would you want this?  After all, your smartphone is becoming more capable by the month at running spreadsheets. This is a question worth asking, repeatedly, because the answers thus far are only partial.

The new parts of a disaggregated system

You may have seen the term “disaggregation” applied to a network and presented in that context as a design trend. The common element among most of the definitions of network disaggregation is that many components in a system are networked together, rather than welded or stamped out together. A more accurate synonym for this term could arguably be “networking,” since its whole principle is to utilize addressable connectivity.

Network disaggregation is the decomposition of a network into more interoperable parts than it had before. Server or system disaggregation is, by comparison, the decomposition of the components of a computer system into self-operating parts that are connected by a network rather than by wires, sockets, and interfaces.

The most important system disaggregation architecture presently under discussion would do the following:

  • Decouple the server’s own network interface card (the part that enables a system to communicate with, and pass data between, other systems) and the CPU, on an entirely new bus marshaled by a new class of component, which some manufacturers call the DPU (a name believed to have been coined by Nvidia, but not fully decided upon just yet);
  • Move data storage onto an interface managed by the DPU, so that the CPU accesses stored data utilizing the network rather than the system bus, thus enabling policy and security controls for the connection;
  • Conceivably, decouple the GPU (formerly just the graphics engine, but more recently utilized for highly parallel processing), along with FPGAs and other so-called accelerators, from the CPU, moving them onto another separate component that communicates with the CPU and DPU on the same network;
  • Disestablish the one-to-one relationship between all three of these components, so that a CPU may engage as many DPUs as may be necessary for a task, and that both CPUs and DPUs may utilize GPUs and accelerators — conceivably expediting the execution of certain network management tasks by orders of magnitude.
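
The many-to-many relationship in that last point can be pictured as a small graph of addressable nodes. What follows is only a minimal sketch under stated assumptions: the class names (Node, Fabric), the engage method, and the addresses are all hypothetical, invented for illustration, and do not correspond to any vendor's design.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """An addressable component on the fabric (addresses here are made up)."""
    address: str
    kind: str  # "cpu", "dpu", "gpu", or "storage"


@dataclass
class Fabric:
    """Toy model of a network fabric standing in for the motherboard bus."""
    nodes: dict = field(default_factory=dict)
    links: set = field(default_factory=set)  # (requester address, provider address)

    def add(self, node):
        self.nodes[node.address] = node
        return node

    def engage(self, requester, provider):
        """Record that one component uses another; there is no one-to-one limit."""
        self.links.add((requester.address, provider.address))

    def engaged_by(self, requester, kind):
        """Sorted addresses of the components of one kind that a requester uses."""
        return sorted(
            provider
            for req, provider in self.links
            if req == requester.address and self.nodes[provider].kind == kind
        )


def build_example():
    """One CPU engaging two DPUs, with the CPU and a DPU sharing one GPU."""
    fabric = Fabric()
    cpu = fabric.add(Node("10.0.0.1", "cpu"))
    dpu_a = fabric.add(Node("10.0.0.2", "dpu"))
    dpu_b = fabric.add(Node("10.0.0.3", "dpu"))
    gpu = fabric.add(Node("10.0.0.4", "gpu"))
    disk = fabric.add(Node("10.0.0.5", "storage"))
    fabric.engage(cpu, dpu_a)
    fabric.engage(cpu, dpu_b)
    fabric.engage(cpu, gpu)
    fabric.engage(dpu_a, gpu)    # a DPU may use the accelerator too
    fabric.engage(dpu_a, disk)   # storage sits behind the DPU, not the CPU
    return fabric, cpu, dpu_a
```

In this toy model the CPU has no direct link to storage at all; it reaches stored data only by way of a DPU, which is where the policy and security controls described above would sit.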

What is a DPU and why would anyone need one?

Sometimes called a data processing unit, a DPU would be centered around the processor in its network interface card, or more specifically, its SmartNIC — a network controller managed by a programmable processor. In another class of system, this processor may as well be a CPU or an Arm system-on-a-chip (SoC) (if that phrase seems familiar, you may have read our recent explainer on Arm processors). This DPU would probably not be a “card” anymore, but rather a node — an addressable component of a network.

A would-be Mellanox SmartNIC becomes an Nvidia DPU: The BlueField-2X.


Nvidia Corp.

In its early October conference, Nvidia premiered its first Mellanox SmartNICs to receive its DPU branding, the BlueField-2 and 2X. Notice that it’s still a PCI card, at least in this incarnation, not some separate box with Ethernet ports. In a videotaped presentation, Nvidia CEO Jensen Huang asserted that software-based processing of regular expressions (RegEx) for decryption and authentication of inbound IPsec packets at 100 gigabits per second (Gbps) would consume the work of 125 standard CPU cores.
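
Taken at face value, those two numbers imply a per-core throughput. The arithmetic below is a back-of-envelope sketch, and it assumes the load divides evenly across all 125 cores, which the presentation did not state. The function names are illustrative only.

```python
LINE_RATE_GBPS = 100   # figure quoted in the presentation
CPU_CORES = 125        # figure quoted in the presentation


def per_core_gbps(line_rate_gbps, cores):
    """Throughput each core would handle if the load split evenly (an assumption)."""
    return line_rate_gbps / cores


def per_core_megabytes_per_second(line_rate_gbps, cores):
    """The same share in megabytes per second: 1 Gbps = 1000 Mbps, 8 bits per byte."""
    return line_rate_gbps * 1000 / 8 / cores


if __name__ == "__main__":
    print(per_core_gbps(LINE_RATE_GBPS, CPU_CORES), "Gbps per core")
    print(per_core_megabytes_per_second(LINE_RATE_GBPS, CPU_CORES), "MB/s per core")
```

Under that even-split assumption, each core would be handling 0.8 Gbps, or about 100 megabytes per second, of inbound traffic.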

Two points Huang left out:  1) We don’t yet know how much faster a BlueField-2 system would perform by comparison.  2) Does any server worth its weight that is delegated the task of inbound packet processing not already have some kind of DPDK accelerator, like one of Intel’s existing FPGA cards, for offloading these tasks from the CPU today?

“I think that future infrastructure will be optimized by a new class of network adapter,” explained Paul Perez, CTO of Dell Technologies’ Infrastructure Solutions Group, during a recent VMware conference. “I hesitate to call it a ‘network adapter.’  Think of it as a collection of general-purpose Arm cores, its own complement of offload engines, high-performance memory available to that NIC locally, as well as a well-balanced collection of local-host interface bandwidth and data center fabric-based Internet bandwidth. It’s not your grandfather’s NIC.”

That’s several networking terms in just a few sentences, so let’s break them down a bit:

  • Offload engine: A specialized processor for handling frequently performed classes of tasks, such as encryption and decryption. This is not a new concept at all; in fact, Intel pioneered this market in the last decade with Xeon Phi co-processors.
  • Local high-performance memory: Because network cards often utilize extensive lookup tables, they’re typically equipped with ample caches of static RAM (SRAM). Like DRAM, SRAM is volatile, losing its contents when power is removed, but it’s not meant for long-term storage anyway. By “ample” in this case, we’re still talking megabytes rather than gigabytes, but think of it like a small gear with high torque. When perusing an indexed data table, a processor could make good use of a high-speed cache for something other than just its own internal logic.
  • Local-host and fabric-based bandwidth: This refers to the relative level of connectivity of components that are now addressable over a network, that historically have been interfaces on the local bus. If the bus is truly to be broken up, the local data fabric will need to be a veritable autobahn.
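
To make the second item concrete, here is a minimal sketch of a small, fast cache sitting in front of a larger lookup table. The least-recently-used eviction policy, the class name CachedLookup, and the address-to-port table are all assumptions made for illustration; real network cards may organize their on-board memory quite differently.

```python
from collections import OrderedDict


class CachedLookup:
    """A large table fronted by a small cache with least-recently-used eviction."""

    def __init__(self, table, capacity):
        self.table = table          # stands in for the large, slower table
        self.capacity = capacity    # stands in for the limited on-card SRAM
        self.cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)      # mark as most recently used
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        value = self.table[key]              # slow path: consult the full table
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict the least recently used entry
        return value
```

The point of the "small gear with high torque" comparison shows up in the counters: when the same few addresses are looked up repeatedly, most requests never reach the full table.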

Originally, a NIC’s processor was a fairly common application-specific integrated circuit (ASIC). It wasn’t too sophisticated. In its initial, PC-based design, it was a card that fit in a PCI expansion slot. PCI is a bus (originally called “buss,” for no particular reason): literally a route with stops on a computer’s motherboard, which carries data in parallel between the PC’s memory and the internal memory of all the interfaced devices. One of the NIC’s most necessary functions was converting serial data (the type that comes in over the network) to parallel data that can be shuttled between processors and memory.
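
The serial-to-parallel conversion can be sketched in software as a shift register that collects incoming bits and releases them a word at a time. This is a simplified illustration, not how any particular NIC implements it: the function name is hypothetical, and the most-significant-bit-first ordering is an arbitrary choice made for this sketch.

```python
def serial_to_parallel(bits, width=8):
    """Group a serial bit stream into width-bit words, most significant bit first.

    Returns the completed words plus the number of leftover bits that have not
    yet filled a whole word.
    """
    words = []
    register = 0   # the shift register accumulating incoming bits
    count = 0      # how many bits the register currently holds
    for bit in bits:
        if bit not in (0, 1):
            raise ValueError("bits must be 0 or 1")
        register = (register << 1) | bit
        count += 1
        if count == width:
            words.append(register)   # a full word is ready for the parallel bus
            register = 0
            count = 0
    return words, count
```

Feeding in the sixteen bits 01000001 11111111 would produce two 8-bit words, 65 and 255, with no bits left over.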

By contrast, while a SmartNIC may also be, to a certain extent, application-specific, it’s based on a general-purpose, multi-core design that is programmable using a high-level language. That’s extremely important, because it means certain network tasks — the most important of which are classifiable as security — may be offloaded from the CPU and delegated to the DPU. (By the way, “SmartNIC” is not a brand name, even though the “S” is capitalized, and it tends to read that way, especially to anyone who grew up with the electronic devices of the 1970s and 80s.)

Tracing the DPU value proposition

Graphics cards with independent processors first became generally available in the 1980s. At that time, I explained to PC owners that 3D graphics as a task required a fundamentally different type of processing than a serial logic processor was capable of performing: one in which a single class of task, with different inputs, is replicated among hundreds or even thousands of threads, and executed in parallel. Graphics processors became general-purpose “GPUs” when Nvidia, their leading producer, built a library called CUDA that made it feasible for scientific developers to leverage the same parallel processing engine for other purposes.
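
The pattern described here, one class of task replicated across many inputs, can be sketched in a few lines. This is only an illustration of the shape of the work, under stated assumptions: the shade function and its gain parameter are made up, and Python threads will not deliver the speedup that a GPU's hardware parallelism does.

```python
from concurrent.futures import ThreadPoolExecutor


def shade(pixel, gain=2):
    """One small task: brighten a single pixel value, clamped to 255."""
    return min(255, pixel * gain)


def shade_all(pixels, workers=4):
    """Apply the same task to every input, spread across worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() keeps the results in the same order as the inputs
        return list(pool.map(shade, pixels))
```

Every worker runs identical logic and only the input differs, which is what distinguishes this kind of workload from the branching, sequential logic a general-purpose CPU is built around.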

Why shouldn’t the GPU design pattern work well enough for DPUs?

It’s difficult to make the same case for DPUs. As SDN has already proven, CPUs manage network control planes and data planes quite adequately. So the DPU argument is being framed as a pre-emptive response, casting the data plane as something that will soon overburden the CPU, the way graphics once did.

“All of a sudden, many of the functions that are network-centric are moved into network-optimized resources,” explained VMware CEO Pat Gelsinger during a recent press conference, “allowing us to take fully distributed capabilities across the data center, or edge and cloud as well. As a result of that, you’ll start seeing a lot more security functions, and you’ll be able to have higher-bandwidth capabilities that today, essentially, choke a CPU because of the overall bandwidth requirements.”

Nvidia built its foundation on a market that was effectively peeled off of the x86 CPU arena that Intel was leading, and where AMD and a company called Via were emerging as challengers. And it built a completely new use case — accelerated artificial intelligence — on top of an otherwise unrelated platform.

Clearly, Nvidia is working towards a repeat performance. It has bet the farm — or perhaps a country with several thousand farms — on owning Arm Limited, a processor company that doesn’t make any of its own processors, just to become the direct recipient of its revenue stream from intellectual property. But not only that:  Nvidia had earlier acquired Mellanox, a leading producer of SmartNICs managed by Arm processors. Now Nvidia seeks to peel off another emerging use case currently relegated to general-purpose computing, and build another revenue platform off of a fully-functional processor which yet another use case — perhaps one completely unrelated to SDN — may one day leverage.

Graphics was a clearly defined need that the first GPUs addressed just in time. Semi-autonomous data stream management is not so clearly defined a need. What’s more, x86 CPUs continue to accrue more and more cores, advancing the counter-argument that it may be easier to delegate new cores to SDN management than to peel off SDN altogether and recompile it for a wholly different architecture.

What would DPU-style system disaggregation accomplish?

There are a variety of arguments in favor of the DPU approach to server architecture. Some appear to have been appropriated a little after the fact, while others may have organic merit of their own.

  • The data plane may be managed more directly without taxing the CPU.  You may recall from ZDNet’s earlier introduction to software-defined networking (SDN) how the flow of the network is divided into the control plane and the data plane. With respect to data, you’ve already heard the warning bugles about how much more data there is in the world. If the pursuit of Moore’s Law hadn’t had a head-on collision with the laws of physics, this actually wouldn’t be a real problem right now. But the idea of dedicating processing cores outside of the CPU to a networking-related task is, at the very least, intriguing. What we have yet to see is actual, verifiable evidence of efficiency gains.
  • A DPU may act as a kind of storage server for the CPU.  Let’s be perfectly honest and admit that this is already being done — in fact, it has been for quite some time. Network-attached storage systems are managed by multi-core processors, which in many cases are yesterday’s CPUs repurposed for storage systems.
  • Isolating the processor executing network policy from the central processor may eliminate old, ineffective system perimeters, and greatly increase system security.  This is the argument that may have the greatest merit, outside of observing actual DPU performance in a working environment. System vulnerabilities often involve exploiting deficiencies in the software being executed by the CPU, so that the CPU’s guard will be brought down and made to run privileged instructions that access sensitive data. If security were delegated to a processor other than the CPU, then this means of exploit might not work anymore.
  • A DPU may be leveraged as a kind of accelerator for kinds of network-oriented processes never before considered.  One example Nvidia offers is RDMA offloads, which could be useful in machine learning (ML). Remote Direct Memory Access (RDMA) refers to the ability to read the memory of another system, connected through the network fabric. ML is very data-intensive. Once memory became cheap, it became feasible for processors to run their entire training stages out of huge pools of DRAM. Distributed computing nodes need to be able to utilize segments of those huge pools, without those segments having to be written to disk first. A DPU could become a network gateway to other servers’ memory, potentially revolutionizing the orchestration of distributed applications.
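
The RDMA idea in the last item, reading a segment of another system's memory pool without copying it to disk first, can be loosely imitated inside a single process. The sketch below is only an analogy under stated assumptions: the MemoryPool class and its methods are hypothetical, nothing here crosses a network, and real RDMA relies on dedicated hardware and libraries that this example does not use.

```python
class MemoryPool:
    """Stands in for one server's pool of DRAM exposed to other nodes."""

    def __init__(self, size):
        self._buffer = bytearray(size)

    def write(self, offset, data):
        """The owning node updates its own memory in place."""
        if offset < 0 or offset + len(data) > len(self._buffer):
            raise ValueError("write lies outside the pool")
        self._buffer[offset:offset + len(data)] = data

    def expose(self, offset, length):
        """Hand out a read-only, zero-copy window onto one segment of the pool."""
        if offset < 0 or length < 0 or offset + length > len(self._buffer):
            raise ValueError("segment lies outside the pool")
        return memoryview(self._buffer)[offset:offset + length].toreadonly()
```

A reader holding such a window sees later writes by the owner immediately, because no copy is ever made; that is the property distributed training jobs would want from a shared pool.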

At the recent VMworld 2020 virtual conference, VMware announced the next generation of its virtual infrastructure management platform, vSphere. That effort, Project Monterey, aims to weld the Kubernetes orchestrator permanently into the platform (at present, it’s an attached layer), but another of its goals is to converge CPU, GPU, and DPU address spaces into a single environment. This would enable the orchestrator to stage virtual machines and containers in the DPU space, in the same manner as the conventional CPU space, adding DPUs to the pool of distributed computing nodes.

“We’re really talking about an expansion of our thinking,” remarked Kit Colbert, vice president and CTO of VMware’s Cloud Platform Business Unit, “around how compute is done, and where.”

Who benefits most from disaggregation: the user or the vendor?

There are some intriguing potential benefits to such an arrangement. Clearly it would benefit manufacturers such as Nvidia and Arm, which last month announced their intentions to become one company. Intel’s dominance thus far has come not so much on account of AMD’s travails as from the long-ago establishment of the x86 bus architecture as a kind of “ecosystem.” As long as Intel keeps manufacturing chipsets, there will continue to be motherboards: the well-established, pre-planned neighborhoods of the constituent parts of the x86 economy.

Disaggregation would break all that up. By decoupling once interdependent parts from one another, and placing each of them behind addresses in a neutral network, x86’s dominant position would no longer be assured by means of an unavoidable, non-substitutable bus. That gives Arm a way in.

So naturally, one of the companies establishing itself as an early leader in disaggregation is Intel, in an effort to get out in front of what might be called a “dynamic trend.”

“I think this is where we’ve clearly seen SmartNICs take off: in these cloud environments,” remarked Brad Burres, chief architect at Intel’s Data Platforms Group, speaking at a recent VMware conference. “There is this building of capabilities that the top hyperscalers have already started doing, because of offloading your virtual network, accelerating storage, and providing the isolation capability in security. This is really what’s driving it at the hyperscale level. Like all of the other technologies, they build, [and] it’s going to start filtering down to the rest of the world. Smaller clouds will start building it; enterprises will start leveraging it. I think you’ll see that waterfall effect, just like many other technologies we’ve seen.”

Arm stands to gain from the effort, even if it doesn’t quite come to full fruition. Moving security functions off of the same chain of events as user applications, and onto a fully programmable processor, can only benefit the data center, even if the SmartNIC remains a card that sits on a PCIe interface bus.

The bigger picture of disaggregation

There was a day when aggregation was the trend in data center server design. In fact, you could say that the whole reason x86 architecture (originally created for PCs) became so widely and suddenly prevalent among servers was so that design patterns could be standardized, manufacturing processes could be automated, and costs lowered.

So hearing manufacturers and software vendors describe the virtues of disaggregation as essentially the same ones once claimed for aggregation is somewhat jarring to anyone with a long history of listening to manufacturers’ value propositions and sales pitches. We’re told what the trends should be, usually based on what benefits the vendor first, and we’re pushed to pitch those trends as prevalent and relevant. (One vendor catch phrase in my collection is “dynamic trend,” which is an oxymoron. If it were dynamic, it wouldn’t be a trend.)

“The compute virtualization trend that enabled easier resource pooling and management, has now extended to networking, storage, and security,” explained Jensen Huang, speaking to attendees of his company’s conference in early October from his kitchen, amid a collection of pepper grinders and a bouquet of spatulas.

“What used to be dedicated hardware appliances,” Huang continued, “are now software services running on CPUs. The entire data center is software-programmable, and can be provisioned as a service. Virtual machines send packets across virtual switches and virtual routers. Perimeter firewalls are virtualized, and can protect every node. Microsegmentation secures east/west communications, so attackers can’t move laterally across the data center.”

It sounds almost like a carbon copy of the argument in favor of relocating the network control plane off dedicated appliances and onto CPU threads. But the argument now is that, like graphics in the 1980s, data processing has become such an intensive job that it deserves its own separate processor.

That’s the status of this assertion at the moment: an argument. Sometimes it takes good arguments to make good products. But in the history of information technology, in the absence of arguments, trends have still happened. The most oft-repeated pattern in IT trends has been about convergence — when systems come together, when platforms are established, when standards are agreed upon, when cooperation leads to openness. Left unto themselves, systems coalesce as if drawn together by gravity.

Disaggregation happens when the coalescence of systems creates a logjam blocking one’s path to a potential revenue stream. It’s the dynamite vendors use to blow up an existing platform. Never, not once, in the history of IT, has disaggregation come about as a natural course of events. It can work, but it takes a concerted effort, and an obvious payoff once it’s done. Yet even when the benefits seem obvious enough, the attempt has failed before.

This could be the start of something big, as the song goes. Or, like most of the ideas ever presented at IT conferences, it could end up just the start of something.