With the slowing of Moore’s Law and the rise of machine learning, the tech world has increasingly turned its focus from CPUs to coprocessors, such as GPUs, DSPs, FPGAs and neural net accelerators. At the same time, new technologies continue to expand the possibilities for memory and storage.
Despite high demand for such technologies, integration challenges have slowed their entry into datacenters. “There are a lot of new hardware innovations, such as AI chips, that companies want to embrace, but they usually require that you replace your servers,” says Yiying Zhang, assistant professor of electrical and computer engineering. “Because these new technologies are designed for specific servers, adoption in datacenters has been slow.”
Zhang, who is the director of WukLab, has developed a novel solution that in a literal sense involves thinking outside of the box. Her LegoOS is not only a distributed OS, but also “disaggregated and decomposed.” Unlike existing distributed operating systems, such as Mach or Sprite, which span multiple monolithic servers, LegoOS is designed to get the most out of a disaggregated network of single-purpose hardware devices.
LegoOS assumes the coming deconstruction of monolithic servers into physically separated devices, such as CPU, GPU, DRAM, SSD and HDD. “We are breaking the computer apart,” says Zhang. “There is no computer. Everything operates on its own and is attached to the network.”
With a network of systems running LegoOS, no single CPU node would control the others, and no memory would be shared across processors. Every hardware unit, whether designed for processing, memory or storage, would have its own controller and network interface. The controllers would communicate with each other across a high-bandwidth, 200Gbps network with sub-microsecond latency, such as networks based on the latest InfiniBand technology.
The new architecture would enable easy swap-outs of sub-systems rather than entire servers, making it easier to add and upgrade new technologies without downtime, says Zhang. “There are times when you may need more memory while other times you need more processing,” she says. “Today, if you need to scale a network of monolithic servers in a particular area, such as memory, you may have to buy new servers, even if you don’t need all the other components that come with them.”
As a result, datacenters often do without the resources they need, or else over-invest in resources that often sit idle. Those idle resources still consume considerable energy and money for power and cooling, on top of the cost of the hardware itself.
Existing distributed OSes are designed to share access to resources available on monolithic servers across the network, but they do not do this very efficiently, says Zhang. “If you want to run an application that needs a special processor, you need to request it from the monolithic server containing that processor.”
A fully distributed and disaggregated system would reduce redundancy and offer more efficient and fluid resource usage, says Zhang. Among other benefits, it could more easily integrate chips from different architectures such as Arm and x86.
Improved failure handling is another benefit of a disaggregated system. “On a monolithic server, if the OS crashes, all the apps on that server crash,” says Zhang. “But if resources are separated, then other applications can continue.”
Speeding Networking with Enhanced RDMA
LegoOS is only possible due to the accelerating bandwidth of networking technology. “The good news is that networks have become orders of magnitude faster,” says Zhang. “Yet, if you break servers apart and increase scale, networking becomes the burden.”
To bring network speed closer to that of the interconnects within a computer, Zhang is working with RDMA (remote direct memory access), a networking technology that enables direct access from the memory in one computer to memory on another without requiring the involvement of an OS. The RDMA protocol can be used as an HPC interconnect with benefits including low latency, high throughput and low CPU utilization.
Zhang’s WukLab has launched two projects designed for both LegoOS and existing distributed operating systems that aim to better integrate RDMA and NVM (nonvolatile memory), respectively. One project is developing a kernel-level indirection layer for RDMA running on Linux called “LITE” (Local Indirection TiEr). LITE uses a thin layer of virtualization to provide a more flexible, higher-level abstraction. This should enable greater scalability, improved resource sharing, and more flexible protection against failure without requiring kernel bypass, says Zhang.
“RDMA is faster than Ethernet, but it is not specifically designed for datacenter environments or large-scale distributed systems,” she explains. “LITE makes RDMA more scalable with better performance.”
The NVM project, meanwhile, is developing a Distributed Shared Persistent Memory (DSPM) framework that enables persistent memories such as NVMs to work in distributed datacenter environments. DSPM adds an abstraction layer that allows applications to name, share and “persist” data while also performing traditional memory load and store instructions.
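To make the “name, share and persist” idea concrete, here is a minimal single-process sketch. It is not the DSPM interface: the class and method names (Namespace, PersistentRegion, open, persist, crash) are hypothetical, and the “crash” is simulated by discarding anything that was stored but never persisted.

```python
class PersistentRegion:
    """Toy model of a named region of persistent memory (hypothetical API)."""

    def __init__(self, name, size):
        self.name = name
        self.working = bytearray(size)  # what load/store operate on
        self.durable = bytes(size)      # what survives a simulated crash

    def store(self, offset, value):
        # Ordinary memory-style write; not yet guaranteed durable.
        self.working[offset] = value

    def load(self, offset):
        # Ordinary memory-style read.
        return self.working[offset]

    def persist(self):
        # Explicit durability point: snapshot the working copy.
        self.durable = bytes(self.working)

    def crash(self):
        # Simulated failure: unpersisted stores are lost.
        self.working = bytearray(self.durable)


class Namespace:
    """Lets several clients find the same region by name (assumed design)."""

    def __init__(self):
        self.regions = {}

    def open(self, name, size=64):
        if name not in self.regions:
            self.regions[name] = PersistentRegion(name, size)
        return self.regions[name]
```

Two clients that open the same name see each other's stores, and only data written before a persist() call survives crash(). A real distributed implementation would also have to handle replication and consistency across nodes, which this sketch ignores.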
“NVMs such as Intel’s Optane are much faster than other storage technologies, and you can attach them directly to main memory,” says Zhang by way of explaining the importance of DSPM to LegoOS.
Building More Robust Controllers
A disaggregated LegoOS system would require redesigning some of today’s hardware so that devices can be cleanly separated and operate on their own. For example, to separate memory from a processor, the processor needs to communicate only with virtual memory addresses, which means traditional CPU caches would need to be virtually indexed and virtually tagged. In addition, a processor would be given a small allotment of local DRAM that LegoOS would manage as an additional level of cache.
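The caching arrangement described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not LegoOS code: the names RemoteMemory and VirtualCache, the 4 KB page size, the LRU eviction policy and the use of a process id in the tag are all illustrative choices. The point it shows is that the processor side looks up data using virtual addresses only, and goes to the remote memory component only on a miss.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size, for illustration only


class RemoteMemory:
    """Stand-in for a memory component reached over the network."""

    def __init__(self):
        self.pages = {}
        self.fetches = 0  # counts simulated network round trips

    def fetch(self, pid, vpage):
        self.fetches += 1
        key = (pid, vpage)
        if key not in self.pages:
            self.pages[key] = bytearray(PAGE_SIZE)
        return self.pages[key]


class VirtualCache:
    """Local DRAM used as a cache, tagged by (process id, virtual page)."""

    def __init__(self, capacity, remote):
        self.capacity = capacity    # number of pages held locally
        self.remote = remote
        self.lines = OrderedDict()  # tag -> page, in LRU order

    def load(self, pid, vaddr):
        tag = (pid, vaddr // PAGE_SIZE)  # no physical address involved
        if tag in self.lines:
            self.lines.move_to_end(tag)          # hit: refresh LRU position
        else:
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)   # evict least recently used
            self.lines[tag] = self.remote.fetch(*tag)
        return self.lines[tag][vaddr % PAGE_SIZE]
```

With a two-page cache, repeated loads from the same virtual page cost one remote fetch, while touching a third page evicts the least recently used one. Including the process id in the tag is one way to keep identical virtual addresses from different processes apart; the source does not say how LegoOS handles this.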
Among the other hardware changes required by LegoOS, the controllers found in today’s memory and storage systems will need to be much more sophisticated. “Today’s device controllers still rely on the CPU to run an operating system,” says Zhang.
Once the networking and hardware challenges are overcome, the LegoOS project will focus on the biggest challenge: software. The current version of LegoOS aims to support the same user interface as Linux since most datacenter applications are built on top of Linux. But the internals of LegoOS are completely different from those of Linux in that the kernel is split into multiple stateless managers for each component on the network. Each manager communicates with each device controller by way of monitor software that runs on each device.
“We are separating the OS functionality into different pieces, connected with a networking stack and other layers,” says Zhang. “The goal is that an application will think it’s running on a monolithic server. We will have a light, thin layer we’re calling a V-Node that provides virtualized application translation. We will map each V-Node to a different set of hardware resources.”
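One way to picture mapping a V-Node onto separate hardware components is the sketch below. It is a simplified model, not the LegoOS allocator: Component, VNode, ResourcePool and the first-fit spill policy are hypothetical names and choices made for illustration, and capacities are in abstract units.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    kind: str      # "cpu", "memory" or "storage"
    name: str
    capacity: int  # abstract units
    used: int = 0

    def free(self):
        return self.capacity - self.used


@dataclass
class VNode:
    name: str
    grants: list = field(default_factory=list)  # (component, kind, units)

    def total(self, kind):
        return sum(units for _, k, units in self.grants if k == kind)


class ResourcePool:
    """Network-attached components that V-Nodes draw from."""

    def __init__(self, components):
        self.components = list(components)

    def grant(self, vnode, kind, units):
        # Give `units` of `kind` to the V-Node, spilling across components.
        candidates = [c for c in self.components if c.kind == kind]
        if sum(c.free() for c in candidates) < units:
            raise RuntimeError(f"not enough {kind} in the pool")
        for c in candidates:
            take = min(c.free(), units)
            if take:
                c.used += take
                vnode.grants.append((c.name, kind, take))
                units -= take
            if units == 0:
                break
```

In this model a V-Node can receive memory from two different memory components while using a single processor component, and adding another memory component to the pool raises available memory without touching anything else, which is the kind of independent scaling Zhang describes.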
The current version of LegoOS is designed to run a single type of processor, memory technology, and storage monitor, and can run unmodified datacenter applications like Google’s TensorFlow. Yet, “it will eventually be more heterogeneous,” says Zhang. “So far, the performance is better than I expected.”