
50 Most Admired Companies to Watch 2024

CIO Bulletin

Redefining the Future of High-Performance Computing: Abacus Semiconductor Corporation Is Breaking Boundaries with Next-Generation Processor and Memory Innovations

“Our improved internals of processors, accelerators, and memory, combined with unified and higher-performance interfaces, will overcome many of the limitations users experience today and transform the landscape of high-performance computing.”

In the world of advanced computing, where the quest for greater efficiency and power drives innovation, a fundamental challenge persists: the limitations imposed by traditional architectures. The von Neumann architecture, despite its long-standing use, continues to hinder the full potential of high-performance computing (HPC) systems. Addressing this challenge is not just about incremental improvements; it requires a revolutionary shift in how processors and memory are designed and integrated.

Abacus Semiconductor Corporation stands at the forefront of this revolution. As a fabless semiconductor company, Abacus is transforming the field with its cutting-edge designs of processors, accelerators, and smart multi-homed memories for the next generation of supercomputers and HPC applications. Their innovative approach addresses the long-standing bottleneck of the von Neumann architecture, which has hindered the full realization of supercomputer potential. By re-engineering the fundamental components of HPC and of large language models for artificial intelligence (AI), Abacus Semiconductor is paving the way for a future where the full potential of computational power can be realized, setting new benchmarks in the industry.

At CIO Bulletin, we had the privilege of interviewing Axel Kloth, President and CEO of Abacus Semiconductor Corporation. He discussed how he and his team are on a mission to expand the use of HPC, AI, and machine learning (ML) training with large language models (LLMs) by making these technologies more affordable, user-friendly, and energy-efficient, while also enhancing performance and cybersecurity.

Interview Highlights

Q. What inspired you to create Abacus Semiconductor Corporation, and how do you maintain your innovative edge in such a competitive industry?

I created Abacus Semiconductor Corporation because I became frustrated with the current state of the art in HPC and in the training aspects of AI, particularly for Generative AI. While Central Processing Unit (CPU) designers have made tremendous progress in improving processors, we do not see a scale-out effect that is anywhere near linear. Using 100K CPUs with vastly more than 1 million cores does not yield 100K times the performance of a single CPU; often, the aggregate performance of 100K CPUs is less than 10K times that of a single CPU. This discrepancy is a waste of money, space, operating cost, and effort, and we believe it is unnecessary. When General Purpose Graphics Processing Units (GPGPUs) were added to the mix with CPUs, there was some improvement, but only a marginal one. The reality is that the interface between a CPU and a GPGPU does not meet today’s expectations, nor does it keep pace with the performance of either the CPU or the GPGPU. In other words, there is no efficient way to connect CPUs with each other, and CPU-to-GPGPU connections are even worse. Since no one else was willing to tackle this problem, I felt compelled to take action, and thus Abacus Semiconductor Corporation was born.
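Amdahl's law alone makes the shortfall concrete: even a tiny serial or communication fraction caps the aggregate speedup a cluster can reach. Here is a minimal sketch; the 0.01% serial fraction is an illustrative assumption, not a figure from Abacus:

    # Amdahl's law: speedup(n) = 1 / (s + (1 - s) / n), where s is the
    # fraction of work that cannot be parallelized (serial or communication).
    def amdahl_speedup(n, s=1e-4):  # s = 0.01% serial fraction (assumed)
        return 1.0 / (s + (1.0 - s) / n)

    for n in (1, 1_000, 100_000):
        print(f"{n:>7,} CPUs -> {amdahl_speedup(n):>8,.0f}x one CPU")
    # Prints roughly 1x, 909x, and 9,091x: 100,000 CPUs deliver less than
    # 10,000x the performance of one CPU, matching the discrepancy above.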

Q. Abacus Semiconductor Corporation is known for rethinking traditional CPU architectures with your beyond-von-Neumann and beyond-Harvard architectures. Can you elaborate on the inspiration behind this transformative vision and how it sets you apart in the semiconductor industry?

Yes, it goes back to the lack of interconnection between the components in a computer, whether within a server, across servers, or between accelerators and the servers. Imagine you're sitting at a desk with a task you can't solve, but your desk mate can. Handing it off is as simple as asking them and passing it over. Now, if this desk mate is on a different floor, you need to walk there. During that time, you'll be busy but unproductive. It's not just about how much data and what instructions you can transfer; it's also about the time it takes to get there, which is called latency. Having little data and lots of instructions, combined with high latency, hinders productivity. Scale-out efficiency decreases with increasing distance and thus latency.
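The desk-mate analogy maps onto a simple timing model: every handoff adds latency during which the sender is busy but unproductive. A sketch with purely illustrative timings:

    # Fraction of wall-clock time spent on useful work when each unit of
    # work requires one handoff over a link. All timings are hypothetical.
    def utilization(compute_us, handoff_latency_us):
        return compute_us / (compute_us + handoff_latency_us)

    compute_us = 10.0  # useful work per handoff, in microseconds (assumed)
    for latency_us in (0.1, 1.0, 10.0, 100.0):
        print(f"latency {latency_us:>6.1f} us -> "
              f"{utilization(compute_us, latency_us):6.1%} productive")
    # ~99% productive at 0.1 us (the desk mate next to you) but ~9% at
    # 100 us (the colleague on another floor): latency, not bandwidth
    # alone, is what erodes scale-out efficiency.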

When computers were first conceived, CPUs were slow, and so was their memory. The von Neumann architecture, which initially involved just input, a processor, and output, was adequate. Later, a memory port was added for storing interim results and fetching instructions. As CPUs improved and became faster each year, memory became denser and larger, but not faster. To address this, CPU designers invented an intermediate memory called a cache, which hides memory latency from the CPU and uses memory bandwidth to pre-fetch data. Caches must have a certain hit rate to be effective, and the patterns for fetching instructions differ significantly from those for arbitrary data. This is where the Harvard architecture comes in, providing separate interfaces on a CPU core for the first-level caches for instructions and data. Making these interfaces independent improved cache hit rates and, thus, CPU performance.
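The impact of cache hit rates described here can be quantified with the standard average memory access time (AMAT) formula; the cycle counts below are typical illustrative values, not numbers for any particular processor:

    # AMAT = hit_time + miss_rate * miss_penalty (standard cache model).
    def amat_cycles(hit_time, hit_rate, miss_penalty):
        return hit_time + (1.0 - hit_rate) * miss_penalty

    # Hypothetical core: 4-cycle L1 hit, 200-cycle penalty to reach DRAM.
    for hit_rate in (0.90, 0.95, 0.99):
        print(f"hit rate {hit_rate:.0%} -> "
              f"{amat_cycles(4, hit_rate, 200):5.1f} cycles per access")
    # 90% -> 24 cycles, 99% -> 6 cycles: a few points of hit rate, such as
    # the gain from separate instruction and data caches in the Harvard
    # approach, change the effective memory speed the core sees by 4x.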

For the past 30 years, CPU designers have focused on improving CPU performance, and they have done a tremendous job. However, no matter how many CPU cores we can place into a processor, our computational challenges have grown even faster. This means that a single CPU or GPGPU cannot solve the problems we face, especially in HPC and large-scale AI applications, particularly in training. As such, the focus must shift towards system architecture, which requires improving the connectivity between CPU cores and any accelerator cores in the system. Additionally, there needs to be improved access to shared memory when required by the software, or when large-scale data transfers from CPU cores to accelerator cores are not practical or feasible. This is what Abacus Semiconductor Corporation is focused on addressing.

Q. Your technology addresses the gap between theoretical peak performance and real-life performance in supercomputers. What are the key innovations in your processors, accelerators, and memory subsystems that enable this breakthrough?

A computer's performance relative to its theoretical peak is measured using benchmarks such as SGEMM, DGEMM, other BLAS routines, and Linpack. A similar set of benchmarks exists for AI and ML, covering both the training and inference aspects. These benchmarks are standardized measurements of specific throughput numbers, solving a predetermined set of operations. Often, they are designed so that each subsection of the total system's performance demand can be executed on one CPU, fitting exactly in the L1 to L3 caches of the processor, and with a total dataset size that a single GPGPU can handle as an accelerator for the benchmark application.
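Whether a benchmark's working set fits in cache is simple arithmetic; the 32 MiB L3 capacity below is a typical size chosen for illustration:

    # Approximate working set of an n x n double-precision DGEMM
    # (C = A * B involves three n x n matrices of 8-byte doubles).
    def dgemm_working_set_bytes(n):
        return 3 * n * n * 8

    for n in (1_000, 2_000, 10_000):
        ws_mib = dgemm_working_set_bytes(n) / 1024**2
        verdict = "fits in" if ws_mib <= 32 else "exceeds"
        print(f"n={n:>6,}: {ws_mib:>8.1f} MiB -> {verdict} a 32 MiB L3")
    # n=1,000 (~23 MiB) fits; n=2,000 (~92 MiB) already does not. Benchmark
    # problem sizes tend to sit on the cache-friendly side of this line,
    # while real-world datasets routinely blow past it.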

In real life, no one evaluates upfront whether the problem that needs to be solved will exhaust the CPU cache or the total memory attached to the accelerator, or whether the latency between the CPU core and the GPGPU is low enough. Programmers just want their problem solved as quickly as possible, regardless of the dataset size. Users are even more removed from the hardware, as they don’t code at all; they simply use the combined solution of hardware and software. Consequently, most users have resigned themselves to accepting that the status quo cannot be overcome.

We disagree and believe that we have a solution to substantially improve upon what is currently available. Our improved internals of processors, accelerators, and memory, combined with unified and higher-performance interfaces, will overcome many of the limitations users experience today.

Q. The concept of a Server-on-a-Chip seems revolutionary for web services and high-transaction applications. Can you explain how this technology enhances integration and energy efficiency compared to traditional server architectures?

Today’s computational demands can be categorized into at least three groups. The first group is what the industry refers to as LAMP traffic. LAMP stands for Linux, Apache, MySQL, and PHP/Perl. Most internet traffic to and from websites and between servers serving data to end users is LAMP traffic. It is generic and independent of the Instruction Set Architecture (ISA) of the processor in the server. Since it consists of interpreted traffic and code, it does not require a specific processor type. Any server with an x86-64, ARM, or RISC-V processor can handle LAMP traffic. Estimates suggest that between 75% and 80% of all traffic is LAMP traffic. We have optimized our Server-on-a-Chip to execute LAMP traffic at the highest performance levels while maintaining the lowest power consumption. The chip includes many accelerators that enhance data transfer functions, so the CPU core does not need to handle these tasks in software. It also handles network and mass storage offload efficiently. Dedicated hardware for executing repetitive and simple tasks is faster and more energy-efficient than a CPU, allowing us to achieve higher performance under LAMP applications with lower power consumption.

Additionally, we have incorporated a high-performance interface into the Server-on-a-Chip to enable system manufacturers to build multi-processor servers without additional complex ASICs. This also allows the Server-on-a-Chip to utilize our HRAM for improved memory performance and to share large-scale datasets without needing to copy the data.
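The energy case for dedicated offload hardware can be sketched with simple per-request arithmetic; every figure below is a hypothetical placeholder rather than a measured Abacus number:

    # Per-request energy when repetitive plumbing (data movement, checksums,
    # network/storage handling) runs in software vs. on dedicated hardware.
    # All joule figures and the 10x offload advantage are hypothetical.
    def energy_per_request(app_joules, plumbing_joules, offloaded):
        factor = 0.1 if offloaded else 1.0
        return app_joules + plumbing_joules * factor

    app_j, plumbing_j = 0.2, 0.8  # assumed split for a LAMP-style request
    software = energy_per_request(app_j, plumbing_j, offloaded=False)
    offload = energy_per_request(app_j, plumbing_j, offloaded=True)
    print(f"software only: {software:.2f} J/request")
    print(f"with offload:  {offload:.2f} J/request "
          f"({software / offload:.1f}x less energy)")
    # If most of a request is repetitive data movement, offloading it
    # dominates the energy budget, which is the rationale given above.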

Much of the remaining 20% to 25% of traffic and computational demands involves processing large-scale datasets with advanced math. This can be specific to the processor's ISA. Many engineering applications were originally written for x86-64 from Intel or AMD. The Server-on-a-Chip cannot run these applications due to ISA incompatibility, and emulating math functions on the Server-on-a-Chip will not yield the desired performance. However, code written in High-Level Languages (HLL) that can be compiled and uses Application Programming Interfaces (APIs) such as OpenCL, OpenACC, or TensorFlow can be recompiled for the Server-on-a-Chip in conjunction with our math accelerators. The rest of the computational needs are addressed by OLTP and database applications, where the Server-on-a-Chip excels due to its scale-out and memory architecture.
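The portability argument hinges on code being written against a device-neutral API rather than an ISA. A minimal sketch using TensorFlow, one of the APIs named above (whether a particular accelerator is reachable depends on the backend available, which is an assumption here):

    # Code written against a device-neutral API never names an ISA, so it
    # can be retargeted by swapping the backend underneath it.
    import tensorflow as tf  # one of the APIs cited above

    a = tf.random.uniform((1024, 1024))
    b = tf.random.uniform((1024, 1024))
    c = tf.matmul(a, b)  # dispatched to whatever backend is present:
                         # CPU, GPU, or, given a suitable backend, another
                         # accelerator entirely
    print(c.shape, c.device)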

Q. How does Abacus Semiconductor address energy efficiency and sustainability in your designs?

If a computational problem can be solved with fewer processors and fewer memory subsystems, less power is consumed compared to existing technologies. Conversely, with the same amount of power as existing technology uses, vastly larger problems can be solved. Our efficiency advantage therefore translates directly into better sustainability.
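This reduces to energy-to-solution arithmetic; the node counts and wattages below are illustrative assumptions, not vendor figures:

    # Energy to solution = nodes x watts per node x runtime. Solving the
    # same problem in the same time on fewer nodes cuts energy in direct
    # proportion. All figures below are hypothetical.
    def energy_kwh(nodes, watts_per_node, hours):
        return nodes * watts_per_node * hours / 1000.0

    baseline = energy_kwh(nodes=10_000, watts_per_node=500, hours=24)
    improved = energy_kwh(nodes=4_000, watts_per_node=500, hours=24)
    print(f"baseline: {baseline:,.0f} kWh, improved: {improved:,.0f} kWh "
          f"({baseline / improved:.1f}x less energy for the same result)")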

Q. What are the next milestones or future products that Abacus Semiconductor aims to achieve in the coming years, particularly in the fields of AI, ML, and Big Data?

We are building out the team to implement the proof of concept in a number of ASICs so that we can demonstrate our advantage not only through simulation or emulation models but also on real hardware. We are also working with partners and the industry to find novel ways to reduce computational complexity in GenAI. Collaborations include efforts to reduce complexity in matrix and tensor operations, as well as in various types of transforms. Often, we find that the underlying math is quite simple, but the way the data is structured prevents it from being easily parallelized.
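That last point, simple math held back by data structure, can be seen even at the NumPy level: identical arithmetic runs very differently depending on layout. A generic sketch, not one of the collaborations mentioned (timings vary by machine):

    # Same math, different layout: summing every 8th element through a
    # strided view vs. a repacked contiguous copy. The arithmetic is
    # identical; only the data structure changes how well it vectorizes.
    import timeit
    import numpy as np

    x = np.random.rand(8_000_000)
    strided = x[::8]                            # non-contiguous view
    contiguous = np.ascontiguousarray(strided)  # same values, repacked

    t_strided = timeit.timeit(strided.sum, number=100)
    t_contig = timeit.timeit(contiguous.sum, number=100)
    print(f"strided: {t_strided:.3f}s, contiguous: {t_contig:.3f}s")
    # The contiguous copy typically sums several times faster: the math is
    # unchanged, but restructuring the data exposes it to vectorized,
    # parallel-friendly execution.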

The Ardent Visionary Behind Abacus Semiconductor Corporation’s Success

Axel Kloth is the Founder, President, and CEO of Abacus Semiconductor Corporation. A postgraduate in physics and computer science, Mr. Kloth has repeatedly accomplished what others deemed undoable, pushing the boundaries of what's possible in computing technology.

“We keep hearing that something is not doable, and so far we have always been able to prove the naysayers wrong. We plan on continuing to do so.”

– Axel Kloth

“We are building out the team to implement the proof of concept in a number of ASICs so that we can demonstrate our advantage not only through simulation or emulation models but also on real hardware.” 
