First Processor IP Core or Consumer Device Silicon Product Known to Run Llama2
BURLINGAME, Calif.–(BUSINESS WIRE)–#Chimera–Quadric® today announced that support for the Llama 2 large language model (LLM) is immediately available on its ChimeraTM general purpose neural processing unit (GPNPU) intellectual property (IP) core. Unlike other IP and semiconductor application processor suppliers, Quadric was able to add this support with a simple software port with no hardware changes, so existing designs can immediately run this model. Other suppliers have announced plans to change their hardware to offer support in 2024 or beyond.
Quadric Beats Other Timelines for Support
Meta introduced the Llama2 LLM for generative artificial intelligence (AI) on July 18 of this year. Coincident with the unveiling of Llama2, Meta and Qualcomm announced a partnership to port Llama2 to future Qualcomm Snapdragon chips expected in 2024 smartphones and laptops. Use of LLMs had previously been considered to only be viable in cloud datacenters. Meta’s announcement set off a flurry of activity from chip and IP providers to race to capture market attention – and investment dollars – for on-device LLM implementations.
In addition to Qualcomm’s announced timeline of greater than six months to port Llama2, fellow silicon company Mediatek announced four weeks later that it, too, was working on Llama2 support with an expectation that 2024 mobile phones powered by Mediatek flagship silicon would support Llama2. Also in August, IP licensor Ceva announced that a redesigned IP core would soon be delivered to its customers that could support LLMs. Cadence similarly unveiled new IP in September to support LLMs. Those Ceva and Cadance customers will need to license new cores and design new chips for delivery sometime in 2025 or 2026. Notably, all four chip and IP announcements directly or indirectly imply that silicon respins are required to gain this new capability.
“Why would these titans of the semiconductor and IP worlds need to wait until 2024 or 2025 or beyond to support today’s newest, hottest ML model?” noted Quadric CMO, Steve Roddy. “Why would a new machine learning model force a silicon respin – typically costing in excess of $100M – to be able to run new ML code? SoCs with Quadric’s Chimera GPNPU are ready to run Llama2 today!”
Why Not Just Port to the CPU?
Existing silicon chips for consumer devices all have leading-edge applications-class CPUs. Many have eight ostensibly high-performance CPUs. Why wouldn’t SoC vendors simply port Llama2 to the CPU, which can run any ML model? Clearly either the performance won’t meet consumer expectations, or the power consumed will kill device battery life. Hence the need to respin the ML accelerator to run LLMs. Or why not choose a fully programmable GPU from a leading ML vendor such as Nvidia? Perhaps the 10 Watt power dissipation and the need for a cooling fan prohibits use in a smartphone?
Like those easily programmable but power-hungry CPU and GPU solutions, Quadric’s GPNPU is also easily programmable. But only Quadric Chimera GPNPUs deliver programmability with the power-performance profile needed in today’s portable consumer devices.
Quadric’s processor architecture uniquely combines the best attributes of C++ programmability – the ability to run any ML model – with the performance efficiency of NPU accelerators found in many first-generation SoCs in the market today. But unlike inflexible accelerators that force silicon respins when complex new models such as Llama2 are invented, Chimera cores are fully programmable. Chimera GPNPUs run any model. All of the model – all of the layers. No removal of problematic layers. No partitioning. No forcing the data scientist to convert convolutions to adhere to the limited subset of conv types supported in hardware. Any model, any network, any operator.
Fast Porting, Low Effort
Quadric’s team invested a total of 13 engineer-weeks over a total of 4 elapsed weeks to port an INT8 quantized version of Llama2 to the Chimera platform and tune performance. Two new ML operator layers, and two variants of existing operator kernels, were coded in C++ by the Quadric applications team to get the model running. A further two engineer-weeks investment ironed out corner case performance and accuracy tweaks to ensure operation across all three sizes of the Chimera QB series processors (1 TOPs, 4 TOPs and 16 TOPs variants). Other machine learning inference solution providers with much larger teams are still struggling to meet six-month porting targets!
Quadric’s Chimera QB4 4 TOPs GPNPU running Llama2 15M delivers 225 Token/Sec/Watt efficiency in a 5nm technology, while occupying only 2.5 mm2.
For comparison, an M1 Pro laptop’s highest performance single CPU delivers only 11 Token/Sec/W running the same Int8 version of Llama2. (Int8 model running ONNX runtime). Quadric delivers 20X improvement in ML inference per Watt compared to a leading-edge CPU!
Visit Quadric.io to learn more.
Quadric Inc. is the leading licensor of general-purpose neural processor IP (GPNPU) that runs both machine learning inference workloads and classic DSP and control algorithms. Quadric’s unified hardware and software architecture is optimized for on-device ML inference. Learn more at www.quadric.io.