There are multiple techniques for handling AI fine-tuning, training and inference at the edge. Options beyond a GPU include Neural Processing Units (NPUs), such as those from silicon vendor Kneron.
At the Computex conference in Taiwan today, Kneron unveiled its next generation of silicon and server technology designed to power edge AI inference and fine-tuning. Founded in 2015, Kneron counts Qualcomm and Sequoia Capital among its investors. In 2023, the company announced its KL730 NPU to help address the global GPU shortage. Now, Kneron is launching its next-generation KL830 and offering a glimpse of the future KL1140, which is scheduled to launch in 2025. In addition to the new NPU silicon, Kneron is also expanding its AI server portfolio with the KNEO 330 Edge GPT server, which enables offline inference capabilities.
Kneron is part of a small but growing number of vendors, including Groq and SambaNova, looking to leverage technologies other than GPUs to improve the performance and efficiency of AI workloads.
Edge AI and private LLMs with NPU support
A growing focus for Kneron with this update is enabling private GPT servers that can be run locally.
Instead of an organization having to rely on a large system with cloud connectivity, a private GPT server can run locally at the edge of a network for inference. That is the promise of Kneron's KNEO system.
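To make that deployment model concrete, here is a minimal sketch of how a client on the local network might query such an on-prem GPT endpoint. The hostname, port, path and JSON schema are assumptions for illustration, not a documented KNEO interface:

```python
# Sketch: querying a hypothetical on-prem GPT server over the LAN,
# instead of calling a cloud API. Endpoint and schema are assumed.
import json
import urllib.request

payload = {"prompt": "Summarize today's maintenance tickets.", "max_tokens": 128}
req = urllib.request.Request(
    "http://kneo-330.local:8080/v1/completions",  # hypothetical on-prem host
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:  # request never leaves the local network
    print(json.load(resp))
```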
Kneron CEO Albert Liu told VentureBeat that the KNEO 330 system integrates multiple KL830 edge AI chips into a small-form-factor server. The promise of the system, Liu said, is that it enables low-cost on-premises GPT deployments for enterprises. The predecessor system, the KNEO 300, powered by the KL730, is already in use at large organizations, including Stanford University in California.
The KL830 chip, which sits between the company's existing KL730 and the upcoming KL1140, is designed specifically for voice models and can be cascaded to support larger models while maintaining low power consumption.
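Kneron has not published details of how cascading works, but the general idea of partitioning a model across several low-power chips can be sketched as follows. The NPU class, device names and stage split here are purely illustrative:

```python
# Illustration of "cascading": split a model's stages across several
# low-power chips and stream activations from one chip to the next.
# This is a conceptual sketch, not Kneron's actual API.
from dataclasses import dataclass

@dataclass
class NPU:
    name: str

    def run(self, stage: str, x: float) -> float:
        # Stand-in for executing one model stage on this chip.
        print(f"{self.name} executing {stage}")
        return x + 1.0

# Split a six-stage model across three KL830-class devices.
stages = ["embed", "block1", "block2", "block3", "block4", "head"]
chips = [NPU("npu0"), NPU("npu1"), NPU("npu2")]
assignment = {s: chips[i * len(chips) // len(stages)] for i, s in enumerate(stages)}

x = 0.0
for stage in stages:
    x = assignment[stage].run(stage, x)  # activations flow chip to chip
```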
While Kneron focuses on hardware, software also plays a role.
Kneron now has a number of ways to train and fine-tune models that run on the company's hardware. Liu said Kneron combines several open models and then fine-tunes them to run on NPUs.
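Liu did not detail the training recipe, but the standard pattern his description suggests is transfer learning: start from a pretrained open model, freeze most of its weights and retrain a small task-specific head on domain data. A minimal Keras sketch, with an illustrative base model and output size:

```python
# Generic fine-tuning pattern: reuse a pretrained open model and train
# only a new head. Base model and class count are illustrative.
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    include_top=False, pooling="avg", input_shape=(224, 224, 3)
)
base.trainable = False  # freeze pretrained weights; train only the new head

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(5, activation="softmax"),  # task-specific head
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-4),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
# model.fit(domain_dataset, epochs=3)  # domain data omitted in this sketch
```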
Kneron now also supports transferring trained models to its chips via a neural compiler. This tool allows users to take models trained with frameworks such as TensorFlow, Caffe or MXNet and compile them for use on Kneron chips.
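In practice such a flow usually means training and exporting the model in its source framework, then handing the exported graph to the vendor's compiler. The sketch below uses TensorFlow for the real export step; the final compile call is a placeholder, since Kneron's actual toolchain API is not described here:

```python
# Sketch of a vendor compile flow: train in a standard framework,
# export, then hand off to the chip vendor's neural compiler.
import tensorflow as tf

# Train an ordinary Keras model as usual.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(64,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(x_train, y_train, epochs=3)  # training data omitted here

# Export in the framework's native format; a neural compiler typically
# ingests this (or an interchange format such as ONNX).
tf.saved_model.save(model, "exported_model")

# Placeholder for the vendor-specific step: the compiler quantizes the
# graph (e.g., to 8-bit) and emits a binary the NPU can execute.
# "kneron_compiler" below is a hypothetical name, not Kneron's toolchain.
# compiled = kneron_compiler.compile("exported_model", target="KL830")
# compiled.save("model.bin")
```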
Kneron's new hardware can also be used to support retrieval-augmented generation (RAG) workflows. Liu pointed out that, compared to GPUs, Kneron's chips use a unique structure to reduce the memory footprint of the large vector databases that RAG requires. This allows RAG to operate with less memory and power consumption.
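For readers unfamiliar with RAG, the loop itself is simple: embed a query, look up the nearest document chunks in a vector index, and feed the retrieved text to the model as context. The toy example below uses a placeholder hash-based embedder and an in-memory numpy index; on Kneron's hardware, both the embedding and generation steps would run on the NPU:

```python
# Minimal RAG loop. The embedder and the generation call are
# placeholders; real systems use a trained embedding model.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedder: deterministic random unit vector per text."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

# The vector index holds one embedding per document chunk. This index is
# what dominates RAG memory use, hence the value of a reduced footprint.
docs = ["The KL830 peaks at 2 watts.",
        "The KNEO 330 runs GPT models offline.",
        "The KL730 launched in 2023."]
index = np.stack([embed(d) for d in docs])

def retrieve(query: str, k: int = 2) -> list[str]:
    scores = index @ embed(query)  # cosine similarity (unit vectors)
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

context = "\n".join(retrieve("What is the KL830's power draw?"))
prompt = f"Context:\n{context}\n\nQuestion: What is the KL830's power draw?"
# answer = npu_llm.generate(prompt)  # placeholder for on-device generation
print(prompt)
```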
Kneron's secret sauce: low power consumption
One of the main differences between Kneron's technology and GPU-based alternatives is its low power consumption.
“I think the main difference is that our power consumption is so low,” Liu said.
According to Kneron, the new KL830 has a peak power consumption of just 2 watts. Despite this low draw, the chip offers consolidated computing power (CCP) of up to 10 eTOPS at 8-bit precision, the company says.
Liu said the low power consumption allows Kneron's chips to be integrated into a variety of devices, including PCs, without the need for additional cooling solutions.