Israeli chipmaker Hailo Technologies has released the Hailo-10H, the first discrete AI accelerator designed for generative AI workloads at the edge. The device runs large language models (LLMs), vision-language models (VLMs), and other multimodal models directly on-device, with no need for cloud-based inference.

The Hailo-10H is built for power efficiency and low latency: it generates the first token in under one second and sustains 10 tokens per second on 2-billion-parameter LLMs. It can also generate images with Stable Diffusion 2.1 in under five seconds, a notable step forward for offline generative workloads.

The chip is built around Hailo’s second-generation neural core architecture and delivers 40 tera-operations per second (TOPS) at INT4 precision and 20 TOPS at INT8, at a typical power draw of 2.5 W. It is compatible with TensorFlow, PyTorch, ONNX, and Keras, and is supported by Hailo’s mature software stack. The device is designed for hybrid AI pipelines that combine LLMs or VLMs with traditional convolutional neural networks (CNNs), which conserves power while keeping responses real-time in mission-critical applications such as video analytics.
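
To put the quoted figures in perspective, the short Python sketch below works through the arithmetic they imply: nominal efficiency in TOPS per watt, and a rough upper bound on how long a reply of a given length would take. The function names and the 100-token reply length are hypothetical, chosen only for illustration. The sketch also assumes that the peak TOPS figures and the typical 2.5 W draw can be divided directly, and that generation runs at a steady 10 tokens per second after the first token; neither assumption is stated in the announcement.

```python
# Figures quoted in the article for the Hailo-10H.
INT4_TOPS = 40.0              # tera-operations per second at INT4
INT8_TOPS = 20.0              # tera-operations per second at INT8
TYPICAL_POWER_W = 2.5         # typical power draw in watts
FIRST_TOKEN_LATENCY_S = 1.0   # upper bound ("under one second")
TOKENS_PER_SECOND = 10.0      # quoted for 2-billion-parameter LLMs


def tops_per_watt(tops: float, power_w: float) -> float:
    """Nominal efficiency. Assumes peak TOPS and typical power can be
    paired, which the announcement does not confirm."""
    if power_w <= 0:
        raise ValueError("power_w must be positive")
    return tops / power_w


def response_time_s(
    num_tokens: int,
    first_token_s: float = FIRST_TOKEN_LATENCY_S,
    tokens_per_s: float = TOKENS_PER_SECOND,
) -> float:
    """Rough time to produce num_tokens: first-token latency plus the
    remaining tokens at an assumed steady generation rate."""
    if num_tokens < 1:
        raise ValueError("num_tokens must be at least 1")
    return first_token_s + (num_tokens - 1) / tokens_per_s


if __name__ == "__main__":
    print(f"INT4: {tops_per_watt(INT4_TOPS, TYPICAL_POWER_W):.1f} TOPS/W")
    print(f"INT8: {tops_per_watt(INT8_TOPS, TYPICAL_POWER_W):.1f} TOPS/W")
    # Hypothetical 100-token reply, chosen only as an example length.
    print(f"100-token reply: about {response_time_s(100):.1f} s or less")
```

Under these assumptions the quoted numbers work out to a nominal 16 TOPS/W at INT4 and 8 TOPS/W at INT8, and a 100-token reply would take roughly 11 seconds at most. Real-world figures would depend on the model, prompt length, and how much of the peak throughput a given workload actually uses.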