Liquid AI has released LFM2-VL, a new generation of vision-language foundation models designed for efficient deployment across a wide range of hardware, from smartphones and laptops to wearables and embedded systems. The models promise low-latency performance, strong accuracy, and flexibility for real-world applications. According to Liquid AI, they deliver up to twice the GPU inference speed of comparable vision-language models while maintaining competitive performance on common benchmarks.

The release includes two model sizes:

- LFM2-VL-450M, a hyper-efficient model with fewer than half a billion parameters (the model's internal settings), aimed at highly resource-constrained environments.
- LFM2-VL-1.6B, a more capable model that remains lightweight enough for single-GPU and on-device deployment.

Both variants process images at native resolution up to 512×512 pixels, avoiding distortion or unnecessary upscaling. For larger images, the system applies non-overlapping patching and adds a thumbnail for global context, enabling the model to capture both fine detail and the broader scene.

Unlike traditional architectures, Liquid’s approach aims to deliver competitive or superior performance using significantly fewer computational resources, allowing for real-time adaptability during inference while maintaining low memory requirements. This makes LFMs well suited for both large-scale enterprise use cases and resource-limited edge deployments.

LFM2-VL uses a modular architecture that combines a language model backbone, a SigLIP2 NaFlex vision encoder, and a multimodal projector. The projector includes a two-layer MLP connector with pixel unshuffle, which reduces the number of image tokens and improves throughput. Users can adjust parameters such as the maximum number of image tokens or patches, letting them balance speed and quality for a given deployment scenario. The training process involved approximately 100 billion multimodal tokens, sourced from open datasets and in-house synthetic data.
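
To make the image-handling behavior concrete, the sketch below shows one way a preprocessor could split an oversized image into non-overlapping 512×512 patches plus a downscaled thumbnail for global context. This is an illustrative approximation of the approach described above, not Liquid AI's actual preprocessing code; the patch size comes from the announced 512×512 native resolution, and the thumbnail dimensions are an assumption.

```python
from PIL import Image

PATCH_SIZE = 512             # native resolution handled by the encoder (per the announcement)
THUMBNAIL_SIZE = (256, 256)  # assumed size for the global-context thumbnail

def split_into_patches(image: Image.Image):
    """Split an image into non-overlapping PATCH_SIZE x PATCH_SIZE tiles,
    plus a thumbnail that preserves the overall scene for global context."""
    width, height = image.size

    # Images at or below the native resolution pass through unchanged.
    if width <= PATCH_SIZE and height <= PATCH_SIZE:
        return [image], None

    patches = []
    for top in range(0, height, PATCH_SIZE):
        for left in range(0, width, PATCH_SIZE):
            box = (left, top,
                   min(left + PATCH_SIZE, width),
                   min(top + PATCH_SIZE, height))
            patches.append(image.crop(box))

    # A downscaled copy of the full image supplies the broader-scene view.
    thumbnail = image.copy()
    thumbnail.thumbnail(THUMBNAIL_SIZE)
    return patches, thumbnail
```

Under this sketch, a 1024×768 input would produce four non-overlapping tiles plus a thumbnail; how the real pipeline pads or resizes edge tiles is not specified in the announcement.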
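
The pixel-unshuffle step in the projector can be illustrated with a minimal PyTorch sketch of the general technique: an r×r neighborhood of vision tokens is folded into a single wider token, shrinking the token count by r² before a two-layer MLP maps the result into the language model's embedding space. The hidden sizes and unshuffle factor below are placeholders, not LFM2-VL's actual configuration.

```python
import torch
import torch.nn as nn

class MultimodalProjector(nn.Module):
    """Illustrative two-layer MLP connector with pixel unshuffle.

    Pixel unshuffle folds an r x r neighborhood of vision tokens into one
    token with r^2 times the channels, cutting the number of image tokens
    sent to the language model by a factor of r^2.
    """

    def __init__(self, vision_dim=1024, text_dim=2048, unshuffle_factor=2):
        super().__init__()
        self.r = unshuffle_factor
        in_dim = vision_dim * unshuffle_factor ** 2
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, grid_h * grid_w, vision_dim), assumed square grid
        b, n, c = vision_tokens.shape
        side = int(n ** 0.5)
        x = vision_tokens.transpose(1, 2).reshape(b, c, side, side)
        x = nn.functional.pixel_unshuffle(x, self.r)  # (b, c*r^2, side/r, side/r)
        x = x.flatten(2).transpose(1, 2)              # (b, n/r^2, c*r^2)
        return self.mlp(x)                            # (b, n/r^2, text_dim)


# Example: a 32x32 grid of 1024-dim vision tokens becomes 256 text-space tokens.
tokens = torch.randn(1, 32 * 32, 1024)
print(MultimodalProjector()(tokens).shape)  # torch.Size([1, 256, 2048])
```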
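
The speed/quality trade-off exposed through the image-token budget might be exercised roughly as follows when loading a checkpoint through Hugging Face Transformers. This is a hedged sketch: the repository id and the max_image_tokens keyword are assumptions about the released interface, so the LFM2-VL model cards should be consulted for the exact arguments.

```python
# Illustrative sketch, not verified against the released API: the repository id
# and the max_image_tokens argument are assumptions; check the LFM2-VL model
# cards on Hugging Face for the exact interface.
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "LiquidAI/LFM2-VL-1.6B"  # assumed Hugging Face repository id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

conversation = [
    {"role": "user",
     "content": [{"type": "image"},
                 {"type": "text", "text": "Describe this image in one sentence."}]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image = Image.open("scene.jpg")

# A smaller image-token budget trades visual detail for lower latency; a larger
# one does the reverse. The keyword name is an assumption about the processor.
inputs = processor(text=prompt, images=[image], return_tensors="pt",
                   max_image_tokens=256).to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```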