Hugging Face’s SmolVLM is a compact open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs. It is notably memory-efficient, requiring only 5.02 GB of GPU RAM.