Together AI announces ATLAS, an adaptive speculator system delivering up to 400% faster inference with a dual-speculator architecture: a heavyweight static model trained on broad data, plus a lightweight adaptive model that learns continuously from live traffic • DigiBanker

Together AI has announced research and a new system called ATLAS (AdapTive-LeArning Speculator System), designed to help enterprises overcome the limits of static speculators, which do not adapt as workloads change. The technique provides self-learning inference optimization that can deliver up to 400% faster inference than the baseline performance of existing inference technologies such as vLLM. The system addresses a critical problem: as AI workloads evolve, inference speeds degrade, even with specialized speculators in place.

ATLAS uses a dual-speculator architecture that combines stability with adaptation. It has three components:

The static speculator – A heavyweight model trained on broad data provides consistent baseline performance. It serves as a “speed floor.”
The adaptive speculator – A lightweight model learns continuously from live traffic. It specializes on the fly to emerging domains and usage patterns.
The confidence-aware controller – An orchestration layer dynamically chooses which speculator to use. It adjusts the speculation “lookahead” based on confidence scores.

The technical innovation lies in balancing acceptance rate (how often the target model agrees with drafted tokens) against draft latency. As the adaptive model learns from traffic patterns, the controller relies more on the lightweight speculator and extends the lookahead, which compounds the performance gains.

Together AI’s testing shows ATLAS reaching 500 tokens per second on DeepSeek-V3.1 when fully adapted. More notably, those results, measured on Nvidia B200 GPUs, match or exceed specialized inference chips such as Groq’s custom hardware.
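
The trade-off between acceptance rate and draft latency can be illustrated with a small sketch. This is not Together AI's implementation, and the article does not describe the controller's actual policy. The class `SpeculatorStats`, the functions `expected_tokens`, `speedup`, and `choose`, and the relative `draft_cost` values are all hypothetical. The sketch assumes each drafted token is accepted independently with the same probability, a common simplification in analyses of speculative decoding.

```python
from dataclasses import dataclass


@dataclass
class SpeculatorStats:
    """Running statistics for one speculator (hypothetical structure)."""
    acceptance: float   # moving-average acceptance rate, in [0, 1]
    draft_cost: float   # assumed cost of drafting one token, relative to one target-model pass

    def update(self, accepted: int, drafted: int, decay: float = 0.9) -> None:
        """Fold one verification result into the moving average."""
        if drafted > 0:
            observed = accepted / drafted
            self.acceptance = decay * self.acceptance + (1 - decay) * observed


def expected_tokens(acceptance: float, lookahead: int) -> float:
    """Expected tokens emitted per target-model pass.

    Assumes independent acceptance with probability `acceptance`;
    the target model always contributes one token of its own.
    """
    if acceptance >= 1.0:
        return float(lookahead + 1)
    return (1 - acceptance ** (lookahead + 1)) / (1 - acceptance)


def speedup(stats: SpeculatorStats, lookahead: int) -> float:
    """Estimated speedup over plain decoding for a given lookahead."""
    cost = lookahead * stats.draft_cost + 1  # drafting plus one verification pass
    return expected_tokens(stats.acceptance, lookahead) / cost


def choose(static: SpeculatorStats, adaptive: SpeculatorStats,
           max_lookahead: int = 8) -> tuple[str, int]:
    """Pick the speculator and lookahead with the highest estimated speedup."""
    candidates = [
        (speedup(stats, k), name, k)
        for name, stats in (("static", static), ("adaptive", adaptive))
        for k in range(1, max_lookahead + 1)
    ]
    _, name, k = max(candidates)
    return name, k


# Illustrative numbers only: a slower but accurate static speculator,
# and a cheap adaptive one that has not yet learned the traffic.
static = SpeculatorStats(acceptance=0.7, draft_cost=0.2)
adaptive = SpeculatorStats(acceptance=0.3, draft_cost=0.05)
print(choose(static, adaptive))

# Simulate the adaptive speculator improving on live traffic.
for _ in range(30):
    adaptive.update(accepted=9, drafted=10)
print(choose(static, adaptive))
```

Under these made-up numbers, the controller starts on the static speculator with a short lookahead, then switches to the adaptive one with a longer lookahead once its acceptance rate rises. That mirrors the behavior the article describes, although a production controller would likely rely on per-token confidence scores rather than a single moving average.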
