
DigiBanker

Bringing you cutting-edge new technologies and disruptive financial innovations.


Together AI announces ATLAS adaptive speculator system delivering up to 400% inference speedup using a dual-speculator architecture combining a heavyweight static model trained on broad data with a lightweight adaptive model learning continuously from live traffic patterns

October 14, 2025 //  by Finnovate

Together AI announced research and a new system called ATLAS (AdapTive-LeArning Speculator System) that aims to help enterprises overcome the challenge of static speculators. The technique provides a self-learning inference optimization capability that can deliver up to 400% faster inference than the baseline performance of existing inference engines such as vLLM. The system addresses a critical problem: as AI workloads evolve, inference speeds degrade, even with specialized speculators in place.

ATLAS uses a dual-speculator architecture that combines stability with adaptation:

  • The static speculator – a heavyweight model trained on broad data that provides consistent baseline performance and serves as a "speed floor."
  • The adaptive speculator – a lightweight model that learns continuously from live traffic, specializing on the fly to emerging domains and usage patterns.
  • The confidence-aware controller – an orchestration layer that dynamically chooses which speculator to use and adjusts the speculation "lookahead" based on confidence scores.

The technical innovation lies in balancing acceptance rate (how often the target model agrees with drafted tokens) against draft latency. As the adaptive model learns from traffic patterns, the controller relies more on the lightweight speculator and extends the lookahead, compounding the performance gains. Together AI's testing shows ATLAS reaching 500 tokens per second on DeepSeek-V3.1 when fully adapted. More impressively, those numbers on Nvidia B200 GPUs match or exceed specialized inference chips such as Groq's custom hardware.
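The interplay the article describes between acceptance rate and lookahead can be made concrete. The sketch below is an illustrative assumption, not Together AI's actual implementation or API: it pairs the standard speculative-decoding formula for expected accepted tokens per verification step with a toy controller that falls back to the static "speed floor" speculator at low confidence and extends lookahead as confidence in the adaptive speculator grows.

```python
def expected_accepted_tokens(alpha: float, k: int) -> float:
    """Expected tokens produced per target-model verification step when
    each drafted token is accepted independently with probability alpha
    and the draft length (lookahead) is k.

    Classic speculative-decoding result: (1 - alpha**(k+1)) / (1 - alpha).
    Higher alpha or longer lookahead -> more tokens per expensive call.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


class ConfidenceAwareController:
    """Toy version of the orchestration layer from the article:
    route between a static and an adaptive speculator, scaling
    lookahead with confidence. Thresholds are illustrative."""

    def __init__(self, base_lookahead: int = 4,
                 max_lookahead: int = 16,
                 threshold: float = 0.7):
        self.base_lookahead = base_lookahead
        self.max_lookahead = max_lookahead
        self.threshold = threshold

    def choose(self, adaptive_confidence: float) -> tuple[str, int]:
        # Low confidence in the adaptive model: use the heavyweight
        # static speculator with a conservative lookahead.
        if adaptive_confidence < self.threshold:
            return ("static", self.base_lookahead)
        # High confidence: use the lightweight adaptive speculator and
        # extend lookahead proportionally, so each verification step
        # drafts (and, at high acceptance rates, keeps) more tokens.
        span = self.max_lookahead - self.base_lookahead
        frac = (adaptive_confidence - self.threshold) / (1.0 - self.threshold)
        return ("adaptive", self.base_lookahead + round(span * frac))
```

Under this model the compounding effect is visible directly: raising the acceptance rate makes longer lookaheads pay off, e.g. `expected_accepted_tokens(0.8, 12)` exceeds `expected_accepted_tokens(0.8, 4)`, which is why the controller extends lookahead only as the adaptive speculator's confidence improves.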


Category: AI & Machine Economy, Innovation Topics


Copyright © 2025 Finnovate Research · All Rights Reserved · Privacy Policy
Finnovate Research · Knyvett House · Watermans Business Park · The Causeway Staines · TW18 3BA · United Kingdom · About · Contact Us · Tel: +44-20-3070-0188
