
DigiBanker

Bringing you cutting-edge new technologies and disruptive financial innovations.


Nvidia proposes “speculative decoding,” which uses a second, smaller model to guess what the main model will output for a given prompt in an attempt to speed it up

August 26, 2025 //  by Finnovate

Nvidia announced advances in artificial intelligence software and networking aimed at accelerating AI infrastructure and model deployment. It unveiled Spectrum-XGS, or “giga-scale,” an extension of its Spectrum-X Ethernet switching platform designed for AI workloads. Spectrum-X connects entire clusters within a data center, allowing massive datasets to stream across AI models; Spectrum-XGS extends this by orchestrating and interconnecting multiple data centers. “We’re introducing this new term, ‘scale across,’” said Dave Salvator, director of accelerated computing products at Nvidia. “These switches are basically purpose built to enable multi-site scale with different data centers able to communicate with each other and essentially act as one gigantic GPU.” Salvator said the system minimizes jitter and latency: the variability in packet arrival times and the delay between sending data and receiving a response.

Nvidia also highlighted Dynamo, its inference-serving framework, which governs how models are deployed and serve requests. In addition, the company is researching “speculative decoding,” which uses a second, smaller model to guess what the main model will output for a given prompt in an attempt to speed it up. “The way that this works is you have what’s called a draft model, which is a smaller model which attempts to sort of essentially generate potential next tokens,” said Salvator. Because the smaller model is faster but less accurate, it can generate multiple guesses for the main model to verify. “And we’ve already seen about a 35% performance gain using these techniques.” According to Salvator, the main AI model verifies the draft’s tokens in parallel against its learned probability distribution; only accepted tokens are committed, and rejected tokens are discarded. This keeps latency under 200 milliseconds, which he described as “snappy and interactive.”
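The draft-then-verify loop described above can be sketched in a few lines of Python. This is a minimal illustration, not Nvidia's Dynamo implementation: `target_next` and `draft_next` are hypothetical stand-ins for the large and small models, and the verification loop stands in for what would be a single batched forward pass on the GPU.

```python
import random

random.seed(0)
VOCAB = list(range(100))

def target_next(context):
    # Stand-in for the main ("target") model: a deterministic
    # hash-based next-token choice that serves as ground truth.
    return hash(tuple(context)) % len(VOCAB)

def draft_next(context):
    # Stand-in for the small draft model: cheaper but less accurate.
    # Here it agrees with the target ~80% of the time.
    if random.random() < 0.8:
        return target_next(context)
    return random.choice(VOCAB)

def speculative_decode(prompt, n_tokens, k=4):
    """Greedy speculative-decoding sketch.

    The draft model proposes k tokens at a time; the target model
    verifies all k positions at once (one "pass"), commits the
    accepted prefix, and substitutes its own token at the first
    mismatch. Rejected draft tokens are discarded.
    """
    out = list(prompt)
    target_passes = 0  # each verification batch counts as one target pass
    while len(out) - len(prompt) < n_tokens:
        # 1) Draft model speculates k tokens sequentially (it is cheap).
        draft, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Target model verifies the k positions in one pass.
        target_passes += 1
        accepted = []
        for t in draft:
            expect = target_next(out + accepted)
            if t == expect:
                accepted.append(t)       # draft token verified: commit it
            else:
                accepted.append(expect)  # reject: commit target's token instead
                break
        out.extend(accepted)
    return out[len(prompt):][:n_tokens], target_passes

tokens, passes = speculative_decode([1, 2, 3], n_tokens=32, k=4)
print(len(tokens), passes)  # 32 tokens in fewer than 32 target passes
```

Because each committed token is checked against the target model's own choice, the output is identical to decoding with the target model alone; the speedup comes from covering several tokens per verification pass whenever the draft guesses correctly.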


Category: AI & Machine Economy, Innovation Topics


Copyright © 2025 Finnovate Research · All Rights Reserved · Privacy Policy
Finnovate Research · Knyvett House · Watermans Business Park · The Causeway Staines · TW18 3BA · United Kingdom · About · Contact Us · Tel: +44-20-3070-0188
