
DigiBanker

Bringing you cutting-edge new technologies and disruptive financial innovations.


GenAI shifts from training to inference infrastructure; with 75% of AI compute demand forecast to come from inference by 2030, millisecond-level latency optimization is critical for production deployment

September 24, 2025 // by Finnovate

Inference infrastructure is emerging as a distinct layer in the generative AI stack, bridging compute and applications. Here are six things to be aware of:

1) The Shift From Training to Inference: AI’s spotlight has long been on training, with companies amassing data and building larger models. The real challenge now is inference: running those models in production, serving billions of queries and delivering instant results.

2) What Inference Really Means: Training is when a model learns from massive datasets on high-powered hardware. Inference is when a trained model is applied to new inputs in real time. It powers everything from ChatGPT prompts to fraud checks and search queries. This constant, real-time activity never stops. Keeping ChatGPT online alone reportedly costs OpenAI tens of millions of dollars per month.

3) The Scale of Demand: Generative AI has moved from research to mainstream use, creating billions of inference events daily. As of July 2025, OpenAI reported handling 2.5 billion prompts per day, including 330 million from U.S. users. Brookfield forecasts suggest that 75 percent of all AI compute demand will come from inference by 2030.

4) Why Infrastructure Matters: Unlike training, inference is the production phase. Latency, cost, scale, energy use and deployment location all determine whether an AI service works or fails. Optimized infrastructure spans computing, networking, software and deployment strategies to keep predictions reliable at scale.

5) Latency Is Business-Critical: Milliseconds make or break user experience. A delay can frustrate chatbot users or, worse, prevent a fraud detection system from stopping a fraudulent payment in time. Every millisecond counts when millions of customers are involved.

6) Cutting Costs With Optimization: Inference is a recurring operating expense, not a one-time investment. Providers rely on optimization techniques to lower costs without sacrificing accuracy (see the sketches after this list):
  • Batching: processing multiple requests at once
  • Caching: reusing frequent results
  • Speculative decoding: letting a smaller model draft quick answers before a larger one verifies them
  • Quantization: reducing numerical precision to cut compute and energy use
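To make the cost levers in point 6 concrete, here is a minimal Python sketch of the first two techniques, batching and caching. It is an illustration only, not any provider's actual serving stack; run_model, cached_answer and serve_batch are hypothetical names standing in for a real model call and serving layer.

```python
from functools import lru_cache
from typing import List

def run_model(prompts: List[str]) -> List[str]:
    # Hypothetical stand-in for a real batched forward pass on an accelerator.
    return [f"answer to: {p}" for p in prompts]

@lru_cache(maxsize=10_000)
def cached_answer(prompt: str) -> str:
    # Caching: identical prompts are answered from memory instead of
    # re-running the model, eliminating repeat compute.
    return run_model([prompt])[0]

def serve_batch(pending: List[str], max_batch: int = 32) -> List[str]:
    # Batching: group up to `max_batch` waiting requests into a single model
    # call so per-request overhead is amortised across the batch.
    results: List[str] = []
    for i in range(0, len(pending), max_batch):
        results.extend(run_model(pending[i:i + max_batch]))
    return results

print(cached_answer("What is my account balance?"))  # a second identical call is served from cache
print(serve_batch(["q1", "q2", "q3"]))
```

Quantization, the last technique in the list, can be sketched in the same spirit: store weights as 8-bit integers plus a scale factor instead of 32-bit floats, trading a small amount of precision for roughly four times less memory. Again, this is a toy illustration with assumed helper names, not a production quantization scheme.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    # Map float32 weights onto the int8 range [-127, 127] with one scale factor.
    scale = max(float(np.max(np.abs(weights))) / 127.0, 1e-12)
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover approximate float32 weights from the int8 representation.
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs error:", float(np.abs(w - dequantize(q, scale)).max()))
```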


Category: Additional Reading


