Fulfilling the promise of AI requires a step change in capabilities far exceeding the advancements of the internet era. To achieve this, we as an industry must revisit some of the foundations that drove the previous transformation and innovate collectively to rethink the entire technology stack. We are now witnessing a decisive shift towards specialized hardware — including ASICs, GPUs, and tensor processing units (TPUs) — that delivers orders-of-magnitude improvements in performance per dollar and per watt compared to general-purpose CPUs. This proliferation of domain-specific compute units, optimized for narrower tasks, will be critical to driving the continued rapid advances in AI.

These specialized systems will often require “all-to-all” communication, with terabit-per-second bandwidth and nanosecond latencies that approach local memory speeds. To scale gen AI workloads across vast clusters of specialized accelerators, we are seeing the rise of specialized interconnects, such as ICI for TPUs and NVLink for GPUs. These purpose-built networks prioritize direct memory-to-memory transfers and use dedicated hardware to speed information sharing among processors, effectively bypassing the overhead of traditional, layered networking stacks. This move towards tightly integrated, compute-centric networking will be essential to overcoming communication bottlenecks and scaling the next generation of AI efficiently.

Traditional fault tolerance relies on redundancy among loosely connected systems to achieve high uptime. ML computing demands a different approach. First, the sheer scale of computation makes over-provisioning too costly. Second, model training is a tightly synchronized process, where a single failure can cascade to thousands of processors. Finally, advanced ML hardware often pushes the boundaries of current technology, potentially leading to higher failure rates. Instead, the emerging strategy involves frequent checkpointing — saving computation state — coupled with real-time monitoring, rapid allocation of spare resources and quick restarts. The underlying hardware and network design must enable swift failure detection and seamless component replacement to maintain performance.

While traditional system design focuses on maximum performance per chip, we must shift to an end-to-end design focused on delivered, at-scale performance per watt. This approach is vital because it considers all system components — compute, network, memory, power delivery, cooling and fault tolerance — working together seamlessly to sustain performance.
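To make the all-to-all pattern above concrete, here is a minimal JAX sketch of a collective exchange that the runtime lowers onto the accelerator interconnect (ICI on TPUs, NVLink on GPUs) rather than the host's layered networking stack. The axis name, toy shapes and use of pmap are illustrative assumptions, not a prescribed implementation.

```python
from functools import partial

import jax
import jax.numpy as jnp

n = jax.device_count()  # number of local accelerators (TPU cores or GPUs)

@partial(jax.pmap, axis_name="devices")
def shuffle(x):
    # Each device splits its block along axis 0 and sends slice i to device i.
    # The collective maps onto the device-to-device links, not the host CPU's
    # network stack.
    return jax.lax.all_to_all(x, axis_name="devices", split_axis=0, concat_axis=0)

# One (n, 4) block per device. After the exchange, row j on device i holds the
# slice that device j originally had destined for device i.
blocks = jnp.arange(n * n * 4, dtype=jnp.float32).reshape(n, n, 4)
print(shuffle(blocks).shape)  # (n, n, 4)
```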
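The checkpoint-and-restart strategy described above can also be sketched as a simple training loop. This is a minimal illustration under assumed details (checkpoint directory, save interval and a stand-in update step), not the actual machinery used in production training systems.

```python
import os
import pickle

import jax
import jax.numpy as jnp

CKPT_DIR = "/tmp/ckpts"   # hypothetical checkpoint location
SAVE_EVERY = 100          # hypothetical checkpoint interval, in steps

def save_checkpoint(step, params):
    """Persist training state so a restart loses at most SAVE_EVERY steps."""
    os.makedirs(CKPT_DIR, exist_ok=True)
    with open(os.path.join(CKPT_DIR, f"step_{step}.pkl"), "wb") as f:
        pickle.dump({"step": step, "params": jax.device_get(params)}, f)

def latest_checkpoint():
    """Return the most recent saved state, or None if starting fresh."""
    if not os.path.isdir(CKPT_DIR):
        return None
    steps = sorted(int(f.split("_")[1].split(".")[0]) for f in os.listdir(CKPT_DIR))
    if not steps:
        return None
    with open(os.path.join(CKPT_DIR, f"step_{steps[-1]}.pkl"), "rb") as f:
        return pickle.load(f)

# On (re)start, resume from the newest checkpoint instead of step 0.
state = latest_checkpoint()
step = state["step"] if state else 0
params = jnp.asarray(state["params"]) if state else jnp.zeros(1024)

while step < 1_000:
    params = params - 1e-3        # stand-in for one synchronized training step
    step += 1
    if step % SAVE_EVERY == 0:
        save_checkpoint(step, params)
```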
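As a rough way to see why delivered, at-scale performance per watt differs from peak per-chip performance, the hypothetical helper below folds utilization, availability and non-compute power (cooling, power delivery) into a single figure of merit; the numbers in the usage line are made up purely for illustration.

```python
def delivered_perf_per_watt(peak_flops_per_chip, chips, utilization,
                            availability, chip_watts, overhead_watts_per_chip):
    """Sustained system FLOP/s per watt, counting non-compute power overheads."""
    sustained_flops = peak_flops_per_chip * chips * utilization * availability
    total_watts = chips * (chip_watts + overhead_watts_per_chip)
    return sustained_flops / total_watts

# Made-up illustrative numbers: 1 PFLOP/s chips at 60% utilization and 99%
# availability, 700 W per chip plus 300 W of cooling/power-delivery overhead.
print(delivered_perf_per_watt(1e15, 1000, 0.6, 0.99, 700, 300))  # ~5.9e11 FLOP/s per watt
```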