Nvidia announced a new GPU called the Rubin CPX, designed for context windows larger than 1 million tokens. Part of the chip giant’s forthcoming Rubin series, the CPX is optimized for processing long sequences of context and is meant to be deployed as part of a broader “disaggregated inference” infrastructure. For users, the result should be better performance on long-context tasks like video generation and software development. Nvidia’s relentless development cycle has generated enormous profits for the company, which brought in $41.1 billion in data center sales in its most recent quarter. The Rubin CPX is slated to be available at the end of 2026.

Inference consists of two distinct phases, each placing fundamentally different demands on infrastructure. The context phase (often called prefill) is compute-bound: it requires high-throughput processing to ingest and analyze large volumes of input data and produce the first output token. The generation phase (decode), by contrast, is bound by memory bandwidth: it relies on fast memory transfers and high-speed interconnects, such as NVLink, to sustain token-by-token output. Disaggregated inference lets these phases run independently, so compute and memory resources can be optimized for each. This architectural shift improves throughput, reduces latency, and raises overall resource utilization.
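The split between the two phases can be sketched in a toy Python example. This is not Nvidia's implementation, just an illustration under simplified assumptions: `prefill` stands in for the compute-bound context phase (one parallel pass over the whole prompt that builds a key/value cache and yields the first token), while `decode` stands in for the bandwidth-bound generation phase (one token at a time, each step re-reading the growing cache). All function and class names here are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class KVCache:
    # Stand-in for the key/value cache handed from prefill to decode workers
    tokens: list = field(default_factory=list)

def prefill(prompt_tokens, cache):
    """Context phase: compute-bound. The full prompt is processed in one
    batched pass to populate the KV cache and emit the first output token."""
    cache.tokens.extend(prompt_tokens)   # attention over all inputs at once
    return max(prompt_tokens) % 100      # toy "first token"

def decode(first_token, cache, n_tokens):
    """Generation phase: memory-bandwidth-bound. Tokens are produced one at
    a time, and every step touches the entire (growing) KV cache."""
    out = [first_token]
    for _ in range(n_tokens - 1):
        nxt = (sum(cache.tokens) + out[-1]) % 100  # re-reads the whole cache
        cache.tokens.append(nxt)
        out.append(nxt)
    return out

# In a disaggregated deployment, prefill and decode would run on separate
# hardware pools; the KV cache is the hand-off between them.
cache = KVCache()
first = prefill(list(range(1, 9)), cache)
print(decode(first, cache, 4))  # → [8, 44, 24, 28]
```

The point of the sketch is the shape of the work, not the arithmetic: `prefill` does a lot of computation on a large batch of input, while `decode` does little computation per step but repeatedly traverses a cache that grows with context length, which is why the two phases benefit from different hardware.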