A private equity firm just built a cloud that no chipmaker would have designed on its own. On June 3, Vista Equity Partners and Cambium Capital flipped the switch on Vector Core Compute, a commercial inference service that splits a single language model query across three rival silicon architectures at once. Intel CPUs handle orchestration, Nvidia Blackwell GPUs run the prefill, and SambaNova RDUs run the decode. The pitch is blunt: no single chip wins every part of inference, so stop pretending one does.
What Actually Happened
Vector Core Compute, branded VC2, launched at Computex in Taipei as what its backers call the world's first commercially available enterprise inference cloud built for disaggregated inference. The architecture is the headline. Instead of routing an entire request through one accelerator, VC2 breaks each query into its natural phases and sends each phase to the hardware that runs it cheapest. Intel Xeon 6 processors handle orchestration and execution, Nvidia Blackwell GPUs handle the compute-heavy prefill stage, and SambaNova SN40 RDUs handle the memory-bound decode stage that generates tokens one at a time.
The money behind it is concrete. The launch rests on a $3.5 billion compute commitment to SambaNova, plus support from Intel, with Vista and Cambium as the financial sponsors. The first data center is live in Los Angeles, and VC2 has named expansion sites in Chicago, Seattle, and Phoenix, with a stated plan to reach more than 50 US metros. This is not a research demo. It is a capitalized, multi-region buildout aimed at enterprises that want inference capacity without buying their own racks.
The first customer is already named, and it is itself a cloud. Together.ai is running production workloads on VC2's agentic cloud, and the companies claim it delivered the fastest enterprise inference on the MiniMax 2.5 model of any architecture measured to date. That a sophisticated inference provider like Together.ai would rent capacity from a brand-new disaggregated cloud, rather than scale its own GPU fleet, is the single most revealing detail in the announcement. It is a vote that the disaggregation thesis is real and not just a slide.
Why This Matters More Than People Think
The technical claim under VC2 is that the two halves of language model inference want opposite hardware. Prefill, where the model reads the prompt, is compute-bound and rewards raw matrix throughput, which is what a Blackwell GPU sells. Decode, where the model emits each new token, is memory-bandwidth bound and spends most of its time waiting on memory, which is where SambaNova's reconfigurable dataflow architecture claims an edge. Forcing both phases onto the same expensive GPU means you overpay for decode and starve prefill. Splitting them is an arbitrage on the mismatch.
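The compute-bound versus memory-bound split can be made concrete with a rough roofline-style estimate. The sketch below is a simplification under stated assumptions, not a description of VC2's hardware: it assumes a dense model that does about 2 FLOPs per parameter per token and reads each weight once per forward pass, it ignores KV-cache reads and attention FLOPs, and the `Accelerator` figures are hypothetical placeholders rather than vendor specifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    """Hypothetical chip; the figures used below are placeholders, not vendor specs."""
    peak_flops: float      # peak compute, FLOP/s
    mem_bandwidth: float   # memory bandwidth, bytes/s

    @property
    def ridge_point(self) -> float:
        """FLOPs per byte above which the chip is limited by compute, not memory."""
        return self.peak_flops / self.mem_bandwidth


def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """Approximate FLOPs per byte of weights read in one forward pass.

    Assumes ~2 FLOPs per parameter per token and one read of each weight per
    pass, so the parameter count cancels out of the ratio.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def limiting_resource(chip: Accelerator, tokens_per_pass: int) -> str:
    """Return "compute" or "memory" depending on which side of the ridge we land."""
    if arithmetic_intensity(tokens_per_pass) >= chip.ridge_point:
        return "compute"
    return "memory"


# Assumed chip: 1e15 FLOP/s and 4e12 bytes/s, i.e. a ridge point of 250 FLOPs/byte.
chip = Accelerator(peak_flops=1e15, mem_bandwidth=4e12)
prefill_limit = limiting_resource(chip, tokens_per_pass=4096)  # whole prompt in one pass
decode_limit = limiting_resource(chip, tokens_per_pass=1)      # one new token per pass
```

Under these assumptions a 4,096-token prefill sits far above the ridge point while single-token decode sits far below it, which is the mismatch the disaggregation pitch relies on. Batching many decode requests together raises the decode intensity, so real deployments land somewhere between the two extremes.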
If that arbitrage holds, it reframes the entire inference cost conversation that has dominated 2026. Companies have spent the year obsessing over price per million tokens, and every neocloud has pitched a cheaper GPU hour. VC2 is arguing the savings live in the architecture, not the discount. A cloud that runs decode on RDUs and prefill on GPUs could in principle undercut a pure-GPU provider on the same model without a price war, because its underlying cost structure is different rather than its margin thinner. That is a more durable advantage than racing rivals to the bottom.
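To see why a different cost structure is not the same thing as a thinner margin, consider a toy per-request cost model. Every number here is a hypothetical placeholder, not a VC2 or vendor price, and the sketch assumes decode takes the same wall-clock time on either tier and ignores the cost of moving state between tiers.

```python
def cost_per_request(prefill_s: float, decode_s: float,
                     prefill_price_per_s: float, decode_price_per_s: float) -> float:
    """Dollar cost of one request when each phase is billed at its own tier's rate."""
    return prefill_s * prefill_price_per_s + decode_s * decode_price_per_s


# Hypothetical workload and prices, chosen only to illustrate the shape of the argument.
PREFILL_S = 0.2             # seconds of accelerator time spent reading the prompt
DECODE_S = 4.0              # seconds of accelerator time spent generating tokens
GPU_PRICE = 0.002           # dollars per accelerator-second on the GPU tier
DECODE_TIER_PRICE = 0.001   # dollars per accelerator-second on a cheaper decode tier

# Homogeneous cloud: both phases are billed at the GPU rate.
homogeneous = cost_per_request(PREFILL_S, DECODE_S, GPU_PRICE, GPU_PRICE)

# Disaggregated cloud: prefill stays on the GPU, decode moves to the cheaper tier.
disaggregated = cost_per_request(PREFILL_S, DECODE_S, GPU_PRICE, DECODE_TIER_PRICE)

savings_fraction = 1.0 - disaggregated / homogeneous
```

Because decode dominates the accelerator time in this example, halving only the decode rate removes almost half the bill without touching the GPU price at all. Whether real decode-tier pricing looks anything like this is exactly what the launch has not yet disclosed.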
There is also a structural signal in who built this. Vista is a buyout shop, not a hyperscaler and not a chip startup. A private equity firm underwriting a $3.5 billion silicon commitment and standing up data centers across four cities means compute is now being financed like infrastructure, the way toll roads and fiber once were. The people who once bought enterprise software companies are now buying the picks and shovels of the AI economy, and they are willing to mix vendors in ways the vendors themselves never would.
The timing sharpens the point. Through 2026, enterprises moved from experimenting with AI to running agents in production, and agentic workloads are decode-heavy by nature. An agent that plans, calls tools, reflects, and retries generates far more output tokens per request than a simple chatbot answer, which means the decode stage that VC2 offloads to SambaNova is exactly the part of the bill that is growing fastest. A cloud optimized for cheap decode arrives precisely as decode becomes the dominant cost, which is either shrewd timing or a lucky coincidence the founders will happily take credit for.
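A back-of-the-envelope calculation shows how output length shifts the bill toward decode. The throughput figures are hypothetical, and the sketch deliberately ignores the extra prefill an agent incurs when it re-reads tool results, so treat it as an illustration of the direction of the effect rather than its size.

```python
def decode_share(prompt_tokens: int, output_tokens: int,
                 prefill_tok_per_s: float, decode_tok_per_s: float) -> float:
    """Fraction of a request's accelerator time spent in the decode phase."""
    prefill_s = prompt_tokens / prefill_tok_per_s
    decode_s = output_tokens / decode_tok_per_s
    return decode_s / (prefill_s + decode_s)


# Hypothetical throughputs: prefill processes tokens in parallel, decode emits them serially.
PREFILL_TOK_PER_S = 10_000.0
DECODE_TOK_PER_S = 100.0

# Same 1,000-token prompt, but the agent emits ten times as many output tokens.
chatbot_share = decode_share(1_000, 100, PREFILL_TOK_PER_S, DECODE_TOK_PER_S)
agent_share = decode_share(1_000, 1_000, PREFILL_TOK_PER_S, DECODE_TOK_PER_S)
```

With these assumed rates decode already dominates the short chatbot answer, and the agent-style request pushes the share closer still to 100 percent.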
The Competitive Landscape
The neocloud field VC2 is entering is already crowded and well funded. CoreWeave, Lambda, Crusoe, Nebius, Baseten, Fireworks, DeepInfra, and Together.ai, which is paradoxically VC2's first customer, have all raised large rounds to rent GPU capacity for inference. Almost every one of them is built on a homogeneous Nvidia fleet. VC2's differentiation is precisely that it refuses to be homogeneous, which means it is competing less on who has more Blackwell GPUs than on whether a heterogeneous stack can beat a clean one on cost and latency.
The chip rivalry underneath is the more interesting fight. Nvidia's own Dynamo software already does prefill and decode disaggregation across GPUs, and Nvidia would much prefer customers solve the problem by buying more Nvidia parts. SambaNova, Groq, and Cerebras have spent years arguing that GPUs are the wrong tool for token generation. VC2 is the first commercial venue where those competing claims get settled with real enterprise traffic rather than vendor benchmarks, and with a $3.5 billion compute commitment riding on its silicon, SambaNova has the most to gain or lose from the verdict.
The historical parallel is the unbundling of storage from compute a decade ago. For years servers shipped with local disks, and the prevailing wisdom said keeping storage close to compute was the only sane design. Then disaggregated storage arrays and object stores proved that separating the two and connecting them over fast networks was cheaper and more flexible at scale. Disaggregated inference is making the identical bet one layer up, wagering that the network cost of moving intermediate state between chips is smaller than the cost of running everything on the wrong silicon.
There is a supply and procurement angle too. Concentrating an entire national inference build on one vendor's GPUs leaves buyers exposed to allocation shortages and pricing power, a lesson every enterprise learned during the 2024 and 2025 GPU droughts. A cloud that can route decode to SambaNova silicon when Blackwell supply tightens, or shift prefill load as availability swings, sells resilience as much as speed. For a CIO who spent two years unable to get GPUs at any price, multi-sourcing the inference stack is a procurement strategy before it is ever a performance one.
Hidden Insight: The KV Cache Is the Whole Ballgame
The detail buried in the disaggregation pitch, and the one that decides whether VC2 wins or stalls, is the key-value cache. When prefill finishes on the GPU, it produces a large block of intermediate state that decode needs to keep generating tokens. In a single-chip design that state never moves. In VC2's design it has to travel from the Nvidia prefill tier to the SambaNova decode tier over the network, every request, at scale. The entire economic case depends on that transfer being faster and cheaper than the GPU time it saves.
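The size of that transfer is easy to estimate. The sketch below uses the standard accounting for a transformer KV cache (one key tensor and one value tensor per layer) and a simple latency-plus-bandwidth model for the link. The model shape and link speed in the example are hypothetical placeholders; they are not the published configuration of MiniMax 2.5 or of VC2's interconnect.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache for one sequence: keys and values for every layer and token."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem


def transfer_seconds(num_bytes: int, link_bytes_per_s: float,
                     link_latency_s: float = 0.0) -> float:
    """Time to move the cache across a link, modeled as fixed latency plus size/bandwidth."""
    return link_latency_s + num_bytes / link_bytes_per_s


def handoff_pays_off(transfer_s: float, gpu_seconds_saved: float) -> bool:
    """The break-even test from the text: moving state must cost less time than it frees."""
    return transfer_s < gpu_seconds_saved


# Hypothetical long-context request: 60 layers, 8 KV heads of width 128, 32k tokens, fp16.
cache = kv_cache_bytes(layers=60, kv_heads=8, head_dim=128, seq_len=32_768)

# Hypothetical 400 Gb/s link (50e9 bytes/s) with 10 microseconds of fixed latency.
handoff_s = transfer_seconds(cache, link_bytes_per_s=50e9, link_latency_s=10e-6)
```

The cache grows linearly with context length, so a long agentic prompt moves gigabytes per request in this example. Techniques such as overlapping the transfer with the tail of prefill, or compressing the cache, would change the arithmetic; none of them are modeled here.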
This is why the Together.ai benchmark on MiniMax 2.5 matters more than the marketing language around it. A model like MiniMax 2.5, with long context and heavy agentic use, produces a large KV cache and long decode sequences, which is exactly the workload where moving state across tiers either pays off enormously or collapses under transfer latency. By leading with that specific model and that specific customer, VC2 is signaling it has the hard case working, not just a toy demo on a short prompt where the cache is small enough to hide the problem.
The deeper implication challenges an assumption that has anchored most AI infrastructure investment of the past two years: that the winner of inference would be whoever controlled the best single accelerator. VC2's wager is the opposite, that the winner will be whoever orchestrates the best combination, and that orchestration software and high-speed interconnect, not the chips themselves, become the defensible layer. If that is right, the enormous sums flowing into single-vendor GPU fleets are buying a commodity, and the margin migrates to the routing layer on top.
The bear case, however, is straightforward and serious. Multi-vendor disaggregation adds an integration tax that a homogeneous cloud never pays. Three vendors' drivers, three firmware update cycles, three support contracts, and a custom interconnect all have to stay in lockstep, and any one of them slipping degrades the whole pipeline. Critics argue that Nvidia can simply absorb the disaggregation advantage into its own Dynamo software and NVLink fabric, delivering most of the benefit inside a single ecosystem without the operational fragility of stitching three competitors together. If Nvidia does that, VC2's complexity becomes a liability rather than a moat.
Consider what this does to the software layer. The hard part of disaggregated inference is not owning three kinds of chips; it is the scheduler that decides, per request and per millisecond, where each phase runs and how to move state between tiers without stalling. That scheduler is the actual product, and it is software, not silicon. If VC2's orchestration proves durable, the company's moat looks less like a data center and more like a compiler, the kind of accumulating systems advantage that is hard to copy even when every underlying chip is available to competitors on the open market.
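A toy version of such a scheduler helps show what "per request" placement means. This is a minimal sketch under loud assumptions: the `Worker`, `Request`, and `PhaseScheduler` names are hypothetical, the policy is a plain least-loaded choice within each tier, and nothing here reflects how VC2's actual orchestration works. A production scheduler would also account for KV-cache transfer time, failures, and queueing delay.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    """One accelerator in a tier; queued_tokens is a crude proxy for its backlog."""
    name: str
    tier: str               # "prefill" or "decode"
    queued_tokens: int = 0


@dataclass(frozen=True)
class Request:
    request_id: str
    prompt_tokens: int
    max_new_tokens: int


class PhaseScheduler:
    """Assigns each phase of a request to the least-loaded worker in the matching tier."""

    def __init__(self, workers):
        self.workers = list(workers)

    def _least_loaded(self, tier: str) -> Worker:
        candidates = [w for w in self.workers if w.tier == tier]
        if not candidates:
            raise ValueError(f"no workers available in tier {tier!r}")
        # min() returns the first worker on ties, so placement is deterministic.
        return min(candidates, key=lambda w: w.queued_tokens)

    def place(self, req: Request) -> dict:
        prefill = self._least_loaded("prefill")
        prefill.queued_tokens += req.prompt_tokens
        decode = self._least_loaded("decode")
        decode.queued_tokens += req.max_new_tokens
        return {"request": req.request_id, "prefill": prefill.name, "decode": decode.name}


scheduler = PhaseScheduler([
    Worker("gpu-0", "prefill"), Worker("gpu-1", "prefill"),
    Worker("rdu-0", "decode"), Worker("rdu-1", "decode"),
])
first = scheduler.place(Request("r1", prompt_tokens=1000, max_new_tokens=200))
second = scheduler.place(Request("r2", prompt_tokens=500, max_new_tokens=100))
```

Even this toy makes the point in the text: the chips are interchangeable entries in a list, and all of the interesting behavior lives in the placement policy.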
The counterargument from the GPU camp is that the gap VC2 exploits is temporary. Every generation of Nvidia hardware narrows the prefill-decode mismatch, and disaggregation inside a single NVLink domain already captures much of the theoretical gain without crossing vendor boundaries. Skeptics point out that betting a multibillion-dollar buildout on a hardware inefficiency invites the obvious risk that the next chip closes it. VC2's defenders counter that the mismatch is structural, rooted in the math of attention and memory bandwidth rather than any one product cycle, and that the gap has widened, not narrowed, as context windows have grown.
What to Watch Next
In the next 30 days, the number to demand is the actual Together.ai throughput and cost figure on MiniMax 2.5 against a clean Blackwell baseline. The launch claims the fastest enterprise inference to date but has not published the price-per-token delta that would prove the architecture pays for itself. Watch also for whether any independent benchmark, rather than a partner-supplied one, reproduces the result, because a multi-vendor pipeline is easy to tune for a single favorable demo and much harder to keep fast across diverse traffic.
Over the next 90 days, the tells are the Chicago, Seattle, and Phoenix buildouts moving from announced to operational, and a second named customer that is not itself an inference vendor. Together.ai is a sophisticated buyer that can exploit a complex stack, but VC2's real market is ordinary enterprises that want capacity without managing silicon. A named bank, retailer, or software company running production agents on VC2 would prove the cloud serves buyers who cannot tune the pipeline themselves. The absence of one by autumn would suggest the complexity is gating adoption.
On the 180-day horizon, the decisive question is margin. If VC2 can publish a gross margin that beats single-vendor neoclouds on the same models, disaggregation becomes the new default and Nvidia's rivals get a commercial lifeline. If it cannot, the $3.5 billion SambaNova commitment turns into stranded capital and the market concludes that one fast chip beats three coordinated ones. Either way, this is the first large-scale commercial test of whether the future of inference is bundled or unbundled, and the answer reshapes where AI infrastructure money flows next.
One more variable sits underneath all of it: power. A buildout across Los Angeles, Chicago, Seattle, Phoenix, and eventually more than 50 metros runs straight into the same grid constraints throttling every data center in 2026. A heterogeneous fleet of CPUs, GPUs, and RDUs carries a different power and cooling profile than a uniform GPU hall, and whether that mix lands above or below the energy cost of a pure-GPU rival will quietly decide more of VC2's economics than any single benchmark. Watch the utility agreements in those four named metros as closely as the throughput charts, because cheap silicon on an expensive grid is not cheap inference.
Vector Core Compute is betting billions that the winner of AI inference will not be whoever owns the best chip, but whoever orchestrates the best combination of rival chips.
Key Takeaways
- $3.5 billion SambaNova compute commitment underwrites VC2, the first commercial cloud built for disaggregated inference, launched June 3 at Computex
- Three rival architectures in one query: Intel Xeon 6 orchestrates, Nvidia Blackwell runs prefill, SambaNova SN40 RDUs run decode
- Together.ai is the first customer and claims the fastest enterprise inference yet on the MiniMax 2.5 model
- 50+ US metros targeted, starting with Los Angeles and expanding to Chicago, Seattle, and Phoenix
- The KV cache transfer between prefill and decode tiers is the make-or-break cost, and Nvidia's Dynamo software is the obvious counterattack
Questions Worth Asking
- If disaggregated inference wins, does the value in AI compute migrate from the chip to the orchestration layer, and who owns that layer?
- Can a private equity firm operate a three-vendor silicon pipeline as reliably as a hyperscaler runs a homogeneous fleet?
- If your business depends on inference cost, are you buying the cheapest GPU hour, or are you buying the wrong architecture for half your workload?