Big Tech

AMD's MI350X Takes the Inference Crown From NVIDIA

AMD's MI350X GPU pairs with Fireworks AI for open inference APIs on 288GB HBM3E hardware, challenging NVIDIA's grip on AI compute infrastructure.


Key Takeaways

  • MI350X delivers 2.6x H100 inference throughput on Llama 3.1 405B, with 288GB HBM3E memory and 8TB/s bandwidth, backed by MLPerf Inference 6.0 results showing up to 2.05x the FP6 performance of NVIDIA's B200.
  • 25 million free API tokens at launch via AMD AI Developer Program lower the barrier for developers to test AMD-powered inference without infrastructure commitment or CUDA porting cost.
  • Together AI, Fireworks AI, and Anyscale have adopted MI350X for production serving, providing three credible independent validations that the chip's inference economics hold in real-world environments.
  • Inference token demand is growing 20 times year over year, making inference compute the highest-growth segment of the AI hardware market and where AMD's performance-per-dollar advantage is most consequential.
  • The Fireworks AI API partnership routes around AMD's ROCm gap by offering OpenAI-compatible endpoints that abstract away the hardware layer, making AMD accessible to developers who cannot afford to port production CUDA codebases.

AMD just posted its strongest inference benchmark numbers ever, and the companies actually buying the hardware are not waiting for NVIDIA's permission to switch. Together AI, Fireworks AI, and Anyscale have all publicly moved production serving workloads to the AMD Instinct MI350X, citing 2.6 times the inference throughput of an H100 on Llama 3.1 405B at scale. Inference, unlike training, is where the money flows every single day: billions of requests per hour, across every AI product in production worldwide. AMD just planted its flag there.

What Actually Happened

At AMD AI DevDay 2026 in San Francisco, AMD unveiled its AI Endpoint APIs in partnership with Fireworks AI, running on the Instinct MI350X GPU. The offering gives developers OpenAI-compatible API access to production-ready LLM inference on AMD hardware, with no infrastructure to manage. Developers who join the AMD AI Developer Program receive 25 million free API tokens at launch for immediate experimentation with production-grade models. The goal is explicit: remove the activation energy so developers can try AMD-powered inference before making any procurement or migration decision.

The MI350X itself is built on AMD's 4th Gen CDNA architecture with 288GB HBM3E memory and 8TB/s bandwidth, alongside expanded MXFP6 and MXFP4 datatype support designed to minimize precision loss on inference workloads. AMD's MLPerf Inference 6.0 results, released alongside the DevDay announcements, show the MI350X delivering up to 2.05 times the FP6 performance of NVIDIA's B200 on select benchmarks. The MI355X platform, the higher-spec variant, showed 4.2 times better throughput than the previous-generation MI300X in AI agent and chatbot workloads, and 2.9 times better performance in content generation tasks. These are not synthetic benchmarks from controlled lab conditions. They are production inference measurements on the exact workloads that AI companies run every day to serve paying customers.
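
A quick back-of-envelope calculation shows why the 288GB capacity and the low-precision formats reinforce each other. The sketch below counts weight memory only, ignoring KV cache, activations, and block-scaling metadata, and uses idealized bits per parameter, so treat it as directional rather than a capacity plan:

```python
# Back-of-envelope: weight-memory footprint of Llama 3.1 405B at the
# precision formats the MI350X targets. Weights only -- KV cache,
# activations, and block-scaling metadata are ignored, so the numbers
# are directional, not a capacity plan.

PARAMS = 405e9          # Llama 3.1 405B parameter count
HBM_PER_GPU_GB = 288    # MI350X HBM3E capacity

for fmt, bits in [("FP16", 16), ("FP8", 8), ("MXFP6", 6), ("MXFP4", 4)]:
    weights_gb = PARAMS * bits / 8 / 1e9
    gpus_needed = -(-weights_gb // HBM_PER_GPU_GB)  # ceiling division
    print(f"{fmt:>5}: {weights_gb:7,.0f} GB of weights -> "
          f"at least {gpus_needed:.0f}x MI350X")
```

At MXFP4, the weights of a 405B-parameter model fit on a single MI350X; even MXFP6 needs only two GPUs before KV cache enters the picture. That is the memory-per-GPU story AMD is selling.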

Fireworks AI's role in this announcement is structural, not cosmetic. The partnership means AMD is not asking developers to understand ROCm, to port CUDA kernels, or to manage GPU cluster provisioning. Fireworks provides the full inference stack, AMD provides the hardware, and developers get a clean REST API that is drop-in compatible with any OpenAI SDK. The commercial launch is backed by the AMD AI Developer Program, which coordinates access to compute credits, technical documentation, and engineering support for teams evaluating production migration.
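
What "drop-in compatible with any OpenAI SDK" means in practice: only the base URL, the key, and the model identifier change. Here is a minimal sketch using the standard OpenAI Python client; the endpoint URL and model name below are illustrative placeholders, not confirmed values from AMD or Fireworks:

```python
# Minimal sketch: calling an OpenAI-compatible inference endpoint with the
# stock OpenAI Python SDK. Only base_url, api_key, and model change.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-amd-endpoint.com/v1",  # placeholder URL
    api_key="YOUR_AMD_AI_DEVELOPER_PROGRAM_KEY",
)

response = client.chat.completions.create(
    model="llama-3.1-405b-instruct",  # placeholder model identifier
    messages=[{"role": "user", "content": "Explain cost-per-token in one line."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```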


Why This Matters More Than People Think

The AI compute conversation has been dominated by training hardware since 2020: which GPU trains GPT-N fastest, which cluster assembles 100,000 H100s most efficiently, which hyperscaler wins the next major foundation model contract. But inference token demand is growing 20 times year over year in 2026, and inference, not training, is where AI company revenue actually materializes. Training a frontier model happens a handful of times per year. Inference runs billions of times per day across every product that touches a user. The unit economics of inference determine whether AI-native businesses are profitable, breakeven, or subsidizing their customers at scale.

This is why the MI350X announcement matters well beyond hardware specs. If AMD delivers 2.6 times the inference throughput of an H100 at comparable or lower acquisition cost, the cost-per-token economics flip decisively in AMD's favor for inference-heavy workloads. Every inference startup, every enterprise running AI at scale, and every hyperscaler serving billions of API calls daily has an immediate financial incentive to evaluate AMD. The fact that Together AI, Fireworks AI, and Anyscale have already moved production workloads to MI350X tells you this evaluation has happened at the companies most sensitive to inference unit economics, and AMD won. These are not press-release adopters. They are infrastructure companies whose profitability depends on getting cost-per-token right.
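
To make that flip concrete, here is an illustrative cost-per-token calculation. The 2.6x multiple is AMD's claim; the hourly rates are placeholder assumptions, deliberately set equal to isolate the throughput effect rather than quote real prices:

```python
# Illustrative cost-per-token math under assumed inputs. The throughput
# multiple comes from AMD's Llama 3.1 405B claim; the $/GPU-hour figures
# are placeholder assumptions, set equal to isolate the throughput effect.

H100_COST_PER_HR = 3.00      # assumed all-in $/GPU-hour (placeholder)
MI350X_COST_PER_HR = 3.00    # assumed price parity (placeholder)
H100_TOKENS_PER_SEC = 100    # normalized baseline throughput
THROUGHPUT_MULTIPLE = 2.6    # AMD's claimed advantage on Llama 3.1 405B

def dollars_per_million_tokens(cost_per_hr: float, tokens_per_sec: float) -> float:
    return cost_per_hr / (tokens_per_sec * 3600) * 1e6

h100 = dollars_per_million_tokens(H100_COST_PER_HR, H100_TOKENS_PER_SEC)
mi350x = dollars_per_million_tokens(
    MI350X_COST_PER_HR, H100_TOKENS_PER_SEC * THROUGHPUT_MULTIPLE)
print(f"H100:   ${h100:.2f} / 1M tokens")
print(f"MI350X: ${mi350x:.2f} / 1M tokens ({1 - mi350x / h100:.0%} cheaper)")
```

At price parity, the 2.6x throughput advantage alone works out to roughly 62 percent lower cost per token; any gap in acquisition cost widens it further.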

The Competitive Landscape

NVIDIA's position in AI compute rests on two pillars: hardware performance and the CUDA software ecosystem. The hardware performance gap has been narrowing since the MI300X, which delivered serious competition in memory bandwidth when NVIDIA's H100 was memory-constrained on the largest models. The MI350X continues that trajectory and extends AMD's advantage into inference-specific precision formats. But CUDA is a different kind of moat. NVIDIA has spent more than 15 years building libraries, frameworks, and developer tools on CUDA. The vast majority of the world's AI research code, production code, and tooling is written to run on CUDA. Switching costs are real: porting a production serving stack from CUDA to AMD's ROCm framework requires engineering investment, reliability testing, and organizational risk-taking that most teams won't absorb without a compelling financial reason.

Critics argue, however, that AMD's ROCm software ecosystem still lags CUDA in library coverage, debugging tooling, and long-tail framework support. The risk is that teams adopting MI350X for production inference discover edge cases in ROCm's compatibility layer that require workarounds, adding engineering overhead that erodes the cost-per-token savings. Several ML engineering teams at larger enterprises have publicly noted that ROCm's support for custom CUDA extensions remains inconsistent, meaning any inference stack that relies on optimized CUDA kernels for specific attention mechanisms or quantization schemes may face unexpected porting complexity. AMD's answer to this is the Fireworks AI abstraction layer, but that answer only works for teams content to treat inference as a black-box API. Teams who need hardware-level control, custom kernels, or fine-grained profiling still face the full CUDA migration challenge.
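
The seam is straightforward to probe. PyTorch's ROCm builds intentionally preserve the torch.cuda namespace, so generic model code runs unchanged and the porting risk concentrates in compiled custom extensions. Below is a hedged sketch of the defensive check a serving team might run; the extension name is hypothetical:

```python
# Probe which GPU backend PyTorch is running on, and degrade gracefully if
# a custom CUDA extension has no working ROCm build. torch.version.hip is
# None on CUDA builds and a version string on ROCm builds.
import torch

def inference_backend() -> str:
    if not torch.cuda.is_available():  # True on both CUDA and ROCm builds
        return "cpu"
    return "rocm" if torch.version.hip else "cuda"

backend = inference_backend()
print(f"Backend: {backend}")

use_custom_kernel = backend == "cuda"
if backend == "rocm":
    try:
        import my_custom_attention_ext  # hypothetical compiled extension
        use_custom_kernel = True
    except ImportError:
        # Fall back to stock kernels rather than assume the extension
        # hipified cleanly for ROCm.
        print("Custom kernel unavailable on ROCm; using stock attention")
```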

The inference API market is already fracturing along these lines. Fireworks AI has built its cost advantage on AMD hardware. Together AI operates AMD MI300X and MI350X clusters alongside NVIDIA H100s, routing workloads based on cost and latency requirements. Groq, with its Language Processing Unit architecture, competes on latency rather than throughput. Anyscale's decision to adopt MI350X for production reflects a bet that the ROCm ecosystem is mature enough for mission-critical workloads, which was not a safe bet 18 months ago. Meanwhile, NVIDIA is not standing still: its TensorRT-LLM inference optimization stack has continued to advance, and the GB200 NVL72 rack-scale system delivers multi-node inference at a scale where AMD has no current direct answer.

Hidden Insight: The Second-Supplier Problem

The structural story here is not AMD versus NVIDIA. It's what happens to AI product companies when one supplier controls the inference compute market. NVIDIA's data center GPU revenue reached $39.3 billion in Q4 fiscal 2026, with gross margins exceeding 73 percent. Those margins are not a sign of a healthy competitive market. They reflect the pricing power of a near-monopolist selling into a demand surge that no competitor could immediately satisfy. Every AI company paying NVIDIA's current inference pricing therefore has a direct stake in AMD, Groq, and custom silicon succeeding as viable alternatives.

The hyperscalers understand this dynamic better than anyone. Microsoft Azure has been building AMD GPU capacity alongside NVIDIA capacity for the past two years. Google Cloud runs AMD Instinct GPUs in its own data centers. Amazon AWS added MI300X instances in late 2024. These are not goodwill purchases toward AMD. They are strategic hedges against NVIDIA supply constraints and pricing leverage. The hyperscalers have been the primary buyers of NVIDIA's supply since 2022, and they have no interest in remaining price-takers in perpetuity. Every MI350X GPU instance they build gives them negotiating leverage in the next NVIDIA contract renewal.

The deeper implication is for open-weight model inference economics. Models like Llama 3.1 405B, DeepSeek V4, and the Qwen 3 family are being served at commercial scale by inference providers, and their cost-per-token is what determines whether open-weight models can outcompete proprietary API models for enterprise customers. If AMD delivers 2.6 times the throughput of an H100 on Llama 3.1 405B, open-weight model inference providers can undercut proprietary model APIs on both cost and performance simultaneously. That economic equation is what makes the MI350X announcement strategically consequential beyond AMD's own revenue trajectory. It's a structural enabler for the open-weight ecosystem to compete at pricing levels that closed-model providers cannot match without cutting their own margins.

What to Watch Next

The critical test arrives in the next 90 days as AMD's AI Endpoint APIs scale from launch to production volume. Watch Fireworks AI's public benchmark disclosures against its NVIDIA-backed API endpoints. If AMD-powered inference is consistently faster or cheaper on the models that production inference companies actually serve, expect migration from cost-sensitive startups to accelerate before the end of Q3 2026. The 25 million free token offer is a deliberate trial funnel. The conversion rate from trial to paid production commitments will be the first real signal that developer experience on AMD hardware has reached parity with the CUDA toolchain for the inference use case specifically.

The second indicator to watch is hyperscaler instance availability in the standard catalog. When Azure, AWS, or Google Cloud lists MI350X GPU instances as generally available rather than limited preview, it means AMD has passed enterprise-grade reliability validation at the largest scale in the world. Based on AMD's roadmap disclosures, that announcement is expected in H2 2026. When it arrives, NVIDIA's ability to command inference pricing premiums will face structural pressure for the first time. The bear case for NVIDIA is not that AMD outperforms it on raw FP8 throughput. The bear case is that AMD captures 25 to 30 percent of the inference compute market at 30 to 40 percent lower cost-per-token, compressing NVIDIA's effective pricing power across the entire inference tier even where NVIDIA still wins the performance crown.

AMD didn't beat NVIDIA's CUDA moat by building a better CUDA: it built an API compatibility layer that makes the moat invisible to the developers who are actually paying for inference compute today.



Questions Worth Asking

  1. If AMD's inference cost advantage is real at scale, why are the major hyperscalers still allocating the majority of their GPU capex to NVIDIA, and what specific commitments would signal a genuine allocation shift?
  2. The Fireworks AI API abstraction solves the CUDA porting problem for API consumers, but what happens to enterprises that need hardware-level control over their inference stack, or whose custom CUDA kernels have no ROCm equivalent?
  3. If inference compute becomes a commodity priced on cost-per-token rather than brand and ecosystem loyalty, what does that mean for AI companies whose business models assume continued NVIDIA pricing power as a margin floor?