The benchmark gap that Western AI labs built their pricing power on is closing faster than anyone predicted. Zhipu AI's GLM-4.7, released on December 22, 2025, scores 73.8% on SWE-bench Verified, the test that measures whether an AI can actually resolve real software bugs, while costing three dollars a month. GitHub Copilot costs $10 a month. Cursor Pro costs $20. The math is getting uncomfortable for U.S. AI companies.
What Actually Happened
On December 22, 2025, Beijing-based Zhipu AI released GLM-4.7, a ~400 billion parameter open-weight model with a 200,000-token context window and a 128,000-token maximum output. The model targets software engineering specifically: its training and evaluation focus on agentic coding tasks, multi-file refactors, and terminal operations. GLM-4.7 achieved 73.8% on SWE-bench Verified, a 5.8 percentage point improvement over GLM-4.6, along with 66.7% on SWE-bench Multilingual (up 12.9 points) and 84.9 on LiveCodeBench V6, both surpassing Claude Sonnet 4.5's published scores on the same evaluations.
Zhipu followed up on January 20, 2026 with GLM-4.7-Flash, a 30 billion parameter Mixture-of-Experts model with only 3 billion active parameters per token, designed to run entirely on consumer hardware. Weights for both models are openly available on Hugging Face, with day-zero support in vLLM and native integration coming in Ollama 0.14.3. For developers who want a managed API, the pricing is $3 per month. For developers who want zero recurring cost, the Flash variant runs locally on a mid-range gaming GPU. This is not a research preview. It is a production coding model at a subscription price that rounds to zero.
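For a sense of what that looks like in practice, here is a minimal sketch of querying a local vLLM deployment of the Flash variant through an OpenAI-compatible client. The Hugging Face repository id and serving command are placeholders, not confirmed names; the pattern, not the identifiers, is the point.

```python
# Minimal sketch: serve GLM-4.7-Flash locally with vLLM's OpenAI-compatible
# server, then query it like any hosted API. The repo id "zai-org/GLM-4.7-Flash"
# is a placeholder, not a confirmed name.
#
#   vllm serve zai-org/GLM-4.7-Flash
#
from openai import OpenAI

# vLLM listens on localhost:8000 by default; the key is ignored for a local
# deployment but the client still requires one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

response = client.chat.completions.create(
    model="zai-org/GLM-4.7-Flash",  # placeholder repo id
    messages=[
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": "Write a unit test for a slugify() helper."},
    ],
    temperature=0.2,
)
print(response.choices[0].message.content)
```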
Why This Matters More Than People Think
The AI coding market has quietly become one of the most lucrative software categories of the 2025-2026 cycle. Cursor raised $2 billion at a $50 billion valuation in April 2026. GitHub Copilot counts over 15 million users. Microsoft bundled Copilot into its $99/month E7 enterprise suite, monetizing the coding AI layer at scale. The implicit assumption behind all of these valuations is that benchmark leadership justifies premium pricing. GLM-4.7 attacks that assumption directly. When a model costing $3/month beats a $20/month competitor on three separate benchmarks, the premium pricing story requires a different kind of justification, and most companies charging those premiums do not have one ready.
Beyond the price, the open-weight release is the more strategically significant move. Open weights mean enterprises can fine-tune GLM-4.7 on their proprietary codebases, something impossible with closed models like Claude or GPT-4o. For industries with compliance requirements around data residency (finance, healthcare, defense), a frontier-grade coding model that runs entirely on-premise is not just cheaper; it is the only option regulators will accept. Zhipu AI did not just release a model. It opened a door that every major U.S. AI vendor has deliberately kept closed, because behind it lives a market they cannot reach: regulated enterprises that cannot send code to a third-party API under any circumstances.
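What fine-tuning on a private codebase could look like, sketched under assumptions: the weights load through standard Hugging Face transformers, adapter-style (LoRA) training keeps the compute tractable, and the repository id and target module names are illustrative placeholders rather than documented values.

```python
# Hedged sketch of LoRA fine-tuning an open-weight GLM on an internal codebase.
# Repo id, target modules, and training data are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "zai-org/GLM-4.7-Flash"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Train a small low-rank delta on top of frozen base weights; the base model
# and your proprietary code never leave your own infrastructure.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed projection names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
# From here, train with transformers.Trainer or trl's SFTTrainer on tokenized
# snippets drawn from your internal repositories.
```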
The Competitive Landscape
SWE-bench is now the single most scrutinized benchmark in AI coding, and the leaders cluster tightly near the top. As of early 2026, Claude Sonnet 4.5 and Cursor's default configurations sit around 75-77% on SWE-bench Verified. DeepSeek V4 Pro and Google Antigravity compete in the same tier. GLM-4.7 at 73.8% is close enough that benchmark scores alone no longer differentiate; enterprises choosing a coding AI have to look elsewhere, and when they do, they encounter the price. Paying 6x more for a 2-3 point SWE-bench advantage is a defensible position. Paying 66x more requires an entirely different argument, and that argument usually comes down to support, integration, and workflow: moats that erode as open-source tooling matures.
GitHub Copilot's competitive moat has always been distribution: it lives inside VSCode, where more than 70% of developers already work. Cursor's moat is workflow integration and the agent experience. GLM-4.7's moat is cost and open access. These are different markets, and they are beginning to overlap at the enterprise tier. Coding assistants started as autocomplete. They graduated to file-level agents. GLM-4.7 targets the next stage: repo-level agentic operation with preserved reasoning chains across multi-turn sessions. That is Cursor's territory, and GLM-4.7 is walking onto it for $3/month.
Hidden Insight: The Feature That Changes the Debugging Loop
Most coverage of GLM-4.7 focuses on the SWE-bench number. The more interesting innovation is buried in the technical documentation: Preserved Thinking. Conventional AI coding assistants reset their reasoning state at the end of each turn. Ask them why they wrote a function a certain way three messages ago, and they have forgotten entirely. GLM-4.7 maintains its reasoning chain across multiple turns: the model remembers not just what it produced, but why. In complex, multi-session debugging work, this difference is enormous. Developers who have used current coding AIs know the frustration of watching an agent spend five minutes reasoning through a problem, then losing all of that context on the very next message. GLM-4.7 eliminates that cycle.
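There is no public specification here for how Preserved Thinking is exposed, so treat the following as an illustration of the idea rather than the API: a debugging loop in which the assistant's reasoning is captured alongside its answers instead of evaporating between turns. The endpoint, model id, and the `reasoning_content` field (borrowed from earlier GLM APIs) are assumptions.

```python
# Illustrative only: a multi-turn debugging loop where the model's reasoning
# survives the turn. Endpoint, model id, and the `reasoning_content` field are
# assumptions, not a confirmed GLM-4.7 contract.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

history = [{"role": "user", "content": "Why does test_retry_backoff flake under load?"}]
reasoning_trace = []  # the context a stateless assistant would lose between turns

for follow_up in [
    "Given that diagnosis, why did you cap the jitter at 50 ms?",
    "Does the same race affect the websocket reconnect path?",
]:
    reply = client.chat.completions.create(model="glm-4.7", messages=history)
    msg = reply.choices[0].message
    # Some OpenAI-compatible servers surface chain of thought here (assumed field).
    reasoning_trace.append(getattr(msg, "reasoning_content", None))
    history.append({"role": "assistant", "content": msg.content})
    history.append({"role": "user", "content": follow_up})
```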
The deeper implication is architectural. Preserved Thinking is not just a usability feature; it points toward a fundamentally different approach to AI cognition in software development. Human programmers build mental models of codebases incrementally and reference them over hours or days. AI coding assistants that reset reasoning state at each turn are functionally amnesiac: they simulate intelligence without accumulating understanding. GLM-4.7's multi-turn reasoning chains are a small step toward models that actually build a working model of a codebase over time, rather than reconstructing it from scratch with each new prompt.
The tau-Bench interactive tool invocation score of 84.7, which surpasses Claude Sonnet 4.5, is the quantitative signal that Preserved Thinking is not marketing language. Interactive tool invocation requires holding context across turns to use tools effectively. When GLM-4.7 scores higher than Claude on this specific metric, it is demonstrating a capability gap that current coding benchmarks barely capture. The AI coding industry has optimized almost entirely for single-turn task completion. The next competitive frontier is multi-session coherence: the ability to maintain a working model of a problem across an entire debugging session. GLM-4.7 is already competing there, at a price point that makes the current generation of coding AI subscriptions look like a historical anomaly.
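For readers who have not stared at a tau-Bench trace, this is the shape of interaction it scores: the model asks for a tool, the harness runs it, and the result comes back on the next turn. The sketch below assumes an OpenAI-compatible function-calling interface; the endpoint, model id, and run_tests tool are hypothetical.

```python
# One round trip of interactive tool invocation: the model requests a tool call,
# the client executes it, and the result is returned on the following turn.
# Endpoint, model id, and the run_tests tool are hypothetical; the sketch also
# assumes the model actually chose to call the tool.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's test suite and return any failures.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

messages = [{"role": "user", "content": "Fix the failing tests in src/parser/."}]
reply = client.chat.completions.create(model="glm-4.7", messages=messages, tools=tools)
call = reply.choices[0].message.tool_calls[0]

# Execute the requested tool locally (stubbed here), then feed the result back.
args = json.loads(call.function.arguments)
result = {"path": args["path"], "failures": ["test_unicode_escape"]}

messages.append(reply.choices[0].message)
messages.append({"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)})
final = client.chat.completions.create(model="glm-4.7", messages=messages, tools=tools)
print(final.choices[0].message.content)
```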
What to Watch Next
Watch the Ollama integration closely. When GLM-4.7-Flash lands in Ollama 0.14.3, it triggers a specific adoption inflection point: any developer with a mid-range gaming GPU can run a 73.8% SWE-bench model on their own hardware with a single CLI command. Every enterprise that has blocked cloud AI coding tools for data residency reasons suddenly has a viable on-premise alternative. Track Ollama download numbers for GLM-4.7-Flash specifically; a significant jump after the 0.14.3 release would confirm that the compliance-driven enterprise use case is real and large enough to matter to U.S. vendors.
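Once that integration lands, local use could be as simple as the sketch below, which assumes the Ollama daemon is running and uses "glm-4.7-flash" as a guessed model tag rather than a published one.

```python
# Sketch of fully local use via the ollama Python client. The "glm-4.7-flash"
# tag is a guess at naming; assumes `ollama pull glm-4.7-flash` has already run.
import ollama

response = ollama.chat(
    model="glm-4.7-flash",  # hypothetical tag
    messages=[{
        "role": "user",
        "content": "Refactor this to be iterative:\n"
                   "def fact(n): return 1 if n == 0 else n * fact(n - 1)",
    }],
)
print(response["message"]["content"])
```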
The more important 90-day signal is whether any U.S. coding AI company cuts prices in response. Cursor has maintained $20/month Pro pricing; GitHub Copilot Individual stays at $10/month. If GLM-4.7 adoption figures become public and show material enterprise penetration, expect a pricing response within one quarter. If no U.S. company blinks, that tells you they are confident their moats (workflow integration, VSCode distribution, enterprise support contracts) are deeper than benchmark parity and price compression alone. The absence of a price cut will be as informative as a price cut itself. Also track the GLM-5 trajectory: Zhipu is iterating fast, and a follow-on release could push SWE-bench past 80% before Q4 2026, at which point the pricing argument becomes moot and the capability argument begins.
When a Chinese lab ships a frontier coding model for $3 a month and beats the $20 model on three separate benchmarks, the question is not whether the price gap is sustainable; it is whether the performance gap ever comes back.
Key Takeaways
- 73.8% on SWE-bench Verified: GLM-4.7 matches or exceeds several premium Western coding models on the benchmark that measures real software engineering ability, not just code completion speed
- $3/month managed API: 6-66x cheaper than comparable U.S. coding AI subscriptions, with open weights available free for local deployment on consumer hardware
- Preserved Thinking architecture: GLM-4.7 maintains multi-turn reasoning chains across sessions, directly addressing the core limitation of current AI coding assistants in complex debugging work
- 84.7 on tau-Bench interactive tool use: surpasses Claude Sonnet 4.5 on the metric that agentic, multi-step software engineering actually requires day-to-day
- GLM-4.7-Flash, a 30B-A3B MoE for consumer hardware: only 3 billion active parameters per token enable fully on-premise deployment, opening frontier-grade coding AI to regulated industries for the first time
Questions Worth Asking
- If a $3/month open-weight model closes to within 3 SWE-bench points of Claude Sonnet 4.5, what exactly are enterprises paying $200/month for, and is that differentiation sustainable as Chinese labs iterate on a quarterly release cadence?
- Preserved Thinking changes the human-AI debugging loop in ways that current benchmarks do not capture: what does AI coding tooling look like in 18 months when multi-session coherence becomes table stakes rather than a differentiator?
- For your team's most sensitive codebase, the one you would never route through a cloud API, what is your on-premise AI coding strategy, and does it currently assume access to frontier-grade performance?