When Code Becomes a Commodity: The Mathematician's Edge
In February 2026, a lawyer won the Anthropic Hackathon. A cardiologist took third with an AI care platform built in seven days. Thirteen thousand applicants competed; neither winner wrote code. Over the same window, the US Bureau of Labor Statistics reports programmer roles down 27.5% from 2023 to 2025, while Levels.fyi tracks the new “AI Engineer” specialty paying an 18.7% staff premium and named by LinkedIn the role growing fastest for the second year running.
That raises the question: if code is becoming a commodity, what stays scarce?
This piece works through three sources of evidence: AI capability benchmarks in mathematics, US labor market data on engineering roles, and a 2026 paper by Klowden and Tao on the structure of human work alongside AI. The numbers reflect an April 2026 snapshot. The argument is structural, not predictive: the boundary where AI stops working has a shape, and that shape rewards a specific kind of human judgment.
Based on a talk delivered at the Mathematics Department of UFSC on 24 April 2026 to math PhD and master’s students. Slides: English (PDF) · Português (PDF).
Key terms
- IMO (International Mathematical Olympiad): annual high school competition; standard reference benchmark for AI mathematical reasoning.
- USAMO (USA Mathematical Olympiad): the US national qualifier for the IMO; proof-based problems graded by human judges.
- FrontierMath Tier 4 (Epoch AI benchmark): research grade mathematics problems designed to remain unsaturated by current models.
- Lean 4 / Mathlib: a formal proof language and its core library; Mathlib carries roughly 1.9 million lines of verified mathematics maintained by 500-plus contributors.
- Smell test (Klowden & Tao §4.2): the skepticism mathematicians apply before reading a proof line by line, used to detect bad arguments early.
- Odorless proof (K&T §4.6 n16): a formally correct output that lacks the causal narrative humans use to judge mathematical work.
- Blue team / red team (K&T §6.2): generation versus verification roles, borrowed from cybersecurity. K&T argue AI is relatively safe on the red team, unsafe on the blue.
- Software 3.0 (Karpathy, June 2025): software produced by description in natural language, distinct from code humans write (1.0) or models humans train (2.0).
Why this time is different
Past automations affected the periphery of intellectual work: the printing press, the typewriter, LaTeX, automated dissemination. They changed how ideas were copied, transmitted, and rendered. The work of generating ideas, of deciding what to write, stayed human. Klowden and Tao open their 2026 paper with this distinction. Modern AI is the first general purpose technology that automates the creation step itself, not just the dissemination around it.
The point matters for what follows. If a tool only automates dissemination, the question “what is scarce” stays trivially answered: the production of new ideas. Once the tool starts producing outputs that look like ideas (proofs, drafts, code, diagrams), the question becomes structural. Rhetoric will not answer it. The honest place to look is where the tool stops working.
Criterion and difficulty: the gap has two sources
By April 2026, leading models saturate above 99% on competition mathematics when only the final answer is graded. They lose ground along two axes: when the criterion shifts to graded proofs, and when the difficulty rises to research level.
| Benchmark | Type | Criterion graded | Score (April 2026) |
|---|---|---|---|
| GSM-8K | Grade school | final answer | ~100% |
| AIME 2025 | US olympiad qualifier | final answer (integer) | 100% with code; 99.8% without |
| MATH-500 | High school + early college | final answer | 95-99% (Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, DeepSeek-R1) |
| FrontierMath Tier 4 | Research grade | final answer | ~38% (GPT-5.4 Pro, March 2026; up from 19% five months earlier) |
| IMO 2025 | International Olympiad | full proof, IMO judges | 35/42 ≈ 83% (Gemini Deep Think) |
| Open Proof Corpus | Olympiad proofs | full proof, human-graded | ~43% (5,062 proofs, 6 LLMs, 13 judges) |
| FormalProofBench (Lean 4) | Graduate proofs | formal proof in Lean | ~33.5% (frontier model) |
What the table shows: two factors push the scores down at once.
On the same problem set, changing only the criterion (from “final answer correct” to “full proof correct”) is enough to drop the score. Over the same 5,062 problems, the Open Proof Corpus measured drops of 8 percentage points (Gemini-2.5-Pro) to 30 points (o3), depending on the model.
Holding the final-answer criterion fixed, difficulty weighs even more. AIME 2025 (olympiad qualifier) clears 99%; FrontierMath Tier 4, with the same criterion but research-grade problems, sits at ~38%. That is roughly 60 percentage points coming from the difficulty jump alone.
Crossing the two axes, Open Proof Corpus lands at 43% (olympiad problems with human-graded proofs) and FormalProofBench at 33.5% (graduate problems with formal Lean proofs).
The difference is not measurement noise. It tracks two boundaries at once: criterion (right answer vs verifiable proof) and difficulty (solvable problem vs research problem). The rest of this piece is organized around these two axes.
IMO 2025 is the most visible crossing of the two axes: difficulty at the top of the pre-college olympiad pyramid, with proofs graded by the official judges reading the full argument. What stands out is that, even at this combination, the 35/42 ≈ 83% scored by Gemini Deep Think and the experimental OpenAI model already sits at gold medal level, and the jump to that mark was fast. In 2024, DeepMind’s AlphaProof scored 28/42 (silver medal) with formal Lean 4 proofs and several days of compute per problem. In 2025, Gemini Deep Think reached 35/42 in natural language inside the 4.5 hour exam window, certified by IMO coordinators. Twelve months later, what required formal Lean and days of compute now fits in natural language within the exam window. Even so, problem 6 went unsolved, and the gap to AIME’s 99.8% keeps the criterion as a live boundary.
Final-answer benchmarks at olympiad level (MATH-500, AIME 2025, GSM-8K) no longer differentiate the leading models. A few years ago they sat comfortably out of reach. Today they are the baseline.
News of AI progress arrives in trumpets. But seen from the side of full proofs and research problems, the music is still violins. The interesting work starts below the cliff.
What the cliff is made of
The previous section’s numbers say the cliff is there; they do not say how it forms. Recent literature on reasoning failures in frontier models points to three recurring modes:
- A single hallucinated step invalidates an entire chain of derivation.
- Working memory collapses around 20 to 30 parallel branches regardless of model scale; the ceiling does not move with parameter count.
- Apple’s research group documented “complete accuracy collapse beyond certain complexities” in 2025: reasoning effort grows for a while, then declines, as if the system gives up.
Klowden and Tao name the deepest version of this failure “odorless proofs.” AlphaProof’s IMO 2024 solutions were verified correct by Lean. They also contained redundant steps and inexplicable jumps. The proofs were formally right without being legibly right: they lacked what Thurston (1994/2006) identified as the mark of good mathematics, a causal narrative that lets a reader see why the conclusion follows, not just that it does. The system passes the verifier and fails the smell test the mathematician would apply before opening Lean.
That distinction sets up the rest of the argument. The labor market is registering the same boundary from a different angle.
The labor data, read alongside the cliff
A note on framing. This analysis reads the software engineering job market through eight years of industry experience, including machine learning consulting and quantitative finance. The mathematical material is closer to home than the early empirical material; the bias is flagged because the corporate signal cases below skew toward what the author has personally observed.
The empirical record on junior software engineering is now thick.
| Source | Population | Finding |
|---|---|---|
| Stanford Digital Economy Lab (ADP) | US developers 22-25 | -20% employment from late-2022 peak |
| Harvard / Revelio Labs | 62M workers, 285K firms | Junior employment in firms that adopt GenAI -7.7% after 6 quarters vs. controls |
| Northeastern (Lightcast) | US vacancies | -16.3% relative junior to senior posting ratio |
| Institute of Student Employers (UK) | Entry level tech | -46% over 2024 |
| NY Fed College Labor Market | CS graduates | unemployment ~6.1% |
| Levels.fyi | “AI Engineer” specialty | +18.7% staff compensation premium |
| US BLS | Software developers (aggregate) | Projected +15% growth through 2034 |
| LeadDev AI Impact Report (2025) | Engineering leaders | 54% expect fewer junior hires over the long term |
What it shows: bifurcation, not extinction. The programmer role focused on execution is shrinking; the design oriented role that collaborates with AI is expanding. These are two different jobs that happen to share one historical label.
Corporate signaling lines up with the data, though the signal is mixed with narrative. Salesforce announced zero new engineer hires for fiscal year 2026. Shopify added a policy requiring written justification that AI cannot do the work before any new headcount can be approved. Microsoft attributed roughly 6,000 layoffs in May 2025 partly to internal AI productivity, citing 20 to 30% of new code being written by AI. Amazon announced 14,000 corporate cuts in October 2025 with AI cited.
But there are signs that complicate the picture. The Federal Reserve, in a March 2026 study covering more than a million firms, found no statistically significant link between AI adoption and aggregate hiring. Klarna, the headline case for AI driven workforce reduction, cut headcount from roughly 5,500 to 3,000 while pushing AI, watched revenue per employee multiply about fourfold from $300K to $1.3M, and then partially reversed in May 2025, rehiring humans after customer-service quality dropped (no surprises there). The BLS itself projects +15% aggregate growth for the sector through 2034.
But in fact, there is little statistical support for any conclusive claim. Where AI can effectively substitute for a human, and how much of the recent shift can be attributed specifically to it, are open questions that evidence and time have not yet clarified.
A personal observation. A treasury operation the author previously managed and developed with two or three colleagues today fits in one person, with AI. More recently, he works as a solo risk scientist, also serving as the project’s lead engineer. It is the kind of observation that the macro evidence will eventually have to confirm or discard. The anecdote points in one direction: engineering capacity, which used to be the main bottleneck, has become a commodity.
Where the bottleneck moved
McKinsey’s 2025 State of AI report finds that only 39% of enterprises that adopted generative AI report measurable EBIT (earnings before interest and taxes) impact. IBM’s institute reports that one in four AI projects delivers promised returns. Faktion’s June 2025 essay diagnoses the recurring failure mode: “knowledge is scattered across documentation, legacy systems, and most critically, in the heads of a few seasoned experts.” Boris Cherny, head of Claude Code at Anthropic, on Lenny’s Podcast in February 2026: “the core problem now is knowing what the hell you actually want to build.”
Both signals converge on the same place: when engineering stops being the bottleneck, what remains is the work of knowing what to build, a role carried mainly by domain specialists.
The constraint has shifted from engineering capacity to problem framing, domain knowledge, and the ability to say what success looks like in a setting the AI cannot infer. Five niches in mathematics make that boundary concrete (qualitative selection, based on what the author observes):
- Alignment research at Anthropic and the MATS program, hiring for reward hacking analysis and work on training dynamics.
- Formal verification specialists, $70-150 per hour through Alignerr, requiring Lean 4 and Mathlib fluency.
- ML theory: a thin field with strong appetite for interpretability and generalization work, where the absence of a settled mathematical framework is itself the opportunity.
- Quantitative finance: stochastic modeling combined with machine learning, sustained demand for quants who can do both.
- Operations research: the work of formulating problems. Defining variables, constraints, and the objective function. That formulation step is exactly what the AI cannot do without a human who knows what is being optimized.
These are domains where rigor and proof cannot be automated, and where AI amplifies the mathematician without substituting for the mathematician’s judgment. The bifurcation in the labor data and the cliff in the capability data point to the same gap.
The smell test, formalized
Mathematicians have a name for the missing skill, though it does not appear in any AI lab’s deck. Scott Aaronson’s 2008 blog post Ten Signs a Claimed Mathematical Breakthrough is Wrong lists ten heuristics experienced mathematicians apply, often subconsciously, to detect bad arguments well before line by line verification:
- The authors do not use TeX.
- The authors do not understand the question.
- The approach implies something much stronger and probably false.
- The approach contradicts a known impossibility result.
- The authors quietly switch from “we prove” to “seems to work in all cases we tried” partway through.
- The paper jumps to technicalities without a new idea.
- The paper does not build on prior work.
- The paper wastes space defining standard terms.
- The paper waxes poetic about “practical consequences,” “profound philosophical implications,” etc.
- The techniques look too weak for the problem.
Klowden and Tao §4.2 call this the “smell test” and argue it is the layer of mathematical practice most resistant to current AI architectures. The argument is mechanical, not magical. LLMs are trained to produce outputs that look correct, which is exactly the failure mode the smell test was constructed to detect. Several of Aaronson’s signs (the weasel words at #5, the technicalities without idea at #6, the rhetoric about implications at #9) are patterns that the current models exhibit by construction. Detecting them takes three things:
- Deep familiarity with prior work.
- Intuition for which kinds of claims tend to be true.
- A feel for how mathematicians react to a new claim.
None of these are reliably produced by current LLMs. This explains the cliff’s location. The cliff is not arbitrary. It tracks the boundary between problems where the smell test is unnecessary (final answer arithmetic at AIME level) and problems where the smell test is the skill that holds the work up (research grade proof verification, novel claim assessment, paper refereeing). Scaling raw model capability has not closed this boundary, because current architectures were not designed to address it.
The current toolkit
The frontier of “AI in math research” is concrete, not hypothetical. November 2025: Terence Tao published a public write up describing a year of collaboration with Google DeepMind. AlphaEvolve plus Gemini Deep Think were applied to 67 open mathematical problems across combinatorics, analysis, and number theory, producing roughly 20 genuinely new results, with a subset of proofs passing formal verification in Lean 4. The workflow adopted was the following:
flowchart LR
A[AlphaEvolve
discovers candidate] --> B[Deep Think
sketches proof]
B --> C[AlphaProof
formalizes in Lean]
C --> D[Lean 4
verified]
Tao selects and evaluates the problems at the start of the loop and judges the resulting insight at the end. The middle is automated. The boundaries are not.
The same pattern at a more accessible scale: in August 2025, Ernest Ryu (Seoul National University, formerly UCLA) used GPT-5 as an assistant and improved by about 50% the best known bound on a 40-year open problem on operator splitting and first-order methods in convex optimization. The result was published as a rigorous note, not just a heuristic.
Same pattern: the human picks the problem and judges the insight; the machine works on the intermediate steps.
The infrastructure is open and documented. Lean 4 with Mathlib carries roughly 1.9 million lines of formally verified mathematics. LeanCopilot from Caltech suggests Lean tactics during proofs. DeepSeek-Prover-V2 splits problems into smaller ones (divide and conquer). Kimina-Prover from Moonshot was the first prover to clear 80% on MiniF2F. None of this requires special access.
Gated-access systems extend the ceiling: AlphaProof, AlphaEvolve, and the Claude Mythos preview in Anthropic’s Project Glasswing (which reported 97.6% on USAMO 2026, with no independently graded score published yet). Even so, the public toolchain already serves as a capable assistant for a mathematician doing research-grade work.
A second point worth mentioning, especially for those who don’t write code. Andrej Karpathy’s “Software 3.0” framing from June 2025 names a new mode: programs humans describe in natural language.
- Software 1.0: code humans write.
- Software 2.0: models humans train.
- Software 3.0: software produced by description in natural language.
Trent Field, who does not write code professionally, works at Arkance and presented at Autodesk University 2025 a working web application he built solo in 48 hours using Cursor. The lawyer, the cardiologist, and the mathematicians proving new results in collaboration with AI are not exceptions; they mark the same pattern: AI has crossed from dissemination into creation. The arrangement only works as long as the human can verify what the AI produces.
The operating rule
Klowden and Tao §6.2 propose a rule with vocabulary borrowed from cybersecurity. AI is relatively safe to use as a red team (reviewing content generated by humans for errors or suggested improvements) and unsafe to trust as a blue team (generating structural content beyond what a red team can verify). The red team can be human reviewers, formal proof assistants like Lean, or other AI systems running in a verification capacity. What the blue team produces must remain inside what the red team can check.
The rule is easy to state and easy to violate. While drafting the talk that originated this article, the author asked an AI to write a slide explaining the K&T framing. It produced “AI is potent on the blue team, the red team is a human function,” which mixes up the concepts in K&T’s thesis. The author noticed the confusion on first reading. Two further rounds of feedback to the AI failed to surface it; the wrong slide kept coming back. K&T’s thesis demonstrated itself: the smell test is the layer no current model reliably has, and it is what produced the corrected paragraph.
A second instance of the same effect surfaced after the talk. A graduate student who attended pointed out that the slides used em-dashes liberally, a tell of AI-generated content. He was not checking the math. He was applying the smell test to the medium and reading the signature on the surface.
The operating rule, then, in two sentences:
Let AI verify your work, not generate work you can’t verify.
When verifying AI work, apply your mathematical smell test to invalidate arguments; the model doesn’t have it.
Limitations
The benchmark scores in this text are a snapshot up to April 18, 2026 (GPT-5.5, released April 23, and DeepSeek V4, in preview since April 24, are therefore outside this analysis). That said, it’s important to acknowledge that such numbers go out of date quickly; expect the specific figures to be out of date within weeks. Tao’s own public characterization of AI for research moved across eighteen months:
The rate of motion is not slowing. The release of GPT-5.4 to GPT-5.5, for example, took roughly seven weeks (March 5 to April 23, 2026), and FrontierMath Tier 4 jumped from 38% to 39.6% (GPT-5.5 Pro) in that interval.
The labor data presented in this text is heavily concentrated in the US. The Federal Reserve null result and the Klarna reversal show real heterogeneity in how AI adoption translates to headcount. The mathematical edge framing presumes the smell test stays out of reach for current architectures. That is a claim about how today’s models are trained, not a prediction about what a future model will be able to do. If the architecture changes, the structural argument has to be reconsidered.
A final caveat. The corporate signaling cases (Salesforce, Shopify, Microsoft, Amazon) carry narrative weight and financial incentives in addition to hiring intent. CEO public statements are written for investors first; whether they reflect operational reality is a separate question. The two recurring gaps (when not chasms) are:
- Deliberately misstated claims;
- Selective framing.
Both typically serve to ride the narrative of the moment. Selective framing, in particular, shows up in efficiency gains attributed to AI when other cost programs were the bigger driver, or workforce changes attributed to automation when the cuts were already planned regardless of AI.
Read between the lines. These companies’ cash flow is more informative than their CEOs’ statements, as Michael Burry shows in his blog Cassandra Unchained.
Synthesis
The capability cliff and the labor bifurcation point to the same thing. Where the work is verifiable end to end (final answer benchmarks, engineering tasks with well-defined scope), AI has commoditized the supply. Where verification depends on the smell test (research grade proofs, novel claims, problem framing), the human role is structural: the scarce resource no one has commoditized.
The operational rule is short: let AI verify your work, and do not let it generate work you cannot verify. The skill behind that rule is the smell test, the mathematician’s feel, trained on years of real mathematical content. That is the part the labor data and the benchmarks both keep revealing as scarce.
Further reading
Talk slides
- When Code Becomes a Commodity: The Mathematician’s Edge — English (PDF) — UFSC Mathematics Department, 24 April 2026
- Quando o Código Vira Commodity: o Trunfo do Matemático — Português (PDF) — Departamento de Matemática da UFSC, 24 de abril de 2026
Primary sources
- Klowden, T. & Tao, T. Mathematical Methods and Human Thought in the Age of AI (arXiv:2603.26524, March 2026)
- Aaronson, S. Ten Signs a Claimed Mathematical Breakthrough is Wrong (Shtetl-Optimized, January 2008)
- Thurston, W. P. On Proof and Progress in Mathematics (arXiv:math/9404236, 1994; reprinted in 18 Unconventional Essays on the Nature of Mathematics, R. Hersh, ed., Springer, 2006)
- Tao, T. Mathematical exploration and discovery at scale (personal blog, 5 November 2025)
- Tao, T. AI is ready for primetime (OpenAI Academy at IPAM, 6 March 2026)
- Tao, T. Mastodon post on AI mathematical capability (@tao on mathstodon.xyz, September 2024)
Capability benchmarks (math)
- De Koninck et al. Open Proof Corpus (arXiv:2506.21621, ETH Zurich, June 2025)
- FormalProofBench (arXiv:2603.26996, March 2026)
- Apple Machine Learning Research. The Illusion of Thinking (2025)
- Self-consistency hallucination detection in LLM reasoning chains (arXiv:2504.09440)
- Diagnosing algebraic reasoning failures: 20-30 parallel-branch ceiling (arXiv:2604.06799)
- DeepMind. AI achieves silver-medal standard at the IMO (2024)
- DeepMind. Gemini Deep Think achieves gold-medal standard at the IMO (2025)
- Epoch AI. FrontierMath benchmark
- DeepMind. AlphaGeometry 2 (arXiv:2502.03544)
- Diez, da Maia, Nourdin. Mathematical research with GPT-5 (arXiv:2509.03065)
- Axiom Math. AxiomProver / Putnam 2025
- Anthropic. Claude Mythos Preview System Card and Project Glasswing
- MATH-500 leaderboard
Labor and economic data
- Brynjolfsson, Chandar & Chen. Canaries in the Coal Mine? Six Facts About the Recent Employment Effects of AI (Stanford Digital Economy Lab, 13 November 2025)
- Hosseini Maasoum & Lichtinger. Generative AI as Seniority-Biased Technological Change (SSRN 5425555, 31 August 2025)
- Westby & Sasser Modestino. The Impact of Generative AI on Job Opportunities for Junior Software Developers (Northeastern, 16 June 2025)
- LeadDev. AI Impact Report 2025 (54% of engineering leaders expect fewer junior hires) — PDF
- US BLS Occupational Employment Statistics: Programmers (15-1251) and Software Developers (15-1252)
- US BLS Occupational Outlook Handbook: Software Developers (+15% through 2034)
- Federal Reserve Bank of New York. The Labor Market for Recent College Graduates (CS graduate unemployment data)
- The Register. UK tech grad hiring crashes 46% as bots do junior work (October 2025; reporting on Institute of Student Employers data)
- Levels.fyi AI Engineer Compensation Trends Q3 2025
- LinkedIn News. Jobs on the Rise 2025
- Stack Overflow. Developer Survey 2025 (84% of developers using or planning to use AI coding tools)
- GeekWire. AI-native startup survey (late 2025) (68% of AI-native startups have AI writing >80% of production code)
Industry signal
- McKinsey. State of AI 2025 (39% of enterprises report measurable EBIT impact)
- IBM Institute for Business Value. CEOs Double Down on AI While Navigating Enterprise Hurdles (May 2025; 2,000 CEOs across 33 countries; 25% of AI initiatives delivered expected ROI)
- Michael Burry. Cassandra Unchained (Substack; cash-flow-driven reading of corporate AI claims)
- Vanalphen / Faktion. Smart Isn’t Enough: Why AI Needs to Understand Like a Human Expert (June 2025)
- Lenny’s Podcast. Head of Claude Code: What happens after coding is solved (Boris Cherny, Feb 2026)
- Fortune. As Klarna flips from AI-first to hiring people again (May 2025)
- Fortune. Prompt engineer title falls off the map (May 2025) (citing Indeed Hiring Lab)
- Federal Reserve. Monitoring AI Adoption in the US Economy (March 2026; null result on AI adoption vs. hiring)
- Tech.co. Companies That Have Replaced Workers with AI in 2025 and 2026 (Salesforce zero hires)
- CNBC. Shopify CEO: Prove AI can’t do jobs before asking for more headcount (April 2025)
- TechCrunch. Microsoft CEO says up to 30% of the company’s code was written by AI (April 2025; context for May 2025 layoffs)
- Computer Weekly. Amazon links planned mass layoff to AI (October 2025)
- OpenAI. Harness Engineering (>1M lines of code with zero human authors)
- India Today. Cardiologist builds AI health platform in 7 days, wins 3rd prize at hackathon (Feb 2026)
Tools and toolkit
- Karpathy, A. Software 3.0 (AI Startup School keynote, 17 June 2025)
- OpenAI. GPT-5 mathematical discovery (Ryu convex-optimization improvement, August 2025)
- Lean 4 official site — language and toolchain
- Mathlib (leanprover-community) and Mathlib stats (1.9M lines, 500+ contributors)
- AlphaProof — olympiad-level formal reasoning (Nature) (peer-reviewed write-up of the IMO-2024 silver result)
- Moonshot AI. Kimina-Prover-Preview (>80% MiniF2F)
- Wolfram. Foundation Tool (callable from Claude / GPT)
- Field, T. & Beale. Vibe Coding for the Fearless (Autodesk University 2025) and APS Blog companion post
- Anthropic pricing, OpenAI pricing, Google AI Ultra subscriptions
- CFA Institute. AI Top of Mind in 2024 Asset Manager Survey (64% of investment professionals upskilling on AI/ML)