When Code Becomes a Commodity: The Mathematician's Edge
In February 2026, a lawyer won the Anthropic Hackathon. A cardiologist took third with an AI care platform built in seven days. Thirteen thousand applicants competed; neither wrote code. Over the same window, the US Bureau of Labor Statistics reports programmer roles down 27.5% from 2023 to 2025, while Levels.fyi tracks the new “AI Engineer” specialty at an 18.7% staff-level compensation premium, named by LinkedIn the fastest-growing role for the second year running.
If software is becoming cheap, what stays scarce?
This piece works through three sources of evidence: AI capability benchmarks in mathematics, US labor market data on engineering roles, and a 2026 paper by Klowden and Tao on the structure of human work alongside AI. The numbers reflect an April 2026 snapshot. The argument is structural, not predictive: the boundary where AI stops working has a shape, and that shape rewards a specific kind of human judgment.
Based on a talk delivered at the Mathematics Department of UFSC on 24 April 2026 to math PhD and master’s students. Slides: English (PDF) · Português (PDF).
Key terms
- IMO (International Mathematical Olympiad): annual high school competition; standard reference benchmark for AI mathematical reasoning.
- FrontierMath Tier 4 (Epoch AI benchmark): research grade mathematics problems designed to remain unsaturated by current models.
- Lean 4 / Mathlib: a formal proof language and its core library; Mathlib carries roughly 1.9 million lines of verified mathematics maintained by 500-plus contributors.
- Smell test (Klowden & Tao §4.2): the skepticism mathematicians apply before reading a proof line by line, used to detect bad arguments early.
- Odorless proof (K&T §4.6 n16): a formally correct output that lacks the causal narrative humans use to judge mathematical work.
- Blue team / red team (K&T §6.2): generation versus verification roles, borrowed from cybersecurity. K&T argue AI is relatively safe on the red team, unsafe on the blue.
- Software 3.0 (Karpathy, June 2025): software produced by description in natural language, distinct from code humans write (1.0) or models humans train (2.0).
Why this time is different
Past automations affected the periphery of intellectual work: the printing press, the typewriter, LaTeX, automated dissemination. They changed how ideas were copied, transmitted, and rendered. The work of generating ideas, of deciding what to write, stayed human. Klowden and Tao open their 2026 paper with this distinction. Modern AI is the first general purpose technology that automates the creation step itself, not just the dissemination around it.
The point matters for what follows. If a tool only automates dissemination, the question “what is scarce” stays trivially answered: the production of new ideas. Once the tool starts producing outputs that look like ideas (proofs, drafts, code, diagrams), the question becomes structural. Rhetoric will not answer it. The honest place to look is where the tool stops working.
Two benchmark surfaces, very different scores
By April 2026, leading models have effectively saturated competition mathematics and stalled on research grade work.
| Benchmark | Type | Frontier score (April 2026) |
|---|---|---|
| MATH-500 | High school + early college | 95-99% (Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, DeepSeek-R1) |
| AIME 2025 | US olympiad qualifier | 100% with code execution; 99.8% without |
| GSM-8K | Grade school word problems | ~100% |
| IMO 2025 | International Olympiad | 35/42, gold medal level (Gemini Deep Think, certified by coordinators) |
| FrontierMath Tier 4 | Research grade | ~38% (GPT-5.4 Pro, March 2026; up from 19% five months earlier) |
| FormalProofBench (Lean 4) | Verified proofs | ~33.5% (frontier model) |
| Open Proof Corpus | Olympiad proofs, graded by humans | ~43% pass full human inspection |
What it shows: a roughly 60 percentage point gap between problems where only the final answer matters and problems where the full reasoning matters. The gap is closing on the easier side and persisting on the harder one.
The IMO is the cleanest single illustration. In 2024, DeepMind’s AlphaProof scored 28/42 using formal Lean 4 proofs and several days of compute per problem. In 2025, Gemini Deep Think scored 35/42 in natural language inside the 4.5 hour exam window, certified by IMO coordinators.
One year between a specialist in formal language and a general purpose reasoner at gold medal level. The same pattern holds across the standard benchmarks: MATH-500, AIME 2025, GSM-8K. A few years ago these were comfortably out of reach. Today they no longer differentiate between the leading models. They are silent.
If you stop at the press releases, the story sounds triumphant. The interesting work starts below the cliff.
What the cliff is made of
The Open Proof Corpus, a 2025 benchmark of 5,062 olympiad proofs evaluated by thirteen human graders across six frontier models, isolates the structure of the gap. Producing a correct final answer outperforms producing a correct full proof by roughly 30 percentage points across the board. Aggregated, only about 43% of proofs produced by models pass human inspection. The 2026 FormalProofBench update confirms the gap persists: the best model reaches 33.5% on formal Lean proofs while the same model approaches 100% on final answer benchmarks. The cliff is not closing on the side that matters for research.
Three failure modes recur in the recent literature:
- A single hallucinated step invalidates an entire chain of derivation.
- Working memory collapses around 20 to 30 parallel branches; the ceiling does not move with parameter count.
- Apple’s research group documented “complete accuracy collapse beyond certain complexities” in 2025: reasoning effort grows for a while, then declines, as if the system gives up.
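The first failure mode compounds mechanically. Under an independence assumption that is itself a simplification, per-step accuracy p gives an n-step derivation a survival probability of p^n; the numbers below are illustrative, not taken from any cited study.

```python
# One hallucinated step kills the chain: assuming each step is
# independently correct with probability p, a full n-step derivation
# survives with probability p**n. Illustrative only; real error
# processes in model reasoning are not independent.

def chain_survival(p: float, n: int) -> float:
    """Probability that an n-step derivation contains no incorrect step."""
    return p ** n

# Even 99% per-step accuracy leaves a 100-step proof more likely
# broken than not:
print(chain_survival(0.99, 100))
```

This is why final-answer benchmarks and full-proof benchmarks diverge: a near-miss still scores on the former and is worthless on the latter.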
Klowden and Tao name the deepest version of this failure “odorless proofs.” AlphaProof’s IMO 2024 solutions were verified correct by Lean. They also contained redundant steps and inexplicable jumps. The proofs were formally right without being legibly right: they lacked what Thurston (1994/2006) identified as the mark of good mathematics, a causal narrative that lets a reader see why the conclusion follows, not just that it does. The system passes the verifier and fails the smell test the mathematician would apply before opening Lean.
That distinction sets up the rest of the argument. The labor market is registering the same boundary from a different angle.
The labor data, read alongside the cliff
A note on framing. This analysis reads the software engineering job market through eight years of industry experience, including machine learning consulting and quantitative finance. The mathematical material is closer to the author’s home ground than the labor-market material; the bias is flagged because the corporate signal cases below skew toward what the author has personally observed.
The empirical record on junior software engineering is now thick.
| Source | Population | Finding |
|---|---|---|
| Stanford Digital Economy Lab (ADP) | US developers 22-25 | -20% employment from late-2022 peak |
| Harvard / Revelio Labs | 62M workers, 285K firms | Junior employment in firms that adopt GenAI -7.7% after 6 quarters vs. controls |
| Northeastern (Lightcast) | US vacancies | -16.3% relative junior to senior posting ratio |
| Institute of Student Employers (UK) | Entry level tech | -46% over 2024 |
| NY Fed College Labor Market | CS graduates | unemployment ~6.1% |
| Levels.fyi | “AI Engineer” specialty | +18.7% staff compensation premium |
| US BLS | Software developers (aggregate) | Projected +15% growth through 2034 |
| LeadDev AI Impact Report (2025) | Engineering leaders | 54% expect fewer junior hires over the long term |
What it shows: bifurcation, not extinction. The programmer role focused on execution is shrinking; the design oriented role that collaborates with AI is expanding. Aggregate growth and the collapse at entry level are both true at once. They describe two different jobs that happen to share one historical label.
Corporate signaling lines up with the data, though the signal is mixed with narrative. Salesforce announced zero new engineer hires for fiscal year 2026. Shopify added a policy requiring written justification that AI cannot do the work before any new headcount can be approved. Microsoft attributed roughly 6,000 layoffs in May 2025 partly to internal AI productivity, citing 20 to 30% of new code being written by AI. Amazon announced 14,000 corporate cuts in October 2025 with AI cited.
A counter finding tempers the picture. The Federal Reserve, in a March 2026 study covering more than a million firms, found no statistically significant link between AI adoption and aggregate hiring. Klarna, the headline case for AI driven workforce reduction, cut headcount from roughly 5,500 to 3,000 while pushing AI, watched revenue per employee multiply about fourfold from $300K to $1.3M, and then partially reversed in May 2025, rehiring humans after the quality of customer facing work fell.
In summary, the labor data does not yet support strong claims in either direction. Where AI substitutes for human work, where it does not, and how much of the recent shift can be attributed to AI specifically are open questions the evidence has not settled. One personal data point, offered as such. A treasury role the author previously led with two or three colleagues is, today, feasible alone with AI. In a separate context, the author currently works as a solo risk scientist who is also the project’s main engineer. It is the kind of perception that macro evidence will eventually have to confirm or rule out.
Where the bottleneck moved
McKinsey’s 2025 State of AI report finds that only 39% of enterprises that adopted generative AI report measurable EBIT (earnings before interest and taxes) impact. IBM’s institute reports that one in four AI projects delivers promised returns. Faktion’s June 2025 essay diagnoses the recurring failure mode: “knowledge is scattered across documentation, legacy systems, and most critically, in the heads of a few seasoned experts.” Boris Cherny, head of Claude Code at Anthropic, on Lenny’s Podcast in February 2026: “the core problem now is knowing what the hell you actually want to build.”
The constraint has shifted from engineering capacity to problem framing, domain knowledge, and the ability to say what success looks like in a setting the AI cannot infer. The visible employer demand follows that shift:
- Alignment research at Anthropic and the MATS program, hiring for reward hacking analysis and work on training dynamics.
- Formal verification specialists, $70-150 per hour through Alignerr, requiring Lean 4 and Mathlib fluency.
- ML theory: a thin field with strong appetite for interpretability and generalization work, where the absence of a settled mathematical framework is itself the opportunity.
- Quantitative finance: stochastic modeling combined with machine learning, sustained demand for quants who can do both.
- Operations research: the work of formulating problems. Defining variables, constraints, and the objective function. That formulation step is exactly what the AI cannot do without a human who knows what is being optimized.
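The formulation step in that last item can be made concrete. In the sketch below, the numbers are invented for illustration; the human contribution is the variables, constraints, and objective, and everything after that list is mechanical (a real problem would go to an LP or MIP solver rather than vertex enumeration).

```python
# The formulation step in miniature: the human decides what the
# variables, constraints, and objective are; solving is the easy part.
# Toy maximization with made-up numbers, solved by enumerating the
# vertices of a 2-variable feasible region.
from itertools import combinations

def objective(x, y):  # maximize 3x + 2y
    return 3 * x + 2 * y

# Each constraint as (a, b, c), meaning a*x + b*y <= c.
constraints = [(1, 1, 4), (1, 0, 3), (-1, 0, 0), (0, -1, 0)]

def feasible(x, y):
    return all(a * x + b * y <= c + 1e-9 for a, b, c in constraints)

def intersect(c1, c2):
    """Intersection of the two constraints' boundary lines, if unique."""
    (a1, b1, d1), (a2, b2, d2) = c1, c2
    det = a1 * b2 - a2 * b1
    if det == 0:
        return None
    return ((d1 * b2 - d2 * b1) / det, (a1 * d2 - a2 * d1) / det)

# An LP optimum lies at a vertex of the feasible region, so checking
# all pairwise boundary intersections suffices in two variables.
vertices = [p for c1, c2 in combinations(constraints, 2)
            if (p := intersect(c1, c2)) is not None and feasible(*p)]
best = max(vertices, key=lambda p: objective(*p))
```

Swapping the constraint list changes the answer entirely; that sensitivity is what makes the formulation, not the solve, the human-judgment step.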
These are domains where rigor and proof cannot be automated, and where AI amplifies the mathematician without substituting for the mathematician’s judgment. The bifurcation in the labor data and the cliff in the capability data point to the same gap.
The smell test, formalized
Mathematicians have a name for the missing skill, though it does not appear in any AI lab’s deck. Scott Aaronson’s 2008 blog post Ten Signs a Claimed Mathematical Breakthrough is Wrong lists ten heuristics experienced mathematicians apply, often subconsciously, to detect bad arguments well before line by line verification:
- The authors do not use TeX.
- The authors do not understand the question.
- The approach implies something much stronger and probably false.
- The approach contradicts a known impossibility result.
- The authors quietly switch from “we prove” to “seems to work in all cases we tried” partway through.
- The paper jumps to technicalities without a new idea.
- The paper does not build on prior work.
- The paper wastes space defining standard terms.
- The paper waxes poetic about practical implications.
- The techniques look too weak for the problem.
Klowden and Tao §4.2 call this the “smell test” and argue it is the layer of mathematical practice most resistant to current model architectures. The argument is mechanical, not mystical. LLMs are trained to produce outputs that look correct, which is exactly the failure mode the smell test was built to catch. Several of Aaronson’s signs (the weasel words at #5, the technicalities without idea at #6, the rhetoric about implications at #9) are patterns current models exhibit by construction. Catching them takes three things:
- Deep familiarity with prior work.
- Intuition for which kinds of claims tend to be true.
- A feel for how mathematical communities react to assertions.
None of these are reliably produced by next token training. This explains the cliff’s location. The cliff is not arbitrary. It tracks the boundary between problems where the smell test is unnecessary (final answer arithmetic at AIME level) and problems where the smell test is the skill that holds the work up (research grade proof verification, novel claim assessment, paper refereeing). Scaling raw capability has not closed that boundary, and the architecture being scaled does not target it.
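As a toy illustration only: one of Aaronson’s signs (#5, the slide from proof language into empirical language) admits a crude lexical screen. The real smell test is semantic and experiential, which is exactly why next token training does not reproduce it; the phrase lists below are invented for the sketch.

```python
# A purely lexical caricature of one smell-test heuristic: flag drafts
# that mix proof claims with empirical hedging. Illustrative phrase
# lists; genuine smell-testing is semantic, not string matching.

PROOF_WORDS = {"we prove", "it follows that", "therefore"}
WEASEL_WORDS = {"seems to work", "in all cases we tried", "appears to hold"}

def weasel_flag(draft: str) -> bool:
    """True if the draft claims proof while hedging empirically."""
    text = draft.lower()
    claims = any(w in text for w in PROOF_WORDS)
    hedges = any(w in text for w in WEASEL_WORDS)
    return claims and hedges
```

Note what the sketch cannot do: signs #2, #3, and #10 require knowing the field, which is the point of the paragraph above.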
The current toolkit
The frontier of “AI in math research” is concrete, not hypothetical. November 2025: Terence Tao published a public write up describing a year of collaboration with Google DeepMind. AlphaEvolve plus Gemini Deep Think were applied to 67 open mathematical problems across combinatorics, analysis, and number theory, producing roughly 20 genuinely new results, with a subset of proofs passing formal verification in Lean 4. The pipeline:
```mermaid
flowchart LR
    A[AlphaEvolve discovers candidate] --> B[Deep Think sketches proof]
    B --> C[AlphaProof formalizes in Lean]
    C --> D[Lean 4 verified]
```
Tao selects and evaluates the problems at the start of the loop and judges the resulting insight at the end. The middle is automated. The boundaries are not.
A more accessible version of the same shift: in August 2025, Ernest Ryu (Seoul National University and UCLA) used GPT-5 as a research collaborator and improved a forty year old bound in operator splitting and first-order convex optimization by roughly 50%, published as a rigorous note rather than a heuristic. Same pattern. The human picks the problem and judges the insight; the machine accelerates the middle.
The infrastructure is open and documented. Lean 4 with Mathlib carries roughly 1.9 million lines of formally verified mathematics. LeanCopilot from Caltech suggests Lean tactics during proofs. DeepSeek-Prover-V2 decomposes problems into subgoals. Kimina-Prover from Moonshot was the first prover to clear 80% on MiniF2F. None of this requires special access. The frontier-only systems (AlphaProof, AlphaEvolve, the gated Claude Mythos preview that scored 97.6% on USAMO 2026 inside Anthropic’s Project Glasswing) extend the ceiling. The public toolchain is enough to make a research contribution today.
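For readers who have not seen the formal side, a minimal Lean 4 example against Mathlib; the statement is deliberately trivial and illustrative, not drawn from any benchmark above.

```lean
-- Kernel-checked commutativity of addition on ℕ. "Verified" in the
-- benchmarks above means exactly this: the Lean kernel accepts the term.
import Mathlib.Tactic

example (a b : ℕ) : a + b = b + a := Nat.add_comm a b
```

The kernel certifies that the term proves the statement; whether the statement was worth proving is the judgment the kernel cannot supply.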
A secondary shock matters here, especially for people who do not write code. Andrej Karpathy’s “Software 3.0” framing from June 2025 names a new mode: programs humans describe in natural language.
- Software 1.0: code humans write.
- Software 2.0: models humans train.
- Software 3.0: software produced by description in natural language.
Trent Field, who does not write code professionally, works at Arkance; at Autodesk University 2025 he presented a working web application he built solo in 48 hours using Cursor. The lawyer and cardiologist from the opening are not exceptions. They are the early end of a pattern.
The operating rule
Klowden and Tao §6.2 propose a rule borrowed from cybersecurity. AI is relatively safe to use as a red team (reviewing content generated by humans for errors or suggested improvements) and unsafe to trust as a blue team (generating structural content beyond what a red team can verify). The red team can be human reviewers, formal proof assistants like Lean, or other AI systems running in a verification capacity. What the blue team produces must remain inside what the red team can check.
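The asymmetry behind the rule is that verification is typically far cheaper than generation. A toy sketch, with integer factoring standing in for any generate-then-verify loop; a real pipeline would put a proof checker such as Lean in the verifier’s seat.

```python
# Generation vs. verification asymmetry, the structural reason K&T put
# AI on the red team: checking a claimed answer is often far cheaper
# than producing it. Finding factors of a large n is hard; checking a
# claimed factorization is a multiplication.

def verify_factorization(n: int, factors: list[int]) -> bool:
    """Cheap red-team check: do the claimed nontrivial factors multiply back?"""
    product = 1
    for f in factors:
        product *= f
    return product == n and len(factors) >= 2 and all(1 < f < n for f in factors)

# A blue team (human or AI) proposes; the red team checks in linear time.
print(verify_factorization(91, [7, 13]))   # a correct proposal passes
print(verify_factorization(91, [3, 31]))   # a plausible-looking one fails
```

The rule fails precisely where this asymmetry fails: when no cheap verifier exists, the blue-team output is unconstrained.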
The rule is easy to state and easy to violate. While drafting the source talk for this analysis, the author asked an AI to write a slide explaining the K&T framing. It produced “AI is potent on the blue team, the red team is a human function,” which inverts the paper’s claim. The author caught the inversion on first read. Two further rounds of feedback to the AI failed to surface it; the wrong slide kept coming back. K&T’s thesis demonstrated itself: the smell test is the layer no current model reliably has, and it is what produced the corrected paragraph.
A second instance of the same effect surfaced after the talk. A graduate student who attended pointed out that the slides used em-dashes liberally, a tell for prose authored by AI. He was not checking the math. He was applying the smell test to the medium and reading the signature on the surface.
The operating rule, then, in two sentences. Let AI verify your work, not generate work you cannot verify. The skepticism that comes before verification is the part the workflow has to supply, because the model will not.
Limitations
The benchmark numbers in this piece reflect April 2026. AI capability claims age fast; expect the specific figures to be out of date within months (or weeks). Tao’s own public characterization of leading models moved considerably across eighteen months, from the September 2024 Mastodon assessment to the March 2026 “AI is ready for primetime” talk listed in Further reading. The rate of motion is not slowing.
The labor data is heavily weighted to the US. The Federal Reserve null result and the Klarna reversal show real heterogeneity in how AI adoption translates to headcount. The mathematical edge framing presumes the smell test stays out of reach for current architectures. That is a claim about how today’s models are trained, not a prediction about what a future model could do. If the architecture changes, the structural argument has to be reconsidered; the labor implications would shift before the capability evidence catches up.
A final caveat. The corporate signaling cases (Salesforce, Shopify, Microsoft, Amazon) carry narrative weight and stock price incentives in addition to hiring intent. CEO public statements are written for investors first; whether they reflect operational reality is a separate empirical question. The two recurring gaps are deliberate misstatement and selective framing. Selective framing looks like efficiency gains attributed to AI when other cost programs were the larger driver, or headcount changes attributed to automation when attrition was already scheduled. Read between the lines. A good way to do that is to follow the cash flow of these companies, as Michael Burry does on his Cassandra Unchained blog.
Takeaway
The capability cliff and the labor bifurcation point to the same thing. Where the work is verifiable end to end (final answer benchmarks, well scoped engineering tasks), AI has commoditized the supply. Where verification depends on the smell test (research grade proofs, novel claims, problem framing), the human role is structural, not transitional.
The operational rule is short. Let AI verify your work, and do not let it generate work you cannot verify. The skill behind that rule is the skepticism a mathematician applies before reaching for verification tools, trained on years of real mathematical content. That is the part the labor data and the benchmarks both keep revealing as scarce.
Further reading
Talk slides
- When Code Becomes a Commodity: The Mathematician’s Edge — English (PDF) — UFSC Mathematics Department, 24 April 2026
- Quando o Código Vira Commodity: o Trunfo do Matemático — Português (PDF) — Departamento de Matemática da UFSC, 24 de abril de 2026
Primary sources
- Klowden, T. & Tao, T. Mathematical Methods and Human Thought in the Age of AI (arXiv:2603.26524, March 2026)
- Aaronson, S. Ten Signs a Claimed Mathematical Breakthrough is Wrong (Shtetl-Optimized, January 2008)
- Thurston, W. P. On Proof and Progress in Mathematics (arXiv:math/9404236, 1994; reprinted in 18 Unconventional Essays on the Nature of Mathematics, R. Hersh, ed., Springer, 2006)
- Tao, T. Mathematical exploration and discovery at scale (personal blog, 5 November 2025)
- Tao, T. AI is ready for primetime (OpenAI Academy at IPAM, 6 March 2026)
- Tao, T. Mastodon post on AI mathematical capability (@tao on mathstodon.xyz, September 2024)
Capability benchmarks (math)
- De Koninck et al. Open Proof Corpus (arXiv:2506.21621, ETH Zurich, June 2025)
- FormalProofBench (arXiv:2603.26996, March 2026)
- Apple Machine Learning Research. The Illusion of Thinking (2025)
- Self-consistency hallucination detection in LLM reasoning chains (arXiv:2504.09440)
- Diagnosing algebraic reasoning failures: 20-30 parallel-branch ceiling (arXiv:2604.06799)
- DeepMind. AI achieves silver-medal standard at the IMO (2024)
- DeepMind. Gemini Deep Think achieves gold-medal standard at the IMO (2025)
- Epoch AI. FrontierMath benchmark
- DeepMind. AlphaGeometry 2 (arXiv:2502.03544)
- Diez, da Maia, Nourdin. Mathematical research with GPT-5 (arXiv:2509.03065)
- Axiom Math. AxiomProver / Putnam 2025
- Anthropic. Claude Mythos Preview System Card and Project Glasswing
- MATH-500 leaderboard
Labor and economic data
- Brynjolfsson, Chandar & Chen. Canaries in the Coal Mine? Six Facts About the Recent Employment Effects of AI (Stanford Digital Economy Lab, 13 November 2025)
- Hosseini Maasoum & Lichtinger. Generative AI as Seniority-Biased Technological Change (SSRN 5425555, 31 August 2025)
- Westby & Sasser Modestino. The Impact of Generative AI on Job Opportunities for Junior Software Developers (Northeastern, 16 June 2025)
- LeadDev. AI Impact Report 2025 (54% of engineering leaders expect fewer junior hires) — PDF
- US BLS Occupational Employment Statistics: Programmers (15-1251) and Software Developers (15-1252)
- US BLS Occupational Outlook Handbook: Software Developers (+15% through 2034)
- Federal Reserve Bank of New York. The Labor Market for Recent College Graduates (CS graduate unemployment data)
- The Register. UK tech grad hiring crashes 46% as bots do junior work (October 2025; reporting on Institute of Student Employers data)
- Levels.fyi AI Engineer Compensation Trends Q3 2025
- LinkedIn News. Jobs on the Rise 2025
- Stack Overflow. Developer Survey 2025 (84% of developers using or planning to use AI coding tools)
- GeekWire. AI-native startup survey (late 2025) (68% of AI-native startups have AI writing >80% of production code)
Industry signal
- McKinsey. State of AI 2025 (39% of enterprises report measurable EBIT impact)
- IBM Institute for Business Value. CEOs Double Down on AI While Navigating Enterprise Hurdles (May 2025; 2,000 CEOs across 33 countries; 25% of AI initiatives delivered expected ROI)
- Michael Burry. Cassandra Unchained (Substack; cash-flow-driven reading of corporate AI claims)
- Vanalphen / Faktion. Smart Isn’t Enough: Why AI Needs to Understand Like a Human Expert (June 2025)
- Lenny’s Podcast. Head of Claude Code: What happens after coding is solved (Boris Cherny, Feb 2026)
- Fortune. As Klarna flips from AI-first to hiring people again (May 2025)
- Fortune. Prompt engineer title falls off the map (May 2025) (citing Indeed Hiring Lab)
- Federal Reserve. Monitoring AI Adoption in the US Economy (March 2026; null result on AI adoption vs. hiring)
- Tech.co. Companies That Have Replaced Workers with AI in 2025 and 2026 (Salesforce zero hires)
- CNBC. Shopify CEO: Prove AI can’t do jobs before asking for more headcount (April 2025)
- TechCrunch. Microsoft CEO says up to 30% of the company’s code was written by AI (April 2025; context for May 2025 layoffs)
- Computer Weekly. Amazon links planned mass layoff to AI (October 2025)
- OpenAI. Harness Engineering (>1M lines of code with zero human authors)
- India Today. Cardiologist builds AI health platform in 7 days, wins 3rd prize at hackathon (Feb 2026)
Tools and toolkit
- Karpathy, A. Software 3.0 (AI Startup School keynote, 17 June 2025)
- OpenAI. GPT-5 mathematical discovery (Ryu convex-optimization improvement, August 2025)
- Lean 4 official site — language and toolchain
- Mathlib (leanprover-community) and Mathlib stats (1.9M lines, 500+ contributors)
- AlphaProof — olympiad-level formal reasoning (Nature) (peer-reviewed write-up of the IMO-2024 silver result)
- Moonshot AI. Kimina-Prover-Preview (>80% MiniF2F)
- Wolfram. Foundation Tool (callable from Claude / GPT)
- Field, T. & Beale. Vibe Coding for the Fearless (Autodesk University 2025) and APS Blog companion post
- Anthropic pricing, OpenAI pricing, Google AI Ultra subscriptions
- CFA Institute. AI Top of Mind in 2024 Asset Manager Survey (64% of investment professionals upskilling on AI/ML)