In February 2026, a lawyer won the Anthropic Hackathon. A cardiologist took third with an AI care platform built in seven days. Thirteen thousand applicants competed; neither winner wrote code. Over the same window, the US Bureau of Labor Statistics reports programmer roles down 27.5% from 2023 to 2025, while Levels.fyi tracks the new “AI Engineer” specialty paying an 18.7% staff premium and named by LinkedIn the role growing fastest for the second year running.

That raises the question: if code is becoming a commodity, what stays scarce?

This piece works through three sources of evidence: AI capability benchmarks in mathematics, US labor market data on engineering roles, and a 2026 paper by Klowden and Tao on the structure of human work alongside AI. The numbers reflect an April 2026 snapshot. The argument is structural, not predictive: the boundary where AI stops working has a shape, and that shape rewards a specific kind of human judgment.

Based on a talk delivered at the Mathematics Department of UFSC on 24 April 2026 to math PhD and master’s students. Slides: English (PDF) · Português (PDF).

Key terms

  • IMO (International Mathematical Olympiad): annual high school competition; standard reference benchmark for AI mathematical reasoning.
  • USAMO (USA Mathematical Olympiad): the US national qualifier for the IMO; proof-based problems graded by human judges.
  • FrontierMath Tier 4 (Epoch AI benchmark): research grade mathematics problems designed to remain unsaturated by current models.
  • Lean 4 / Mathlib: a formal proof language and its core library; Mathlib carries roughly 1.9 million lines of verified mathematics maintained by 500-plus contributors.
  • Smell test (Klowden & Tao §4.2): the skepticism mathematicians apply before reading a proof line by line, used to detect bad arguments early.
  • Odorless proof (K&T §4.6 n16): a formally correct output that lacks the causal narrative humans use to judge mathematical work.
  • Blue team / red team (K&T §6.2): generation versus verification roles, borrowed from cybersecurity. K&T argue AI is relatively safe on the red team, unsafe on the blue.
  • Software 3.0 (Karpathy, June 2025): software produced by description in natural language, distinct from code humans write (1.0) or models humans train (2.0).

Why this time is different

Past automations affected the periphery of intellectual work: the printing press, the typewriter, LaTeX, automated dissemination. They changed how ideas were copied, transmitted, and rendered. The work of generating ideas, of deciding what to write, stayed human. Klowden and Tao open their 2026 paper with this distinction. Modern AI is the first general purpose technology that automates the creation step itself, not just the dissemination around it.

The point matters for what follows. If a tool only automates dissemination, the question “what is scarce” stays trivially answered: the production of new ideas. Once the tool starts producing outputs that look like ideas (proofs, drafts, code, diagrams), the question becomes structural. Rhetoric will not answer it. The honest place to look is where the tool stops working.

Criterion and difficulty: the gap has two sources

By April 2026, leading models saturate above 99% on competition mathematics when only the final answer is graded. They lose ground along two axes: when the criterion shifts to graded proofs, and when the difficulty rises to research level.

BenchmarkTypeCriterion gradedScore (April 2026)
GSM-8KGrade schoolfinal answer~100%
AIME 2025US olympiad qualifierfinal answer (integer)100% with code; 99.8% without
MATH-500High school + early collegefinal answer95-99% (Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, DeepSeek-R1)
FrontierMath Tier 4Research gradefinal answer~38% (GPT-5.4 Pro, March 2026; up from 19% five months earlier)
IMO 2025International Olympiadfull proof, IMO judges35/42 ≈ 83% (Gemini Deep Think)
Open Proof CorpusOlympiad proofsfull proof, human-graded~43% (5,062 proofs, 6 LLMs, 13 judges)
FormalProofBench (Lean 4)Graduate proofsformal proof in Lean~33.5% (frontier model)

What the table shows: two factors push the scores down at once.

On the same problem set, changing only the criterion (from “final answer correct” to “full proof correct”) is enough to drop the score. Over the same 5,062 problems, the Open Proof Corpus measured drops of 8 percentage points (Gemini-2.5-Pro) to 30 points (o3), depending on the model.

Holding the final-answer criterion fixed, difficulty weighs even more. AIME 2025 (olympiad qualifier) clears 99%; FrontierMath Tier 4, with the same criterion but research-grade problems, sits at ~38%. That is roughly 60 percentage points coming from the difficulty jump alone.

Crossing the two axes, Open Proof Corpus lands at 43% (olympiad problems with human-graded proofs) and FormalProofBench at 33.5% (graduate problems with formal Lean proofs).

The difference is not measurement noise. It tracks two boundaries at once: criterion (right answer vs verifiable proof) and difficulty (solvable problem vs research problem). The rest of this piece is organized around these two axes.

IMO 2025 is the most visible crossing of the two axes: difficulty at the top of the pre-college olympiad pyramid, with proofs graded by the official judges reading the full argument. What stands out is that, even at this combination, the 35/42 ≈ 83% scored by Gemini Deep Think and the experimental OpenAI model already sits at gold medal level, and the jump to that mark was fast. In 2024, DeepMind’s AlphaProof scored 28/42 (silver medal) with formal Lean 4 proofs and several days of compute per problem. In 2025, Gemini Deep Think reached 35/42 in natural language inside the 4.5 hour exam window, certified by IMO coordinators. Twelve months later, what required formal Lean and days of compute now fits in natural language within the exam window. Even so, problem 6 went unsolved, and the gap to AIME’s 99.8% keeps the criterion as a live boundary.

Capability cliff: competition benchmarks (GSM8K, MATH-500, AIME 2025, Putnam 2025) sit near 100% while research-grade benchmarks (FormalProofBench, FrontierMath Tier 4) sit at 33-38%. The vertical drop is roughly 62 percentage points.
*Putnam 2025: AxiomProver reached 12/12 with post-window compute; 8/12 within the 6-hour competition window.

Final-answer benchmarks at olympiad level (MATH-500, AIME 2025, GSM-8K) no longer differentiate the leading models. A few years ago they sat comfortably out of reach. Today they are the baseline.

News of AI progress arrives in trumpets. But seen from the side of full proofs and research problems, the music is still violins. The interesting work starts below the cliff.

What the cliff is made of

The previous section’s numbers say the cliff is there; they do not say how it forms. Recent literature on reasoning failures in frontier models points to three recurring modes:

Klowden and Tao name the deepest version of this failure “odorless proofs.” AlphaProof’s IMO 2024 solutions were verified correct by Lean. They also contained redundant steps and inexplicable jumps. The proofs were formally right without being legibly right: they lacked what Thurston (1994/2006) identified as the mark of good mathematics, a causal narrative that lets a reader see why the conclusion follows, not just that it does. The system passes the verifier and fails the smell test the mathematician would apply before opening Lean.

That distinction sets up the rest of the argument. The labor market is registering the same boundary from a different angle.

The labor data, read alongside the cliff

A note on framing. This analysis reads the software engineering job market through eight years of industry experience, including machine learning consulting and quantitative finance. The mathematical material is closer to home than the early empirical material; the bias is flagged because the corporate signal cases below skew toward what the author has personally observed.

The empirical record on junior software engineering is now thick.

SourcePopulationFinding
Stanford Digital Economy Lab (ADP)US developers 22-25-20% employment from late-2022 peak
Harvard / Revelio Labs62M workers, 285K firmsJunior employment in firms that adopt GenAI -7.7% after 6 quarters vs. controls
Northeastern (Lightcast)US vacancies-16.3% relative junior to senior posting ratio
Institute of Student Employers (UK)Entry level tech-46% over 2024
NY Fed College Labor MarketCS graduatesunemployment ~6.1%
Levels.fyi“AI Engineer” specialty+18.7% staff compensation premium
US BLSSoftware developers (aggregate)Projected +15% growth through 2034
LeadDev AI Impact Report (2025)Engineering leaders54% expect fewer junior hires over the long term

What it shows: bifurcation, not extinction. The programmer role focused on execution is shrinking; the design oriented role that collaborates with AI is expanding. These are two different jobs that happen to share one historical label.

Corporate signaling lines up with the data, though the signal is mixed with narrative. Salesforce announced zero new engineer hires for fiscal year 2026. Shopify added a policy requiring written justification that AI cannot do the work before any new headcount can be approved. Microsoft attributed roughly 6,000 layoffs in May 2025 partly to internal AI productivity, citing 20 to 30% of new code being written by AI. Amazon announced 14,000 corporate cuts in October 2025 with AI cited.

But there are signs that complicate the picture. The Federal Reserve, in a March 2026 study covering more than a million firms, found no statistically significant link between AI adoption and aggregate hiring. Klarna, the headline case for AI driven workforce reduction, cut headcount from roughly 5,500 to 3,000 while pushing AI, watched revenue per employee multiply about fourfold from $300K to $1.3M, and then partially reversed in May 2025, rehiring humans after customer-service quality dropped (no surprises there). The BLS itself projects +15% aggregate growth for the sector through 2034.

But in fact, there is little statistical support for any conclusive claim. Where AI can effectively substitute for a human, and how much of the recent shift can be attributed specifically to it, are open questions that evidence and time have not yet clarified.

A personal observation. A treasury operation the author previously managed and developed with two or three colleagues today fits in one person, with AI. More recently, he works as a solo risk scientist, also serving as the project’s lead engineer. It is the kind of observation that the macro evidence will eventually have to confirm or discard. The anecdote points in one direction: engineering capacity, which used to be the main bottleneck, has become a commodity.

Where the bottleneck moved

McKinsey’s 2025 State of AI report finds that only 39% of enterprises that adopted generative AI report measurable EBIT (earnings before interest and taxes) impact. IBM’s institute reports that one in four AI projects delivers promised returns. Faktion’s June 2025 essay diagnoses the recurring failure mode: “knowledge is scattered across documentation, legacy systems, and most critically, in the heads of a few seasoned experts.” Boris Cherny, head of Claude Code at Anthropic, on Lenny’s Podcast in February 2026: “the core problem now is knowing what the hell you actually want to build.”

Both signals converge on the same place: when engineering stops being the bottleneck, what remains is the work of knowing what to build, a role carried mainly by domain specialists.

Stack-flip diagram: before AI, the bottleneck was Engineering Capacity (with Domain Knowledge and System Design above); after AI, Engineering Capacity is commoditized and the bottleneck moves up to Domain Knowledge.

The constraint has shifted from engineering capacity to problem framing, domain knowledge, and the ability to say what success looks like in a setting the AI cannot infer. Five niches in mathematics make that boundary concrete (qualitative selection, based on what the author observes):

  • Alignment research at Anthropic and the MATS program, hiring for reward hacking analysis and work on training dynamics.
  • Formal verification specialists, $70-150 per hour through Alignerr, requiring Lean 4 and Mathlib fluency.
  • ML theory: a thin field with strong appetite for interpretability and generalization work, where the absence of a settled mathematical framework is itself the opportunity.
  • Quantitative finance: stochastic modeling combined with machine learning, sustained demand for quants who can do both.
  • Operations research: the work of formulating problems. Defining variables, constraints, and the objective function. That formulation step is exactly what the AI cannot do without a human who knows what is being optimized.

These are domains where rigor and proof cannot be automated, and where AI amplifies the mathematician without substituting for the mathematician’s judgment. The bifurcation in the labor data and the cliff in the capability data point to the same gap.

The smell test, formalized

Mathematicians have a name for the missing skill, though it does not appear in any AI lab’s deck. Scott Aaronson’s 2008 blog post Ten Signs a Claimed Mathematical Breakthrough is Wrong lists ten heuristics experienced mathematicians apply, often subconsciously, to detect bad arguments well before line by line verification:

  1. The authors do not use TeX.
  2. The authors do not understand the question.
  3. The approach implies something much stronger and probably false.
  4. The approach contradicts a known impossibility result.
  5. The authors quietly switch from “we prove” to “seems to work in all cases we tried” partway through.
  6. The paper jumps to technicalities without a new idea.
  7. The paper does not build on prior work.
  8. The paper wastes space defining standard terms.
  9. The paper waxes poetic about “practical consequences,” “profound philosophical implications,” etc.
  10. The techniques look too weak for the problem.

Klowden and Tao §4.2 call this the “smell test” and argue it is the layer of mathematical practice most resistant to current AI architectures. The argument is mechanical, not magical. LLMs are trained to produce outputs that look correct, which is exactly the failure mode the smell test was constructed to detect. Several of Aaronson’s signs (the weasel words at #5, the technicalities without idea at #6, the rhetoric about implications at #9) are patterns that the current models exhibit by construction. Detecting them takes three things:

  • Deep familiarity with prior work.
  • Intuition for which kinds of claims tend to be true.
  • A feel for how mathematicians react to a new claim.

None of these are reliably produced by current LLMs. This explains the cliff’s location. The cliff is not arbitrary. It tracks the boundary between problems where the smell test is unnecessary (final answer arithmetic at AIME level) and problems where the smell test is the skill that holds the work up (research grade proof verification, novel claim assessment, paper refereeing). Scaling raw model capability has not closed this boundary, because current architectures were not designed to address it.

The current toolkit

The frontier of “AI in math research” is concrete, not hypothetical. November 2025: Terence Tao published a public write up describing a year of collaboration with Google DeepMind. AlphaEvolve plus Gemini Deep Think were applied to 67 open mathematical problems across combinatorics, analysis, and number theory, producing roughly 20 genuinely new results, with a subset of proofs passing formal verification in Lean 4. The workflow adopted was the following:

flowchart LR
    A[AlphaEvolve
discovers candidate] --> B[Deep Think
sketches proof] B --> C[AlphaProof
formalizes in Lean] C --> D[Lean 4
verified]

Tao selects and evaluates the problems at the start of the loop and judges the resulting insight at the end. The middle is automated. The boundaries are not.

The same pattern at a more accessible scale: in August 2025, Ernest Ryu (Seoul National University, formerly UCLA) used GPT-5 as an assistant and improved by about 50% the best known bound on a 40-year open problem on operator splitting and first-order methods in convex optimization. The result was published as a rigorous note, not just a heuristic.

Same pattern: the human picks the problem and judges the insight; the machine works on the intermediate steps.

The infrastructure is open and documented. Lean 4 with Mathlib carries roughly 1.9 million lines of formally verified mathematics. LeanCopilot from Caltech suggests Lean tactics during proofs. DeepSeek-Prover-V2 splits problems into smaller ones (divide and conquer). Kimina-Prover from Moonshot was the first prover to clear 80% on MiniF2F. None of this requires special access.

Gated-access systems extend the ceiling: AlphaProof, AlphaEvolve, and the Claude Mythos preview in Anthropic’s Project Glasswing (which reported 97.6% on USAMO 2026, with no independently graded score published yet). Even so, the public toolchain already serves as a capable assistant for a mathematician doing research-grade work.

A second point worth mentioning, especially for those who don’t write code. Andrej Karpathy’s “Software 3.0” framing from June 2025 names a new mode: programs humans describe in natural language.

  • Software 1.0: code humans write.
  • Software 2.0: models humans train.
  • Software 3.0: software produced by description in natural language.

Trent Field, who does not write code professionally, works at Arkance and presented at Autodesk University 2025 a working web application he built solo in 48 hours using Cursor. The lawyer, the cardiologist, and the mathematicians proving new results in collaboration with AI are not exceptions; they mark the same pattern: AI has crossed from dissemination into creation. The arrangement only works as long as the human can verify what the AI produces.

The operating rule

Klowden and Tao §6.2 propose a rule with vocabulary borrowed from cybersecurity. AI is relatively safe to use as a red team (reviewing content generated by humans for errors or suggested improvements) and unsafe to trust as a blue team (generating structural content beyond what a red team can verify). The red team can be human reviewers, formal proof assistants like Lean, or other AI systems running in a verification capacity. What the blue team produces must remain inside what the red team can check.

The rule is easy to state and easy to violate. While drafting the talk that originated this article, the author asked an AI to write a slide explaining the K&T framing. It produced “AI is potent on the blue team, the red team is a human function,” which mixes up the concepts in K&T’s thesis. The author noticed the confusion on first reading. Two further rounds of feedback to the AI failed to surface it; the wrong slide kept coming back. K&T’s thesis demonstrated itself: the smell test is the layer no current model reliably has, and it is what produced the corrected paragraph.

A second instance of the same effect surfaced after the talk. A graduate student who attended pointed out that the slides used em-dashes liberally, a tell of AI-generated content. He was not checking the math. He was applying the smell test to the medium and reading the signature on the surface.

The operating rule, then, in two sentences:

  1. Let AI verify your work, not generate work you can’t verify.

  2. When verifying AI work, apply your mathematical smell test to invalidate arguments; the model doesn’t have it.

Limitations

The benchmark scores in this text are a snapshot up to April 18, 2026 (GPT-5.5, released April 23, and DeepSeek V4, in preview since April 24, are therefore outside this analysis). That said, it’s important to acknowledge that such numbers go out of date quickly; expect the specific figures to be out of date within weeks. Tao’s own public characterization of AI for research moved across eighteen months:

Tao's 18-month language shift on frontier model capability: September 2024 'a mediocre, but not completely incompetent, graduate student', November 2025 collaboration with DeepMind producing ~20 new results across 67 open problems, March 2026 'ready for primetime' at OpenAI Academy / IPAM.

The rate of motion is not slowing. The release of GPT-5.4 to GPT-5.5, for example, took roughly seven weeks (March 5 to April 23, 2026), and FrontierMath Tier 4 jumped from 38% to 39.6% (GPT-5.5 Pro) in that interval.

The labor data presented in this text is heavily concentrated in the US. The Federal Reserve null result and the Klarna reversal show real heterogeneity in how AI adoption translates to headcount. The mathematical edge framing presumes the smell test stays out of reach for current architectures. That is a claim about how today’s models are trained, not a prediction about what a future model will be able to do. If the architecture changes, the structural argument has to be reconsidered.

A final caveat. The corporate signaling cases (Salesforce, Shopify, Microsoft, Amazon) carry narrative weight and financial incentives in addition to hiring intent. CEO public statements are written for investors first; whether they reflect operational reality is a separate question. The two recurring gaps (when not chasms) are:

  1. Deliberately misstated claims;
  2. Selective framing.

Both typically serve to ride the narrative of the moment. Selective framing, in particular, shows up in efficiency gains attributed to AI when other cost programs were the bigger driver, or workforce changes attributed to automation when the cuts were already planned regardless of AI.

Read between the lines. These companies’ cash flow is more informative than their CEOs’ statements, as Michael Burry shows in his blog Cassandra Unchained.

Synthesis

The capability cliff and the labor bifurcation point to the same thing. Where the work is verifiable end to end (final answer benchmarks, engineering tasks with well-defined scope), AI has commoditized the supply. Where verification depends on the smell test (research grade proofs, novel claims, problem framing), the human role is structural: the scarce resource no one has commoditized.

The operational rule is short: let AI verify your work, and do not let it generate work you cannot verify. The skill behind that rule is the smell test, the mathematician’s feel, trained on years of real mathematical content. That is the part the labor data and the benchmarks both keep revealing as scarce.


Further reading

Talk slides

Primary sources

Capability benchmarks (math)

Labor and economic data

Industry signal

Tools and toolkit