When Code Becomes a Commodity: The Mathematician's Edge // Edson Cilos

In February 2026, a lawyer won the Anthropic Hackathon. A cardiologist took third with an AI care platform built in seven days. Thirteen thousand applicants competed; neither winner wrote code. Over the same window, the US Bureau of Labor Statistics reports programmer roles down 27.5% from 2023 to 2025, while Levels.fyi tracks the new “AI Engineer” specialty paying an 18.7% staff premium and named by LinkedIn the role growing fastest for the second year running.

If software is becoming cheap, what stays scarce?

This piece works through three sources of evidence: AI capability benchmarks in mathematics, US labor market data on engineering roles, and a 2026 paper by Klowden and Tao on the structure of human work alongside AI. The numbers reflect an April 2026 snapshot. The argument is structural, not predictive: the boundary where AI stops working has a shape, and that shape rewards a specific kind of human judgment.

Based on a talk delivered at the Mathematics Department of UFSC on 24 April 2026 to math PhD and master’s students. Slides: English (PDF) · Português (PDF).

Key terms

IMO (International Mathematical Olympiad): annual high school competition; standard reference benchmark for AI mathematical reasoning.
FrontierMath Tier 4 (Epoch AI benchmark): research grade mathematics problems designed to remain unsaturated by current models.
Lean 4 / Mathlib: a formal proof language and its core library; Mathlib carries roughly 1.9 million lines of verified mathematics maintained by 500-plus contributors.
Smell test (Klowden & Tao §4.2): the skepticism mathematicians apply before reading a proof line by line, used to detect bad arguments early.
Odorless proof (K&T §4.6 n16): a formally correct output that lacks the causal narrative humans use to judge mathematical work.
Blue team / red team (K&T §6.2): generation versus verification roles, borrowed from cybersecurity. K&T argue AI is relatively safe on the red team, unsafe on the blue.
Software 3.0 (Karpathy, June 2025): software produced by description in natural language, distinct from code humans write (1.0) or models humans train (2.0).

Why this time is different

Past automations affected the periphery of intellectual work: the printing press, the typewriter, LaTeX, automated dissemination. They changed how ideas were copied, transmitted, and rendered. The work of generating ideas, of deciding what to write, stayed human. Klowden and Tao open their 2026 paper with this distinction. Modern AI is the first general purpose technology that automates the creation step itself, not just the dissemination around it.

The point matters for what follows. If a tool only automates dissemination, the question “what is scarce” stays trivially answered: the production of new ideas. Once the tool starts producing outputs that look like ideas (proofs, drafts, code, diagrams), the question becomes structural. Rhetoric will not answer it. The honest place to look is where the tool stops working.

Two benchmark surfaces, very different scores

By April 2026, leading models have effectively saturated competition mathematics and stalled on research grade work.

Benchmark	Type	Frontier score (April 2026)
MATH-500	High school + early college	95-99% (Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, DeepSeek-R1)
AIME 2025	US olympiad qualifier	100% with code execution; 99.8% without
GSM-8K	Grade school word problems	~100%
IMO 2025	International Olympiad	35/42, gold medal level (Gemini Deep Think, certified by coordinators)
FrontierMath Tier 4	Research grade	~38% (GPT-5.4 Pro, March 2026; up from 19% five months earlier)
FormalProofBench (Lean 4)	Verified proofs	~33.5% (frontier model)
Open Proof Corpus	Olympiad proofs, graded by humans	~43% pass full human inspection

What it shows: a roughly 60 percentage point gap between problems where only the final answer matters and problems where the full reasoning matters. The gap is closing on the easier side and persisting on the harder one.

The IMO is the cleanest single illustration. In 2024, DeepMind’s AlphaProof scored 28/42 using formal Lean 4 proofs and several days of compute per problem. In 2025, Gemini Deep Think scored 35/42 in natural language inside the 4.5 hour exam window, certified by IMO coordinators.

Capability cliff: competition benchmarks (GSM8K, MATH-500, AIME 2025, Putnam 2025) sit near 100% while research-grade benchmarks (FormalProofBench, FrontierMath Tier 4) sit at 33-38%. The vertical drop is roughly 62 percentage points. — *Putnam 2025: AxiomProver reached 12/12 with post-window compute; 8/12 within the 6-hour competition window.

One year between a specialist in formal language and a general purpose reasoner at gold medal level. The same pattern holds across the standard benchmarks: MATH-500, AIME 2025, GSM-8K. A few years ago these were comfortably out of reach. Today they no longer differentiate between the leading models. They are silent.

If you stop at the press releases, the story sounds triumphant. The interesting work starts below the cliff.

What the cliff is made of

The Open Proof Corpus, a 2025 benchmark of 5,062 olympiad proofs evaluated by thirteen human graders across six frontier models, isolates the structure of the gap. Producing a correct final answer outperforms producing a correct full proof by roughly 30 percentage points across the board. Aggregated, only about 43% of proofs produced by models pass human inspection. The 2026 FormalProofBench update confirms the gap persists: the best model reaches 33.5% on formal Lean proofs while the same model approaches 100% on final answer benchmarks. The cliff is not closing on the side that matters for research.

Three failure modes recur in the recent literature:

A single hallucinated step invalidates an entire chain of derivation.
Working memory collapses around 20 to 30 parallel branches regardless of model scale; the ceiling does not move with parameter count.
Apple’s research group documented “complete accuracy collapse beyond certain complexities” in 2025: reasoning effort grows for a while, then declines, as if the system gives up.

Klowden and Tao name the deepest version of this failure “odorless proofs.” AlphaProof’s IMO 2024 solutions were verified correct by Lean. They also contained redundant steps and inexplicable jumps. The proofs were formally right without being legibly right: they lacked what Thurston (1994/2006) identified as the mark of good mathematics, a causal narrative that lets a reader see why the conclusion follows, not just that it does. The system passes the verifier and fails the smell test the mathematician would apply before opening Lean.

That distinction sets up the rest of the argument. The labor market is registering the same boundary from a different angle.

The labor data, read alongside the cliff

A note on framing. This analysis reads the software engineering job market through eight years of industry experience, including machine learning consulting and quantitative finance. The mathematical material is closer to home than the early empirical material; the bias is flagged because the corporate signal cases below skew toward what the author has personally observed.

The empirical record on junior software engineering is now thick.

Source	Population	Finding
Stanford Digital Economy Lab (ADP)	US developers 22-25	-20% employment from late-2022 peak
Harvard / Revelio Labs	62M workers, 285K firms	Junior employment in firms that adopt GenAI -7.7% after 6 quarters vs. controls
Northeastern (Lightcast)	US vacancies	-16.3% relative junior to senior posting ratio
Institute of Student Employers (UK)	Entry level tech	-46% over 2024
NY Fed College Labor Market	CS graduates	unemployment ~6.1%
Levels.fyi	“AI Engineer” specialty	+18.7% staff compensation premium
US BLS	Software developers (aggregate)	Projected +15% growth through 2034
LeadDev AI Impact Report (2025)	Engineering leaders	54% expect fewer junior hires over the long term

What it shows: bifurcation, not extinction. The programmer role focused on execution is shrinking; the design oriented role that collaborates with AI is expanding. Aggregate growth and the collapse at entry level are both true at once. They describe two different jobs that happen to share one historical label.

Corporate signaling lines up with the data, though the signal is mixed with narrative. Salesforce announced zero new engineer hires for fiscal year 2026. Shopify added a policy requiring written justification that AI cannot do the work before any new headcount can be approved. Microsoft attributed roughly 6,000 layoffs in May 2025 partly to internal AI productivity, citing 20 to 30% of new code being written by AI. Amazon announced 14,000 corporate cuts in October 2025 with AI cited.

A counter finding tempers the picture. The Federal Reserve, in a March 2026 study covering more than a million firms, found no statistically significant link between AI adoption and aggregate hiring. Klarna, the headline case for AI driven workforce reduction, cut headcount from roughly 5,500 to 3,000 while pushing AI, watched revenue per employee multiply about fourfold from $300K to $1.3M, and then partially reversed in May 2025, rehiring humans after the quality of customer facing work fell.

In summary, the labor data does not yet support strong claims in either direction. Where AI substitutes for human work, where it does not, and how much of the recent shift can be attributed to AI specifically are open questions the evidence has not settled. One personal data point, offered as such. A treasury role the author previously led with two or three colleagues is, today, feasible alone with AI. In a separate context, the author currently works as a solo risk scientist who is also the project’s main engineer. It is the kind of perception that macro evidence will eventually have to confirm or rule out.

Where the bottleneck moved

McKinsey’s 2025 State of AI report finds that only 39% of enterprises that adopted generative AI report measurable EBIT (earnings before interest and taxes) impact. IBM’s institute reports that one in four AI projects delivers promised returns. Faktion’s June 2025 essay diagnoses the recurring failure mode: “knowledge is scattered across documentation, legacy systems, and most critically, in the heads of a few seasoned experts.” Boris Cherny, head of Claude Code at Anthropic, on Lenny’s Podcast in February 2026: “the core problem now is knowing what the hell you actually want to build.”

Stack-flip diagram: before AI, the bottleneck was Engineering Capacity (with Domain Knowledge and System Design above); after AI, Engineering Capacity is commoditized and the bottleneck moves up to Domain Knowledge.

The constraint has shifted from engineering capacity to problem framing, domain knowledge, and the ability to say what success looks like in a setting the AI cannot infer. The visible employer demand follows that shift:

Alignment research at Anthropic and the MATS program, hiring for reward hacking analysis and work on training dynamics.
Formal verification specialists, $70-150 per hour through Alignerr, requiring Lean 4 and Mathlib fluency.
ML theory: a thin field with strong appetite for interpretability and generalization work, where the absence of a settled mathematical framework is itself the opportunity.
Quantitative finance: stochastic modeling combined with machine learning, sustained demand for quants who can do both.
Operations research: the work of formulating problems. Defining variables, constraints, and the objective function. That formulation step is exactly what the AI cannot do without a human who knows what is being optimized.

These are domains where rigor and proof cannot be automated, and where AI amplifies the mathematician without substituting for the mathematician’s judgment. The bifurcation in the labor data and the cliff in the capability data point to the same gap.

The smell test, formalized

Mathematicians have a name for the missing skill, though it does not appear in any AI lab’s deck. Scott Aaronson’s 2008 blog post Ten Signs a Claimed Mathematical Breakthrough is Wrong lists ten heuristics experienced mathematicians apply, often subconsciously, to detect bad arguments well before line by line verification:

The authors do not use TeX.
The authors do not understand the question.
The approach implies something much stronger and probably false.
The approach contradicts a known impossibility result.
The authors quietly switch from “we prove” to “seems to work in all cases we tried” partway through.
The paper jumps to technicalities without a new idea.
The paper does not build on prior work.
The paper wastes space defining standard terms.
The paper waxes poetic about practical implications.
The techniques look too weak for the problem.

Klowden and Tao §4.2 call this the “smell test” and argue it is the layer of mathematical practice most resistant to current model architectures. The argument is mechanical, not mystical. LLMs are trained to produce outputs that look correct, which is exactly the failure mode the smell test was built to catch. Several of Aaronson’s signs (the weasel words at #5, the technicalities without idea at #6, the rhetoric about implications at #9) are patterns current models exhibit by construction. Catching them takes three things:

Deep familiarity with prior work.
Intuition for which kinds of claims tend to be true.
A feel for how mathematical communities react to assertions.

None of these are reliably produced by next token training. This explains the cliff’s location. The cliff is not arbitrary. It tracks the boundary between problems where the smell test is unnecessary (final answer arithmetic at AIME level) and problems where the smell test is the skill that holds the work up (research grade proof verification, novel claim assessment, paper refereeing). Scaling raw capability has not closed that boundary, and the architecture being scaled does not target it.

The current toolkit

The frontier of “AI in math research” is concrete, not hypothetical. November 2025: Terence Tao published a public write up describing a year of collaboration with Google DeepMind. AlphaEvolve plus Gemini Deep Think were applied to 67 open mathematical problems across combinatorics, analysis, and number theory, producing roughly 20 genuinely new results, with a subset of proofs passing formal verification in Lean 4. The pipeline is concrete:

flowchart LR
    A[AlphaEvolve
discovers candidate] --> B[Deep Think
sketches proof]
    B --> C[AlphaProof
formalizes in Lean]
    C --> D[Lean 4
verified]

Tao selects and evaluates the problems at the start of the loop and judges the resulting insight at the end. The middle is automated. The boundaries are not.

A more accessible version of the same shift: in August 2025, Ernest Ryu (Seoul National University and UCLA) used GPT-5 as a research collaborator and improved a forty year old bound in operator splitting and first-order convex optimization by roughly 50%, published as a rigorous note rather than a heuristic. Same pattern. The human picks the problem and judges the insight; the machine accelerates the middle.

The infrastructure is open and documented. Lean 4 with Mathlib carries roughly 1.9 million lines of formally verified mathematics. LeanCopilot from Caltech suggests Lean tactics during proofs. DeepSeek-Prover-V2 decomposes problems into subgoals. Kimina-Prover from Moonshot was the first prover to clear 80% on MiniF2F. None of this requires special access. The frontier only systems (AlphaProof, AlphaEvolve, the gated Claude Mythos preview that scored 97.6% on USAMO 2026 inside Anthropic’s Project Glasswing) extend the ceiling. The public toolchain is enough to make a research contribution today.

A secondary shock matters here, especially for people who do not write code. Andrej Karpathy’s “Software 3.0” framing from June 2025 names a new mode: programs humans describe in natural language.

Software 1.0: code humans write.
Software 2.0: models humans train.
Software 3.0: software produced by description in natural language.

Trent Field, who does not write code professionally, works at Arkance and presented at Autodesk University 2025 a working web application he built solo in 48 hours using Cursor. The lawyer and cardiologist from the opening are not exceptions. They are the early end of a pattern.

The operating rule

Klowden and Tao §6.2 propose a rule borrowed from cybersecurity. AI is relatively safe to use as a red team (reviewing content generated by humans for errors or suggested improvements) and unsafe to trust as a blue team (generating structural content beyond what a red team can verify). The red team can be human reviewers, formal proof assistants like Lean, or other AI systems running in a verification capacity. What the blue team produces must remain inside what the red team can check.

The rule is easy to state and easy to violate. While drafting the source talk for this analysis, the author asked an AI to write a slide explaining the K&T framing. It produced “AI is potent on the blue team, the red team is a human function,” which inverts the paper’s claim. The author caught the inversion on first read. Two further rounds of feedback to the AI failed to surface it; the wrong slide kept coming back. K&T’s thesis demonstrated itself: the smell test is the layer no current model reliably has, and it is what produced the corrected paragraph.

A second instance of the same effect surfaced after the talk. A graduate student who attended pointed out that the slides used em-dashes liberally, a tell for prose authored by AI. He was not checking the math. He was applying the smell test to the medium and reading the signature on the surface.

The operating rule, then, in two sentences. Let AI verify your work, not generate work you cannot verify. The skepticism that comes before verification is the part the workflow has to supply, because the model will not.

Limitations

The benchmark numbers in this piece reflect April 2026. AI capability claims age fast; expect the specific figures to be out of date within months (or weeks). Tao’s own public characterization of leading models moved across eighteen months:

Tao's 18-month language shift on frontier model capability: September 2024 'a mediocre, but not completely incompetent, graduate student', November 2025 collaboration with DeepMind producing ~20 new results across 67 open problems, March 2026 'ready for primetime' at OpenAI Academy / IPAM.

The rate of motion is not slowing.

The labor data is heavily weighted to the US. The Federal Reserve null result and the Klarna reversal show real heterogeneity in how AI adoption translates to headcount. The mathematical edge framing presumes the smell test stays out of reach for current architectures. That is a claim about how today’s models are trained, not a prediction about what a future model could do. If the architecture changes, the structural argument has to be reconsidered; the labor implications would shift before the capability evidence catches up.

A final caveat. The corporate signaling cases (Salesforce, Shopify, Microsoft, Amazon) carry narrative weight and stock price incentives in addition to hiring intent. CEO public statements are written for investors first; whether they reflect operational reality is a separate empirical question. The two recurring gaps are deliberate misstatement and selective framing. Selective framing looks like efficiency gains attributed to AI when other cost programs were the larger driver, or headcount changes attributed to automation when attrition was already scheduled. Read between the lines. A good way to do that is to follow the cash flow of these companies, as Michael Burry does on his Cassandra Unchained blog.

Takeaway

The capability cliff and the labor bifurcation point to the same thing. Where the work is verifiable end to end (final answer benchmarks, well scoped engineering tasks), AI has commoditized the supply. Where verification depends on the smell test (research grade proofs, novel claims, problem framing), the human role is structural, not transitional.

The operational rule is short. Let AI verify your work, and do not let it generate work you cannot verify. The skill behind that rule is the skepticism a mathematician applies before reaching for verification tools, trained on years of real mathematical content. That is the part the labor data and the benchmarks both keep revealing as scarce.