The fundamental technology of an LLM can be accurately described as “autocomplete”: it predicts the most likely next word in a sequence, based on the patterns in its training materials. This turns out to be a highly effective way to convey information in many contexts, corresponding well to a verbal stream of consciousness, but it essentially answers every question on a first-impression basis. Early LLMs therefore could not explain their own logic and were prone to hallucinations, which posed severe challenges to widespread business adoption outside of novelty applications.
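To make the “autocomplete” idea concrete, the toy sketch below picks the single most probable next word from a hand-written probability table. The table and the predict_next function are purely illustrative stand-ins for what a trained neural network would compute over a huge vocabulary.

```python
# A toy illustration of next-token prediction, the "autocomplete" at the core
# of an LLM. The probability table below is hand-written; a real model would
# produce this distribution with a neural network trained on vast text corpora.

toy_model = {
    ("the", "capital", "of", "france", "is"): {"paris": 0.92, "a": 0.05, "lyon": 0.03},
}

def predict_next(context):
    """Return the single most probable next token -- a 'first impression' answer."""
    distribution = toy_model[context]
    return max(distribution, key=distribution.get)

print(predict_next(("the", "capital", "of", "france", "is")))  # -> paris
```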
From their elementary years onward, human students are taught study techniques precisely so that they do not rely solely on their first impressions. They are expected to work through a problem step by step before writing out their answer, and to make proper use of tools like calculators to eliminate busy work while ensuring accuracy. This is how a new generation of “reasoning models” has solved what statisticians call the out-of-sample problem: memorizing the training data while being unable to generalize that knowledge to new contexts.
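As a rough illustration of the calculator half of that analogy, the hypothetical sketch below shows a reasoning step handing arithmetic off to a tool instead of guessing. The calculator function and the tool-call format are invented for this example and do not correspond to any vendor’s actual interface.

```python
# A hypothetical sketch of tool use: rather than answering arithmetic "on first
# impression", the model emits a structured tool call and a calculator returns
# an exact result. The tool-call format and function names are invented here.

def calculator(expression):
    """A deliberately tiny 'tool': evaluate a plain arithmetic expression."""
    allowed = set("0123456789+-*/(). ")
    if not set(expression) <= allowed:
        raise ValueError("unsupported expression")
    return eval(expression)  # acceptable here because input is restricted to arithmetic

# An intermediate reasoning step might decide to delegate the busy work:
tool_call = {"tool": "calculator", "input": "1234 * 5678"}
result = calculator(tool_call["input"])
print(result)  # 7006652 -- exact, rather than a plausible-looking guess
```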
Mis-optimization
For some time, it was assumed that essentially all problems could be resolved with sufficient scaling of pre-training (absorbing as much of the internet’s data as possible, prior to the more refined fine-tuning process). A 2020 paper by several OpenAI authors, “Scaling Laws for Neural Language Models,” demonstrated performance improvements across several orders of magnitude of compute; at the time, that meant solely training compute, which was the focus of optimization. By the end of 2024, however, a sense of despair was growing in Silicon Valley that pre-training scaling was finished, in part because the entire internet had already been absorbed. A GPT-5 model has yet to be released.
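To show the kind of relationship the paper describes, the sketch below assumes a generic power-law form, loss = (C_ref / C) ** alpha, with placeholder constants chosen only to illustrate the shape of the curve; the exponent and reference compute are not the paper’s fitted values.

```python
# Illustrative power-law curve of the kind reported in "Scaling Laws for Neural
# Language Models" (Kaplan et al., 2020): loss falls as a power of training
# compute. The reference compute and exponent below are placeholders chosen to
# show the shape of the curve, not the paper's fitted constants.

def loss(compute, c_ref=1.0, alpha=0.05):
    """Loss as a power law in training compute (arbitrary units)."""
    return (c_ref / compute) ** alpha

for order in range(7):
    c = 10.0 ** order
    print(f"compute = 1e{order:d}  ->  loss = {loss(c):.3f}")
# Each additional 10x of compute buys only a modest, predictable improvement,
# which is one reason "just scale pre-training" eventually collides with the
# finite supply of internet text.
```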
Two major developments have occurred since then. The first was OpenAI’s development of its o1 and o3 reasoning models, which appear to have substantially solved the logic problem. o3 can solve (at a 25% success rate) math problems that, as late as November 2024, the Fields Medal-winning mathematician Terence Tao predicted would take years to automate. It achieves this through “chain of thought” reasoning, sometimes thinking for several minutes before providing an answer, much like the student who shows their work before giving the final answer. During execution, the system may also farm out parts of the task to separate model calls, a pattern that could be called agents.
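OpenAI has not published the internals of o1 or o3, so the following is only a sketch of the general principle of spending more inference-time compute per question: sample many independent reasoning attempts and keep the majority answer (the “self-consistency” idea from the research literature). The sample_reasoning_chain function is a hypothetical placeholder for a real model call.

```python
# OpenAI has not published o1/o3 internals; this sketch only illustrates the
# general principle of trading extra inference-time compute for reliability.
# `sample_reasoning_chain` is a hypothetical stand-in for one full
# chain-of-thought attempt by a model; the final answer is a majority vote
# over many independent attempts ("self-consistency" in the research literature).

import random
from collections import Counter

def sample_reasoning_chain(question):
    """Placeholder for a single chain-of-thought attempt by an LLM."""
    # Simulate an unreliable single attempt: correct roughly 70% of the time.
    return "472" if random.random() < 0.7 else str(random.randint(0, 999))

def answer_with_more_compute(question, attempts=32):
    """More attempts means more inference compute and a more reliable answer."""
    votes = Counter(sample_reasoning_chain(question) for _ in range(attempts))
    return votes.most_common(1)[0][0]

print(answer_with_more_compute("What is 8 * 59?"))  # almost always "472"
```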
This shift to inference-time (also called test-time) compute scaling caught the entire industry off guard, including regulators. Definitions of “foundation models” across jurisdictions refer to the number of parameters in the model; American regulators in particular have attempted to set specific thresholds that constitute potential safety risks, such as the 1 billion parameters in the proposed AI Foundation Model Transparency Act of 2023. EU regulations such as the AI Act, in contrast, have put more emphasis on the functions of a model, while still mentioning size as one factor.
Agents are what you make of them
The second surprise came with the January 2025 release of R1, another reasoning model with significantly lighter computational requirements, by the Chinese AI competitor DeepSeek. Such algorithmic improvements call into question the relevance of the scaling laws in the first place, but from the user’s perspective they mean that performance will only be better than the scaling laws predict, whether for training or inference. Like o1 and o3, R1 makes use of chain-of-thought reasoning, but its main innovation there is simply in making the reasoning process more transparent. In other words, again from the user’s perspective, it is “agentic” in a way that other models already were but attempted to conceal.
In a December blog post, OpenAI CEO Sam Altman predicted that 2025 would be the year agentic AI sees real-world use. In the end, however, it may be the user experience that defines what this means, more than any specific architecture. And although practical applications of computing innovations often take longer than first anticipated, it is worth noting that reasoning models are good at more than just math.
Specifically regarding the financial sector, one could imagine that, with a moderate inference budget, it will become possible to design statistically rigorous trading bots at the push of a button, complete with access to online news. Financial institutions will take advantage of this, but in the longer term the main beneficiaries may be individual traders, who never previously had such capabilities, opening up possibilities for regulatory arbitrage.
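As a purely hypothetical illustration of what such replicable, model-generated strategy code might look like, the sketch below implements a bare-bones moving-average crossover signal. The function names and the made-up price series are invented for this example, and a real system would of course need live data feeds, backtesting, and risk controls.

```python
# Hypothetical sketch of the kind of simple, replicable strategy code a model
# might generate on request: a moving-average crossover signal. Illustration
# only -- not investment advice, and not a description of any real system; a
# deployable version would need live data feeds, backtesting and risk controls.

def moving_average(prices, window):
    """Average of the most recent `window` prices."""
    return sum(prices[-window:]) / window

def signal(prices, short=5, long=20):
    """'buy' when the short-term average is above the long-term one, else 'sell'."""
    if len(prices) < long:
        return "hold"
    return "buy" if moving_average(prices, short) > moving_average(prices, long) else "sell"

# Example with a made-up, gently rising price series:
history = [100 + 0.3 * i for i in range(30)]
print(signal(history))  # -> buy
```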
Formal institutions will still be regulated largely as before, especially if actual decisions are being made by replicable code output by models; individuals, however, must rely on AI safety laws to ensure that their models act in line with their intentions. In that regard, the EU regulatory approach appears more robust than the American one, even if it potentially also covers some conventional software.