
Large Language Models and the Second Half of AI: Reinforcement Learning Meets Reasoning

· SecondHalfAI, LargeLanguageModels, ReinforcementLearning, ReasoningAgents, AIUtility

Foreword.

Imagine opening a shipping dashboard and watching a single AI agent juggle everything at once: plotting fuel-saving routes around a typhoon, haggling spot-rate discounts in plain email, and rewriting a maintenance checklist for the dock robots before the next container even lands. That scene is not a demo – it is Maersk’s new control tower in Singapore, where a lightly fine-tuned language model, taught to “think out loud” and nudged by a few hours of reinforcement learning, has already cut voyage costs by 17 percent. The point of the article below is to explain how we got from game-beating one-trick AIs to this kind of broadly competent, real-world problem solver, why it matters for every industry that still runs on spreadsheets and gut instinct, and how the same feedback loop of priors, reasoning, and tiny RL tweaks could snowball into the long-predicted singularity well before 2030.

Introduction

Artificial intelligence research is undergoing a pivotal transition, often described as moving from the “first half” to the “second half” of AI. In the first half, progress was driven largely by new algorithms, model architectures, and ever-harder benchmarks. Landmark breakthroughs, such as deep convolutional networks, Transformers, and superhuman game-playing agents, were achieved by training specialized models to hill-climb specific benchmarks. However, these achievements often came in domain-specific silos; systems like AlphaGo, AlphaStar, or OpenAI’s Dota agent excelled in their respective games yet could not transfer their prowess outside narrow environments. Today, a new paradigm is emerging: generalist agents that leverage large language model (LLM) priors, explicit reasoning, and lightweight reinforcement learning (RL) fine-tuning to solve a wide range of tasks. This approach has yielded striking breakthroughs in previously disparate domains, from code generation and math problem-solving to web navigation and computer control.

As this second half of AI unfolds, the focus is shifting from creating novel algorithms to defining meaningful tasks and outcomes. Evaluation and utility are becoming as important as training: instead of merely asking “Can we train a model to solve benchmark X?”, researchers and industry leaders now ask “What should we train AI to do, and how do we measure real progress?” This article provides a scientific overview of these developments. We review the shift from brittle, task-specific RL to robust RL augmented by LLM priors; the role of reasoning steps in expanding action spaces; and the emerging recipe of LLM + thinking + short RL fine-tuning for building general-purpose agents. We then examine why superhuman performance on exams and games has not yet translated into proportional economic impact, highlighting the need for real-world tasks where AI-driven metric improvements yield tangible business value. Finally, we discuss how these trends are reshaping the AI startup landscape, moving from algorithm-centric innovation toward product design and economic utility, and consider implications for the future of AI R&D, where progress may be slower and more reliability-focused, yet better aligned with human needs.

From Brittle Domain-Specific RL to Generalizable RL with LLM Priors

Early successes in deep reinforcement learning were environment-specific and brittle. Iconic examples include DeepMind’s DQN agent mastering Atari games and AlphaGo’s triumph in Go; these systems learned tabula rasa (from scratch) in a single domain and could not generalize their skills beyond it. Each new domain (chess, StarCraft, robotic manipulation, etc.) required starting over with new training runs and often new techniques, yielding superhuman specialists rather than general problem-solvers. OpenAI’s efforts from 2016–2018 exemplified this: they introduced the Gym and Universe platforms to standardize environments and applied deep RL to excel at tasks like Dota 2 and robotic hand manipulation, yet fell short of broader goals such as general computer use or web navigation. Crucially, agents trained on one game or domain did not transfer to others. Researchers began to recognize that something fundamental was missing from the standard RL formula of “algorithm + environment.”

That missing piece turned out to be prior knowledge in the form of pretrained large models. By 2019–2020, the advent of large language models suggested a way to impart general knowledge and common sense to agents. Shunyu Yao and colleagues demonstrated one of the first such agents, CALM, which fine-tuned a GPT-2 model to play text-based adventure games. While CALM showed it was possible to imbue an agent with a language model’s knowledge, it still required millions of RL steps to adapt to even a single game and did not generalize to new games. The real breakthrough came with larger-scale language pretraining: GPT-3 (2020) and its successors proved to be extremely powerful priors, encapsulating vast world knowledge and linguistic competence. Researchers found that taking a model like GPT-3 and then applying a relatively small amount of additional RL could yield generally competent agents. For example, OpenAI’s WebGPT project fine-tuned GPT-3 with RL to control a web browser, producing an agent that can search the internet and answer questions. Likewise, ChatGPT was created by starting from a GPT foundation and using RL from human feedback to fine-tune interactive behavior, effectively turning a pretrained language model into a conversational agent. These successes validated that powerful priors from unsupervised pre-training are key to RL that generalizes. As Yao puts it, “the most important part of RL might not even be the RL algorithm or environment, but the priors”. By incorporating knowledge learned from the entire internet, an agent gains a form of common sense and flexibility unattainable by training solely within a narrow simulated world.

One striking illustration comes from a recent study on web-based task solving. Researchers combined a language model (T5) with a reinforcement learning agent to navigate websites and complete tasks (e.g. form-filling). In ablation tests, they found the agent was heavily dependent on the language model’s outputs; if the language model’s contribution was removed, the system’s performance collapsed entirely. In other words, without the LLM’s prior knowledge guiding its actions, the RL agent failed, underscoring how vital such priors have become for robust performance in complex, knowledge-rich environments. With LLMs as a foundation, RL agents are no longer learning from scratch; they start with a strong base of world knowledge and linguistic reasoning. This has led to transferable and more robust RL: an agent can quickly adapt to new tasks by leveraging its pretrained knowledge, needing only light additional training or prompting for the specifics. For instance, DeepMind’s Gato model (2022) epitomizes this trend by using a single transformer network (pretrained on text, images, and agent experience) to perform hundreds of different tasks – playing Atari games, captioning images, controlling a robot arm, and more – achieving reasonable proficiency across all with one model. While Gato did not require explicit reasoning steps, it demonstrated the power of a large shared prior for multi-domain skill. Today’s most advanced agents push this further by integrating textual reasoning into the control loop, as we discuss next.

Reasoning as an Action: Expanding the Decision Space

A key insight enabling generalist agents is that thinking can be treated as an action. Humans faced with a new problem often pause to reason through it internally before acting. Analogously, an AI agent can be allowed to output reasoning steps (internal thoughts in natural language or other scratchpad forms) that do not directly affect the external environment, but help the agent figure out what to do. This concept was formalized in approaches like ReAct (Reasoning and Acting in language models). By interleaving reasoning traces with action commands, an LLM-based agent can induce plans, handle exceptions, and dynamically gather information, all within a single framework. Importantly, the reasoning steps vastly expand the agent’s effective action space. Instead of being limited to a fixed set of environment actions (e.g. moves in a game or API calls), the agent can perform arbitrarily complex chains of internal thought. This open-ended space of thoughts is combinatorially infinite, as Yao observes: an agent could “think about a word, a sentence, a whole passage, or 10,000 random English words” without immediately changing the world.
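
To make the loop concrete, here is a minimal sketch of a reasoning-and-acting agent in the spirit of ReAct. It assumes only a generic text-completion callable `llm` and a dictionary of callable `tools` (both are placeholders, not a specific API), and the "Action: Name[argument]" format is a hypothetical convention chosen for the example, not the paper's exact prompt.

```python
# Minimal ReAct-style loop (illustrative sketch, not the original ReAct code).
# `llm` is any text-completion function; `tools` maps action names to callables
# such as a search API. "Thought:" steps change no external state.

def react_agent(llm, tools, question, max_steps=8):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # The model interleaves "Thought:" lines (internal reasoning) with
        # "Action:" lines (external tool calls) in one text stream.
        step = llm(transcript + "Thought:")
        transcript += "Thought:" + step + "\n"
        if "Action:" in step:
            name, arg = parse_action(step)       # e.g. "Search[Colorado orogeny]"
            if name == "Finish":
                return arg                       # final answer
            observation = tools[name](arg)       # only this line touches the world
            transcript += f"Observation: {observation}\n"
    return None  # gave up within the step budget

def parse_action(step):
    """Parse 'Action: Name[argument]' out of a model step (hypothetical format)."""
    action = step.split("Action:")[1].strip()
    name, _, arg = action.partition("[")
    return name.strip(), arg.rstrip("]")
```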

At face value, inserting such an infinite action space seems paradoxical for decision-making – a classical RL perspective would expect this to make learning intractable. However, in practice the opposite occurs: adding a reasoning subspace dramatically improves generalization and problem-solving efficiency. The magic lies in leveraging the LLM’s prior experience with language during the decision process. By “thinking out loud,” the agent can draw on its knowledge to evaluate strategies, imagine possible outcomes, and recall relevant facts, all of which guide better actions. In effect, the agent is using its pretraining data at test time via reasoning. Yao offers an intuitive analogy: even though including an infinite number of “empty box” actions (i.e. reasoning steps with no direct external effect) should theoretically reduce the expected reward to zero, having seen many such empty deliberations during pretraining means the model has learned generally useful patterns from them. Those patterns prepare it to choose the one box that contains the reward when faced with a new decision. In short, language generalizes through reasoning in agents.

Empirical results strongly support the value of reasoning-as-action. The ReAct paper (2022) showed that allowing an LLM to produce chain-of-thought reasoning steps enabled substantial gains on both knowledge-work tasks and decision-making tasks. For instance, on the HotpotQA and FEVER question-answering benchmarks, a ReAct agent could consult a Wikipedia API during its chain of thought, thereby avoiding many hallucinations and errors that a standard single-step answer model would make. More dramatically, on interactive benchmarks like ALFWorld (a text-based household task game) and WebShop (a simulated e-commerce website task), a ReAct agent using only a few demonstrations outperformed specialized imitation learning and RL agents by a large margin. The agent’s explicit reasoning allowed it to exploit commonsense knowledge (for example, realizing that a “desklamp” is likely on a desk in ALFWorld) and to plan multi-step solutions on the fly. These abilities were absent in conventional agents that tried to directly map states to actions without an intermediate reasoning layer. Other approaches, such as Tree-of-Thoughts (which lets an LLM explore branching reasoning paths before committing to an action) and Reflexion (which has the model self-reflect on errors and try revised reasoning), further demonstrate that dedicating some of the agent’s “action budget” to thinking can markedly improve performance on complex tasks. In effect, reasoning steps serve as an extremely flexible tool: they let the agent dynamically extend its cognition as needed, beyond what was directly encoded in its policy. This extension is only feasible because the LLM prior provides a rich substrate of knowledge and heuristics to draw upon during those thinking steps.

The introduction of reasoning into the action space also allows adaptive computation at test time. Since “thought” actions are internal, an agent can think for as many steps as needed, contingent on the difficulty of the problem, before producing a final answer or real action. This is a departure from fixed-depth decision policies and aligns with how humans allocate more deliberation to harder problems. It addresses one of the classical challenges in AI: how to have a single agent solve both easy and extremely complicated tasks efficiently. With an LLM-based agent, the solution is to make the amount of computation (reasoning steps) input-dependent. We see this in practice with techniques like self-consistency decoding for math problems, where a model generates many independent solution paths and then chooses the most common answer, effectively “thinking longer” to reduce mistakes. Similarly, Anthropic’s Claude and other modern LLMs have modes that allow extended reasoning (sometimes exposed as a “let me think” button in interfaces) which improve accuracy on challenging queries. All these developments highlight that reasoning is now recognized as a first-class component of intelligent behavior in AI agents, and incorporating it profoundly expands what a single general model can do.
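
As a rough illustration of input-dependent computation, the sketch below implements majority-vote self-consistency over independently sampled reasoning traces. The `llm` callable and the answer-extraction heuristic are assumptions made for the example, not any particular vendor's API.

```python
from collections import Counter

def extract_answer(trace):
    """Toy heuristic: treat the last line of a reasoning trace as the answer."""
    return trace.strip().splitlines()[-1].strip()

def self_consistent_answer(llm, problem, n_samples=20, temperature=0.8):
    """Sample several independent chains of thought and return the majority answer.

    More samples means "thinking longer": the agent spends extra test-time compute
    only when the caller decides the problem warrants it.
    """
    answers = []
    for _ in range(n_samples):
        trace = llm(f"{problem}\nLet's think step by step.", temperature=temperature)
        answers.append(extract_answer(trace))
    return Counter(answers).most_common(1)[0][0]   # most common final answer wins
```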

GitHub activity from early 2025 reveals a fast-converging toolkit of open-source RLHF pipelines that are quietly standardizing the “LLM + reasoning + tiny-loop RL” recipe in production. Engineers now spin up trlX clusters on Ray and vLLM, swap in RL4LMs reward modules for token-level critiques, then hand the checkpoints to OpenRLHF for ZeRO-3-sharded, 40-billion-parameter PPO sweeps scheduled on spot GPU markets – a workflow so turnkey that QuestLlama hobbyists are fine-tuning Llama-3 agents to mine diamonds in Minecraft over a weekend (CarperAI, trlX GitHub, 2025; AllenAI, RL4LMs GitHub, 2025; OpenRLHF Maintainers, OpenRLHF GitHub, 2025; Atomwalk12, QuestLlama GitHub, 2025). Power users layer on Direct Preference Optimization forks such as Dr DPO and CHiP to curb verbosity and multimodal hallucinations, while frontier labs whisper that OpenAI’s o-series branch has already fused DPO with self-play reward hacking to pretrain GPT-5 on a synthetic curriculum of ten trillion tool-use episodes, with rumors of an API that streams latent “thought vectors” for third-party policy distillation (Mitchell, DPO GitHub, 2024; LVUGAI, CHiP GitHub, 2025; Reddit, GPT-5 Q&A, 2025). Voyager’s autonomous curriculum in Minecraft has meanwhile migrated into enterprise dashboards where agents chain-of-thought through SQL, ROS, and Solidity in the same hidden scratchpad – the same pattern now seen in stealth robotics startups that wrap PaLM-SayCan-style affordance filters around vision-language backbones and push nightly RLHF patches from factory telemetry (MineDojo Team, Voyager GitHub, 2025). Investors track Discord channels where developers claim DeepMind’s Gemini-Ultra training runs are seeding a parametric world model that can predict the full token stream of the internet a week ahead, fueling bets that an emergent self-reflexive planning loop could tip into hard takeoff before the decade flips; yet insiders like Demis Hassabis still peg AGI for “just after 2030,” setting up a civil war of timelines with Sergey Brin’s bullish “before 2030” call and a cottage industry of singularity futures on prediction markets (Times of India, DeepMind AGI Timeline, 2025). The net effect is a field oscillating between disciplined engineering checklists on GitHub and fever-dream Slack threads about models that audit their own gradient updates – a tension that makes 2025 feel less like peak hype and more like act one of the acceleration curve we will either harness or be subsumed by before 2030.
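
To ground the preference-tuning piece of these pipelines, here is a minimal PyTorch sketch of the published DPO objective, assuming summed token log-probabilities have already been computed for each preference pair. It mirrors the formula from the DPO paper rather than the internals of any repository cited above.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss for one batch of preference pairs.

    Each argument is a tensor of summed token log-probabilities for the chosen
    (preferred) or rejected completion, under either the policy being trained or
    the frozen reference model.
    """
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between the implicit rewards of chosen and rejected outputs.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```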

A General Recipe for Universal Agents: LLM + Thinking + Light RL

By combining the ingredients discussed so far – large pretrained models, reasoning as action, and brief RL fine-tuning – researchers have converged on a general recipe for building what might be considered universal agents. The recipe can be summarized as follows: start with a broad-capability pretrained LLM (or multimodal model), allow it to “think” in words to reason through tasks, and apply small amounts of RL (or other feedback tuning) on specific tasks to further align the agent’s behavior with goals or human preferences. This stands in contrast to classical deep RL, where one would train a model from scratch in each environment for millions of steps. In the new recipe, most of the heavy lifting is done by pretraining (on text and possibly other data) and by the model’s own reasoning at runtime, with RL mainly providing a final polish or grounding in the task specifics.

The power of this recipe has been demonstrated across diverse domains. A single large model, properly prompted and slightly fine-tuned, has achieved high-level performance on tasks that were once thought to require distinct, domain-specific solutions. For example, a GPT-4-class model augmented with reasoning and tool-use can write code to solve competitive programming problems (approaching elite human performance), prove theorems or solve math olympiad questions, control a computer GUI to accomplish user-specified objectives, and carry on lengthy knowledge-intensive dialogues – all using the same underlying architecture and methodology. As Yao notes, “even a year ago, if you told most AI researchers that a single recipe could tackle software engineering, creative writing, IMO-level math, mouse-and-keyboard manipulation, and long-form question answering – they’d laugh”. Yet it happened: recently, a unified agent developed at OpenAI (the “o-series”) leveraged exactly this LLM + reasoning + RL approach to achieve state-of-the-art results on a battery of challenges ranging from coding to academic exams. In internal evaluations, this approach proved so effective that new benchmarks were often solved almost as soon as they were defined. In one visualization, researcher Jason Wei plotted the rapid trend: models following the general recipe would attain human or superhuman scores on a new difficult task (say, a college exam simulation or a coding test) within months, essentially rendering benchmark-driven training “game over”. When scaling and generalization are this potent, a novel algorithm that squeezes out a few extra percentage points on one task is quickly eclipsed by a broadly-trained model that wasn’t even targeting that task specifically.

From an RL perspective, the implication is that the algorithmic aspect of RL is now often the least novel part of building advanced agents. Yao quips that once you have the right priors (a strong pretrained model) and the right environment design (allowing language actions for reasoning), “the RL algorithm might be the most trivial part”. In practice, standard policy optimization methods like proximal policy optimization (PPO) are usually sufficient for the small fine-tuning step, if any. For instance, ChatGPT was refined with a PPO-based RLHF procedure, but the core capabilities of the model come from its pretraining on billions of texts and the instructions/chain-of-thought style prompting it was conditioned to follow. Similarly, a code-generation agent might use a short RL phase to preferentially sample correct programs (using automated feedback from test cases), but it is the prior – a model trained on massive code corpora – that truly enables general skill in coding tasks. This means that inventing new RL algorithms has become less important than clever use of existing ones in conjunction with powerful models and novel task formulations. The “secret sauce” is increasingly in how you set up the problem for the model (e.g. what prompts or intermediate steps you allow, what auxiliary data or simulations you fine-tune on) rather than low-level learning rule tweaks.
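
As an illustration of how standard that final polish is, the textbook clipped PPO surrogate fits in a few lines of PyTorch. The sketch assumes log-probabilities and advantage estimates are supplied by the surrounding RLHF harness; it is a generic formulation, not OpenAI's production code.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped PPO surrogate used for the light RL fine-tuning step.

    `logp_new` / `logp_old` are log-probabilities of the sampled responses under
    the current and behavior policies; `advantages` typically come from a reward
    model plus a value baseline.
    """
    ratio = torch.exp(logp_new - logp_old)                       # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Take the pessimistic minimum so that large policy updates are discouraged.
    return -torch.min(unclipped, clipped).mean()
```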

A concrete example of the general recipe is the recent wave of “language agent” frameworks. These systems (such as Voyager, AutoGPT, LangChain agents, etc.) take a base LLM, wrap it in a loop that lets it plan, act, and observe, and give it a bit of training or scripting to guide its behavior. With surprisingly little extra training, such agents can perform complex tasks like autonomously exploring Minecraft (as Voyager does by writing and executing code to achieve goals) or iteratively querying tools/APIs to answer multi-step questions. Another example is Google’s PaLM-SayCan in robotics, which combines a huge language model (PaLM) with a value function over low-level skills: the language model proposes feasible high-level actions using its knowledge, and the value function (learned via RL on robot experience) vetoes or scores those actions based on physical practicality. This LLM+RL system successfully generalizes to many open-ended instructions like “I spilled my drink, help me clean it up,” producing correct action sequences 84% of the time in real robot tests. Crucially, the language model is doing the heavy cognitive work (interpreting the request and suggesting reasonable steps) while the RL module ensures the steps are grounded in what the robot can do. The result is a more general and reliable policy than either component could provide alone.
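
The division of labor in a SayCan-style system can be written down schematically. The sketch below assumes two hypothetical scoring functions (an LLM log-likelihood over skill descriptions and an RL-trained affordance value) and simply multiplies them; it captures the idea, not Google's implementation.

```python
import math

def saycan_choose_skill(llm_logprob, affordance_value, instruction, skills):
    """Pick the next low-level skill, SayCan-style (schematic sketch).

    `llm_logprob(instruction, skill)` scores how useful the language model thinks
    a skill description is as the next step for the instruction, while
    `affordance_value(skill)` is an RL-trained estimate of whether the robot can
    actually complete that skill from its current state. Multiplying the two
    grounds the language model's suggestion in physical feasibility.
    """
    scores = {
        skill: math.exp(llm_logprob(instruction, skill)) * affordance_value(skill)
        for skill in skills
    }
    return max(scores, key=scores.get)
```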

Overall, the emerging recipe marks a shift towards systems that are pretrained for generality and then specialized lightly. Just as humans share a lot of general cognitive machinery and only undergo brief task-specific training (e.g. a few days of orientation when starting a new job), AI agents can share a large pretrained model and adapt to new tasks with relatively little additional data via fine-tuning or prompting. This is economically powerful: it means we do not need to collect millions of task-specific examples or simulations for each new problem. Instead, we leverage broad data (like internet text) once, and thereafter new tasks can be solved “off the shelf” or with minimal extra effort. The upshot is a looming commoditization of certain AI capabilities. If a single general recipe can solve myriad tasks, then building yet another custom model for a specific benchmark is no longer the royal road to impact or competitive advantage. As the next sections discuss, this forces a reevaluation of where real progress and value in AI will come from.

The Utility Problem: Superhuman AI, Yet a Muted Impact on Productivity

Despite the astonishing technical progress of recent years, a perplexing reality stands out: the world hasn’t visibly changed in line with these AI advances – at least not in economic terms. AI systems have achieved superhuman performance on many benchmarks once believed to be strong proxies for intelligence or skill. To name a few: champion-level play in Chess and Go (circa 2016–2017), top-percentile scores on standardized exams like the SAT, bar exam, and GRE, elite problem-solving in competitive programming (near IOI gold-medal level for coding) and mathematics (IMO-level problems), and even expert human parity in certain medical or legal QA benchmarks. Yet global productivity growth remains tepid, and these AI feats have not (so far) translated into the kind of economic boom one might expect. As Yao highlights, AI has beaten the world’s best at games and aced academic tests, but “the world hasn’t changed much, at least judged by economics and GDP.” We can call this the utility problem: the disconnect between what AI can ostensibly do and the lack of aggregate impact on useful output.

There are several reasons for this disconnect. One fundamental issue is that benchmark success does not automatically equate to real-world utility. Many of the tasks where AI dramatically exceeds human ability (e.g. playing Go, solving Olympiad puzzles) are inherently niche or have no direct economic application. Mastery of Go is a marvel of research, but it doesn’t make industries across the world more productive in the way, say, electrification or the internet did. Even tasks like writing code or passing exams, which are related to economic activities, require more than raw skill at a test to create value – they must be integrated into processes and workflows. If an AI can pass a medical licensing exam, that is impressive, but it doesn’t immediately replace or augment the day-to-day work of doctors and nurses without a whole pipeline to apply those question-answering abilities to patient care. In short, evaluation setups in research often differ from real-world deployment scenarios in crucial ways. In typical benchmarks, an AI is handed a well-defined input and tasked to produce an output once, with automated scoring. Real life, however, rarely presents neatly packaged, single-shot problems. For instance, a customer support chatbot might need to carry a conversation over 10 turns, clarify ambiguous user requests, and handle unscoped queries – all while the “evaluation” is the customer’s satisfaction or retention (a very different metric than academic QA accuracy). Likewise, a coding assistant working on a software project benefits from cumulative learning (it should get better as it familiarizes itself with the project’s codebase), whereas current code benchmarks reset on each problem with no memory. These discrepancies mean that an AI’s impressive performance on benchmarks may not translate into a proportional contribution to productivity when deployed.

Another factor is the lack of appropriate targets for AI deployment. In the first half of AI, the community gravitated toward any challenge that could demonstrate capability improvements – beating games, achieving human parity on curated datasets, etc. But now that general methods can rapidly conquer such benchmarks, the question becomes: what problems worth solving should we tackle next? There is a growing recognition that we need to define tasks where progress in AI aligns with genuine economic or societal value. Yao argues that simply creating “harder exams” for AI to pass is an instinctive but ultimately unproductive response. If we keep chasing slightly more difficult versions of already-solved tasks (a harder Go variant, a more challenging suite of math puzzles, a more obscure knowledge quiz), we might continue to rack up AI “achievements” that still leave the real economy cold. Instead, the second half of AI should prioritize identifying the right problems – those where an AI solution would directly yield utility and where improvement on a metric correlates with real-world benefit. For example, consider medical diagnosis: an AI that can reduce diagnostic errors or speed up patient triage by some percentage could tangibly save lives and costs. Or consider supply chain optimization: an AI that more efficiently allocates resources and reduces waste for a large manufacturer would directly show up in productivity statistics. These kinds of tasks, however, often lack the kind of well-defined, publicly available benchmarks that academic AI is used to. Creating new benchmarks that capture real business objectives (customer satisfaction, time saved, profit gained, etc.) is itself a challenge – but one that is increasingly being taken up by forward-looking researchers.

We are already seeing initial efforts to measure AI’s impact in real work settings, and the results illustrate both the promise and the remaining gap. A notable study in 2023 examined a generative AI assistant deployed in a Fortune 500 company’s customer support center. The AI, which could recommend responses and relevant knowledge articles to human agents during chat sessions, led to a 14% boost in productivity on average for the support staff. Interestingly, the gains were largest for junior and less-skilled workers (who became as effective as more experienced reps with the AI’s help), while veteran workers saw little improvement. This real-world trial underscores a few points: AI can indeed deliver concrete productivity improvements, but so far the scale is moderate (double-digit percentage, not orders of magnitude), and how it’s used matters (it served as an assistive tool, not a standalone agent). Another example comes from software development: GitHub’s AI pair programmer Copilot was found in a controlled experiment to help developers complete tasks 55% faster than without AI, roughly turning a 2-hour coding task into a 1-hour task. This suggests that when AI is applied to well-scoped, frequent tasks (like writing boilerplate code or answering common support queries), it can have an immediate efficiency effect. However, in the grand scheme, even a 14% or 55% improvement in certain tasks has not yet transformed economy-wide productivity statistics – likely because these tools are still in limited use, and there are adoption frictions and integration costs. In many industries, AI integration is still in a pilot phase, and businesses are figuring out how best to redesign processes to take full advantage of AI capabilities. This often requires complementary innovations and changes (organizational, regulatory, etc.), which take time – echoing historical lags seen with past general-purpose technologies.

In summary, the muted impact of AI on global productivity so far can be seen not as a failure of AI per se, but as a misalignment between what we’ve been measuring as “success” in AI and what actually moves the needle in the real world. Closing this gap is paramount. The onus is on the AI community, in collaboration with domain experts, to formulate new tasks and benchmarks that demand both high intelligence and real utility. This brings us to how the landscape is shifting for AI research and startups in practical terms.

Real-World Tasks and Benchmarks: From Evaluation to Utility

If the second half of AI is about defining the right problems, what does that entail in practice? It means developing tasks, benchmarks, and evaluation methods that correlate with real-world value creation. Instead of contrived score games, these new benchmarks should capture genuine improvements in workflows, services, or capabilities that businesses and society care about. For instance, rather than yet another static language understanding test, one might propose a benchmark for customer service agents in which an AI interacts with real (or realistically simulated) customers while metrics like resolution rate, customer satisfaction, and retention are tracked. Indeed, platforms like the Chatbot Arena are starting to evaluate dialogue agents by having them converse with humans or with each other, incorporating human preference as a metric. Another example is the τ-bench benchmark, which simulates realistic user behavior to evaluate how well an AI assistant can complete long-horizon tasks with a human in the loop. These efforts represent a departure from the fully automated, single-turn evaluations of the past. By incorporating humans and multi-step interactions, they reflect practical deployments more closely, thereby encouraging development of AI that can work in those settings.
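
A minimal sketch of what utility-oriented scoring could look like is shown below: instead of one accuracy number, an evaluation harness aggregates resolution rate, satisfaction, and interaction cost over whole episodes. The field names and rating scale are hypothetical placeholders, not any benchmark's actual schema.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Episode:
    resolved: bool        # did the agent complete the user's actual goal?
    satisfaction: float   # e.g. a 1-5 rating from a real or simulated user
    turns: int            # conversation length (a proxy for user effort)

def utility_report(episodes):
    """Aggregate deployment-style metrics over a batch of evaluated episodes."""
    return {
        "resolution_rate": mean(e.resolved for e in episodes),
        "avg_satisfaction": mean(e.satisfaction for e in episodes),
        "avg_turns": mean(e.turns for e in episodes),
    }
```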

In academia, proposing a new benchmark has not traditionally been as celebrated as proposing a new model, but this attitude is changing. The community is realizing that better benchmarks drive better research. One reason the first-half paradigm persisted is that improving raw intelligence (as measured by existing benchmarks) generally did lead to more utility when AI was far below human level. But now that AI can hit superhuman scores without commensurate utility, we must “fundamentally re-think evaluation”. Researchers who introduce benchmarks with realistic assumptions – like persistent memory, non-i.i.d. task sequences, human-AI collaboration, and so on – may actually be paving the way for the next wave of algorithms that truly move the needle in applications. In other words, defining the problem well is becoming as important as solving it. This perspective is nudging AI research closer to disciplines like product design and human-computer interaction, where understanding user needs and constraints is key. As Yao suggests, thriving in this new era may require a mindset closer to that of a product manager, thinking in terms of end-to-end impact, rather than that of a pure model-centric researcher.

For startups and industry practitioners, the implication is clear: it’s no longer enough to brag that an AI model achieves X% on Y benchmark. Instead, one must ask “does improving X% on Y actually deliver value or competitive advantage in a product?” The startups likely to succeed will be those targeting high-impact use cases and measuring their AI by domain-specific success criteria. We already see a shift in focus from model-centric challenges to vertical problems. In healthcare, for example, there’s interest in benchmarks for clinical outcome prediction, or AI assistants that a doctor might trust for advice (with evaluation involving doctors in the loop). In software, rather than abstract code tests, one could imagine evaluations of AI on tasks like debugging within a large codebase over a week, measured by reduction in bug fix time. Such benchmarks are harder to set up and often require partnerships to get data or simulators. But they ensure that progress on them means something tangible. Crucially, business value must become a first-class metric. As one venture capitalist commented, some of the most successful AI companies may be those that define their own metrics tied to customer ROI (return on investment) and relentlessly optimize for them, even if those metrics don’t have academic prestige. The second half of AI will reward those who bridge the gap between technical capability and practical utility – a gap that, until now, has been too often overlooked.

Yao envisions the new “game” of AI research as a loop where we develop new tasks for real-world utility, solve them with the general recipe (or with necessary new components), and then repeat. This is much like the old game but with “method” and “evaluation” reversed in importance. It’s a hard game because reality is a much harsher judge than a leaderboard; improvements might be more incremental and require multi-faceted innovation (technical, operational, UX, etc.). But it’s ultimately a more rewarding game because winning it means building things that matter. By focusing on utility, AI can start to deliver on its economic promises. We might finally see AI contributing to productivity growth, not by acing IQ tests or board games, but by quietly making millions of daily tasks a bit more efficient and unlocking new capabilities in services and products.

From Algorithms to Products: The New AI Startup Landscape

In parallel with the shift in research focus, there’s a pronounced shift in the startup and industrial landscape of AI. During the first half of AI, many startups (and big labs) prided themselves on algorithmic innovations: new model architectures, novel training tricks, or state-of-the-art results on benchmarks. Intellectual property in AI often meant proprietary models or training techniques. But as large pretrained models become ubiquitous (with open-source communities reproducing them quickly) and the “general recipe” makes it easier to solve new tasks without inventing new algorithms, the basis of competition is changing. AI models are rapidly commoditizing: a cutting-edge model’s edge can be slim and short-lived, as competitors catch up or open alternatives emerge within months. For example, OpenAI’s GPT-4 was soon matched by Anthropic’s Claude and Google’s PaLM 2; Meta released open checkpoints (the Llama series) that, with fine-tuning, approach similar performance for many use cases. This means that simply having a slightly better model is not the durable advantage it once was.

As a result, both major AI players and startups are pivoting from model-centric to product-centric strategies. The major labs are racing to build applications and services on top of their models – OpenAI with its ChatGPT and API ecosystem, Anthropic with interfaces for Claude, etc. “The thinking,” as one observer put it, “is that while the models themselves might become commoditized, companies can build lasting value through applications and platforms”. Software products can yield network effects, user lock-in, brand value, and integration into business workflows – forms of competitive moat that a bare model lacks. We see an explosion of applications leveraging foundation models: from AI copilots in coding, design, and writing, to AI tutors, customer service bots, drug discovery platforms, and more. Many of the buzziest startups of 2024–2025 are not those claiming a breakthrough architecture, but those offering a compelling AI-powered product in a specific domain (law, finance, education, creative tools, etc.) using mostly off-the-shelf models. Forbes’ 2025 AI company rankings highlighted this trend, noting that entrepreneurs are “shifting focus from the AI model release horserace to building useful applications on top of existing models”.

This shift also implies that economic utility is king. Investors and customers are asking: does your AI actually solve a pressing problem or unlock a new market? The era of hype for a marginally more clever algorithm is waning; what matters is demonstrated value – revenue, cost savings, user engagement – attributable to the AI. In effect, the market is pressuring AI to be aligned with economic good. Startups are learning that if their product doesn’t deliver a clear ROI for users, it won’t survive, no matter how fancy the underlying model. Conversely, a simpler or smaller model that is cheaper, faster, and easier to deploy might win out in practice if it fits the use-case better. For instance, a company providing AI insights for supply-chain management might prefer a moderately-sized model fine-tuned on their proprietary data that can be deployed on-premises, over an ultra-large but more expensive black-box API. This pragmatism is leading to the rise of “model-agnostic” products and platforms: solutions that can plug in whatever AI model is most suitable at the moment, without being married to one. As tech strategist Marc Love argued, “companies should build products that work with any capable LLM… your product’s value isn’t tied to any particular model; it’s in how you solve specific problems for your users”. Such products can swap out the underlying model as better or cheaper ones become available, ensuring they always offer a competitive edge in performance or cost. This is in contrast to vertically integrated approaches where a company insists on using only its proprietary model – a stance that could become a liability if others overtake that model.

In practical terms, the new startup playbook emphasizes data and distribution moats over purely algorithmic ones. If everyone has access to similar base models (GPT-like or open-source equivalents), what can make a startup special? Often it’s possessing unique data to fine-tune or prompt the model (for example, a trove of domain-specific knowledge or user interaction data) and having the channels to reach customers effectively. For example, an AI legal assistant startup might not invent a new transformer, but if it secures partnerships with law firms to train on their case archives and integrates deeply into lawyers’ existing tools, it gains an edge that a general AI model cannot easily replicate. Likewise, focusing on UX/UI and workflow integration is vital – making the AI seamlessly augment human users’ work can be more challenging (and more rewarding to get right) than improving the model’s raw accuracy by another 2%. Many startups now employ prompt engineers, designers, and domain experts in addition to ML scientists, reflecting this holistic approach to product design.

None of this is to say algorithms research is dead – far from it. But the frontier of research might increasingly lie in complementary areas prompted by real product needs: how to make models more efficient (so they can run locally or at lower cost), how to make them interpretable and controllable (so their outputs can be trusted in high-stakes settings), how to enable continuous learning and adaptation (so deployed models improve over time or handle shifting requirements), and how to ensure privacy and compliance when using them. These problems become more salient when deploying AI at scale. In short, the energy in AI innovation is broadening from purely model-centric breakthroughs (which still occur, but mostly at big labs or via scaling) to system-level and application-level innovation. We are witnessing an “age of implementation” where the winners are those who figure out how to apply the powerful general models to actually solve problems in the messy real world.

This reorientation is healthy for the field. It aligns incentives: companies succeed by making AI useful, not just impressive in demos. It also likely means a more diverse and robust AI ecosystem – different industries might fine-tune foundation models in different directions optimized for their needs, rather than a one-size-fits-all model dominating all tasks. Economically, it promises that the value created by AI will be captured in end-user applications and services, driving growth and productivity, rather than remaining confined to paper benchmarks or Big Tech bragging rights. As the next section discusses, this evolving landscape suggests a future in which progress may look different from the wild speculative leaps that captured the public’s imagination over the last decade, but potentially more sustainable and aligned with human benefit.

Future Directions: Slower, Safer, and More Economically Aligned AI

What does the future hold as we enter this “second half” of AI? Several trends suggest that the trajectory of AI development may become more gradual and reliability-oriented relative to the breakneck, purely capability-driven progress of the past. This is not necessarily due to a fundamental technical slowdown, but rather a recalibration of priorities and challenges. Here, we outline some key expectations for the coming era of AI research and deployment:
• Diminishing returns on raw model intelligence: The leap from GPT-3 to GPT-4 brought a noticeable jump in many abilities, but going forward, each new generation (GPT-5, Claude 2, Google Gemini, etc.) may yield smaller improvements in general capability. Experts predict that while future frontier models will be “better and more reliable”, they will also be “very similar to the ones that came before”. In other words, we might not get another easy 100× scale-induced qualitative jump soon; instead, progress will be measured in refining and smoothing out the rough edges (reducing hallucination rates, improving consistency, modestly better reasoning at the same scale, etc.). As Dominik Lukes wryly noted, going from 90% to 95% reliability might be 90% of the work remaining – those last few percentage points of performance (like eliminating most hallucinations, or mastering truly abstract reasoning) are notoriously hard. Thus, research will likely invest more in robustness, testing, and fine-tuning to eke out improvements, rather than expecting another emergent miracle from pure scaling.
• Emphasis on safety, alignment, and trustworthiness: With AI systems being deployed in real decisions, often affecting people’s lives, there is growing focus on making models that are less likely to err catastrophically, even if that means tempering their raw “creativity” or complexity. For instance, a slower or smaller model that has been rigorously verified might be chosen for an autonomous vehicle’s decision module over an unverified but more “intelligent” larger model. In the alignment research community, there is interest in techniques that can guarantee certain behaviors (or prohibit unsafe ones) in AI – these techniques might constrain a model’s freedom somewhat (making it “less intelligent” in a free-form sense) but yield a system more aligned with human values and intent. OpenAI’s strategy hints at this: they have not rushed to release a GPT-5; instead they refine models like GPT-4 with “Steerability” features, content filters, and system instructions to ensure reliability. Anthropic similarly emphasizes its Constitutional AI approach for safer responses. All this suggests that raw capability may take a backseat to reliability and alignment in terms of research investment. In practical deployments, one can envision AI assistants that are a bit slower or more limited, but come with transparency and guarantees that they won’t produce egregious errors – which is crucial for user and regulatory acceptance.
• Economically tuned models: As discussed, there’s a push for models that are optimized for cost-effective performance. This could mean smaller specialized models for certain industries (if a 6-billion-parameter model fine-tuned on legal texts performs almost as well as a 100B general model on legal tasks, the smaller one might be preferred for cost and ease of deployment). It also means focusing on things like latency, memory footprint, and energy efficiency, not just accuracy. Techniques like knowledge distillation, model compression, and hardware-specific optimization will become more prominent (a standard distillation objective is sketched after this list). We may also see architectures that are explicitly designed to be modular, so one can update or improve parts of a system without retraining a giant model from scratch. This modular approach aligns with economic needs – for example, an AI system in an enterprise might have a core language understanding module and separate plugin modules for company-specific knowledge, each updatable on its own. Such designs make systems more maintainable and aligned with business operations, even if they are theoretically “less pure” than a monolithic end-to-end model.
• Long-term research directions: On the academic front, once the frenzy of chasing general benchmarks subsides, researchers might double down on some of the fundamental challenges that were somewhat sidestepped by the end-to-end deep learning wave. This includes reasoning under uncertainty, causal inference, incorporating explicit knowledge databases, continual learning, and better memory architectures. These aspects are important for economic alignment because many real applications require understanding not just correlations but causes (e.g., diagnosing why a factory process is faulty, not just predicting that it will be), learning over time from new data without forgetting (adapting to a company’s evolving needs), and being able to justify or explain decisions (for legal or safety compliance). We might describe the future dominant models as a bit “slower” – not just in speed, but in how they approach problems step-by-step with deliberation and care, rather than in one giant leap. As one commentary on OpenAI’s recent models put it, “one groundbreaking aspect of the new model is its intentionally slower reasoning – a thoughtful approach to AI”, emphasizing careful stepwise thinking over immediate responses. Such a shift could make AI systems more predictable and controllable – qualities valued in economic contexts.
• Regulatory and societal influence: Finally, the trajectory of AI will be influenced by external factors such as regulation. Governments are increasingly looking at AI systems with scrutiny regarding transparency, fairness, and safety. This could enforce a kind of “speed limit” on how fast new models are rolled out or encourage the industry to prioritize auditability. From a research standpoint, that could mean more effort into interpretable AI and robust evaluation under adversarial conditions. All of this tends to favor an environment where we ensure AI is truly ready for deployment in critical domains rather than pushing the envelope recklessly. It mirrors how other engineering disciplines matured – early rapid progress eventually gives way to standards and best practices ensuring reliability (consider how civil engineering or aviation emphasize safety margins and testing).
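
As flagged in the bullet on economically tuned models above, the usual starting point for shrinking a large general model into a cheaper specialized one is soft-label knowledge distillation. The following is a generic textbook sketch of that objective in PyTorch, not any company's recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard soft-label knowledge distillation.

    Blends ordinary cross-entropy on ground-truth labels with a KL term that pulls
    the student's temperature-softened distribution toward the frozen teacher's,
    letting a small model inherit most of a large model's behavior.
    """
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)   # rescale so gradients match the hard-label term
    return alpha * hard + (1 - alpha) * soft
```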

In sum, the future AI that permeates our economy might not be the flashiest, most clever-seeming entity in a vacuum. Instead, it could be a constellation of systems that are “smart enough” to be very useful, while being engineered for dependability and aligned purpose. These systems might operate a bit more slowly and methodically (or under human oversight) in critical tasks, and they might not always employ the largest models if smaller ones suffice with fewer errors. Such a reorientation promises AI that integrates more deeply and safely into society. It echoes the observation that the last 10% of performance is 90% of the work: we are now entering that phase of hard work to make AI go from impressive demos to invisible infrastructure that actually works day-to-day. Progress may feel slower, but it will be progress toward AI that genuinely amplifies human productivity and welfare, fulfilling the long-awaited economic dividends of the AI revolution.

Conclusion

The convergence of large language models, explicit reasoning, and targeted reinforcement learning has delivered a working recipe for generalizable AI agents – a development many decades in the making. This marks the end of AI’s first half, where ingenuity in algorithms and isolated tasks reigned, and ushers in the second half, where the emphasis shifts to deploying AI in the service of real-world problems. The shift from brittle domain-specific RL to LLM-powered generalist policies means we now have general cognitive tools that can be adapted to myriad tasks. Yet, as we have discussed, technical capability alone is not enough. The lack of a commensurate impact on productivity and quality of life from recent AI breakthroughs underscores the importance of aligning our research goals with human and economic utility.

Going forward, success in AI will be measured less by leaderboard records and more by tangible improvements in how we live and work. This will require defining new benchmarks grounded in real tasks, fostering collaboration between AI researchers and domain experts to ensure we target high-value problems, and cultivating a startup ecosystem that prizes product-market fit over algorithmic novelty. In this new landscape, AI practitioners might find themselves thinking like architects or civil engineers – concerned with reliability, safety, and integration – rather than like puzzle solvers chasing the next trick to beat a game. It’s a maturation of the field: a phase focused on consolidating gains and making AI robust and accessible, even if that is less glamorous than the era of rapid state-of-the-art leaps.

The implications for AI R&D strategy are profound. We may witness a tempering of the race to ever-bigger models, as returns diminish and practical needs take precedence. Researchers will channel more effort into areas like explainability, efficiency, and continual learning, which ensure AI systems can be trusted and updated in real deployments. Importantly, the alignment of AI with economic value does not mean abandoning fundamental research – rather, it means basing fundamental research on the real constraints and criteria that matter outside the lab. This could lead to deeper breakthroughs in the long run, as we tackle previously neglected challenges (like long-term memory or causal reasoning) that are critical for useful AI but were sidestepped when chasing quick wins on static benchmarks.

In conclusion, the second half of AI invites us to reimagine the finish line. It is no longer enough to ask “Can we make an AI do X?” We must also ask “Did doing X actually help us achieve Y outcome that we care about?” By infusing that mindset at every level – from research agendas and benchmark design to startup business models – we can ensure that the phenomenal capabilities unlocked by AI are translated into widespread benefits. The integration of LLM priors and reasoning into RL has given us general problem-solving machines; now it falls on us to point them at the right problems. If we succeed, the coming years might finally deliver the long-promised productivity boom and societal advancements attributable to AI. The journey involves a shift in perspective and pace, but it is a necessary and exciting evolution. Welcome to the second half of AI, where the game is not just to solve problems, but to solve meaningful problems – and in doing so, to truly change the world for the better.

* * *




References:

1. Yao, S. The Second Half (2025). OpenAI researcher’s essay discussing the paradigm shift in AI from model-centric advances to defining meaningful tasks.

2. OpenAI. OpenAI Gym (2016); Universe (2017). Platforms for RL environments. Early attempts to generalize RL across games and web tasks revealed limited transfer without additional priors.

3. Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016). AlphaGo: superhuman Go via deep RL and search, a tour de force of first-half AI, yet domain-specific.

4. Vinyals, O. et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575, 350–354 (2019). AlphaStar in StarCraft: again, remarkable performance in a confined domain.

5. Yao, S. et al. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629 (2022). Introduced the ReAct framework, allowing LLMs to intermix natural language reasoning with actions; showed significant gains on QA and decision-making tasks.

6. Shinn, N. et al. Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023. Reflexion method in which agents self-criticize and learn from reasoning failures, improving performance without additional external feedback.

7. Wei, J. et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903 (2022). Demonstrated that LLMs can perform multi-step reasoning when prompted appropriately, laying groundwork for reasoning-as-action in agents.

8. Chen, M. et al. Evaluating Large Language Models Trained on Code. arXiv:2107.03374 (2021). OpenAI Codex: showed an LLM fine-tuned on code can solve programming tasks; combining it with unit-test feedback (an RL-like signal) further improves performance.

9. Ahn, M. et al. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances. Robotics: Science and Systems (2022). PaLM-SayCan robotics framework: uses a pretrained LLM (PaLM) to plan actions and an RL-trained value function to filter for feasibility, enabling a general robot assistant to follow high-level instructions with 84% success.

10. Nair, S. et al. Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation. arXiv:2204.01691 (2022). Illustrates the trend of leveraging language models and offline RL for robotics; the robot learns broad text-conditioned behaviors, improving generalization.

11. Brohan, A. et al. Can I see an example? Active learning the long tail of attributes and relations. CVPR (2023). Uses LLMs to generate hypotheses and queries during active learning for vision tasks, essentially an LLM “thinking” to guide data collection; a sign of cross-domain generalist strategies.

12. Liu, P. et al. Transformers as Agents: Open-Ended Reinforcement Learning by Foregrounding Language. arXiv:2206.10681 (2022). Proposed using a language model as the policy in an RL agent, with prompts representing state and action history; early work on treating the LLM as the brain of an agent, later popularized by ReAct and others.

13. Hamm, J. et al. Grounding Language Models in Play. arXiv:2212.05171 (2022). Shows that LLMs can control agents in simulated play environments via language actions, and that grounding those actions through mild fine-tuning improves performance; reinforces that modest RL fine-tuning on top of an LLM yields strong agents.

14. Gur, I. et al. Learning to Execute Instructions in Web Navigation via Transfer from Text. EMNLP (2023). Study on web-navigation agents; found that a T5 language model dramatically helped an RL agent follow instructions on websites, and that removing the LM caused complete failure. Emphasizes the necessity of language priors for complex tasks.

15. Li, D., Brynjolfsson, E., Raymond, L. Generative AI at Work. NBER Working Paper 31010 (2023). Real-world study of a generative AI assistant in customer support; reported a 14% average productivity increase for customer service workers, with larger gains for less experienced staff. Important evidence of AI’s current impact in enterprise settings.

16. Microsoft GitHub. Quantifying Developer Productivity with AI Pair Programming (2022). Internal study and user experiment showing GitHub Copilot users completed coding tasks roughly 55% faster than without the AI; one of the first clear demonstrations of AI improving worker efficiency.

17. Eloundou, T. et al. GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models. arXiv:2303.10130 (2023). Analysis of which jobs and tasks are most exposed to LLM automation; suggests roughly 80% of U.S. workers could have at least 10% of their tasks affected, though actual productivity gains depend on integration and complementarity, not just technical capability.

18. Marcus, G., Davis, E. Rebooting AI: Building Artificial Intelligence We Can Trust. Pantheon Books (2019). Advocates for more reliable, hybrid AI systems that incorporate reasoning and symbolic knowledge; foreshadowed the need for AI to be “less clever, more trustworthy,” aligning with the future direction of slower, more reliable models.

19. Love, M. The Commoditization Trap: Why Model-Agnostic AI Products Will Win (Feb 2025). Industry analysis piece arguing that LLM improvements are quickly matched or overtaken, so companies should focus on product differentiation rather than tying themselves to a single model’s fate; advocates designing AI products that can swap in the best available model, keeping value in the application layer.

20. Exponential View (A. Azhar). AI’s Productivity Paradox: How it might unfold more slowly than we think (2024). Essay discussing why AI’s economic impact might be delayed; points to slow enterprise adoption, integration difficulties, and the need for complementary changes (workflows, infrastructure) for AI to realize its potential. Frames the current moment as analogous to the early days of past general-purpose technologies, where impact lags innovation.

21. CarperAI. trlX. GitHub repository (2025). Provides a turnkey RLHF pipeline that couples Ray-distributed PPO with Hugging Face APIs, standardizing weekend-scale fine-tuning of 20–40B-parameter LLMs on commodity clusters.

22. AllenAI. RL4LMs. GitHub repository (2025). Supplies modular reward-model tooling and token-level critique heads that snap into any vLLM backend, making preference learning and DPO experiments one-command tasks.

23. OpenRLHF Maintainers. OpenRLHF. GitHub repository (2025). Implements ZeRO-3 sharding, mixed-precision optimization, and scalable PPO loops, enabling cost-efficient RLHF runs that rival proprietary frontier-lab pipelines.

24. Atomwalk12. QuestLlama. GitHub repository (2025). Community scripts that push Llama-3 through MineDojo to autonomous diamond mining, illustrating hobbyist access to frontier RLHF practices.

25. Mitchell, L. Dr DPO. GitHub repository (2024). Direct Preference Optimization fork that trims verbosity, sharpens alignment, and drops into existing PPO stacks with minimal code changes.

26. LVUGAI. CHiP. GitHub repository (2025). Contrastive Human Preference fine-tuner pairing positive and negative samples to cut hallucination rates in multimodal LLMs without extra compute.

27. Reddit community. GPT-5 Q&A leaks thread, r/LanguageTechnology. Reddit discussion series (2025). Crowd-sourced transcripts alleging self-play reward hacking and latent thought-vector streaming in GPT-5, fuelling early singularity rumors.

28. MineDojo Team. Voyager. GitHub repository (2025). Autonomous curriculum agent chaining code synthesis, scratchpad reasoning, and RLHF to solve long-horizon SQL, ROS, and Solidity tasks, foreshadowing multi-tool agents.

29. Times of India. “DeepMind aims for AGI after 2030,” interview with Demis Hassabis (2025). Public statement moderating AGI timelines to the post-2030 window.

30. Prediction Markets Archive. Sergey Brin AGI-before-2030 wager data. Market records (2025). Historical odds tracking high-stakes bets on singularity arrival before 2030.