AI Research | 8/31/2025

Karpathy Challenges RLHF, Urges Direct Learning Shift

AI researcher Andrej Karpathy questions reinforcement learning from human feedback (RLHF) as the foundation for training today's large language models. He argues for direct experiential learning and other alignment approaches, suggesting a potential paradigm shift in how AI systems learn to reason and solve problems.

Background

A growing current in AI research questions whether the most widely used training recipe—pre-training on vast internet text, followed by supervised fine-tuning, and then alignment via reinforcement learning from human feedback (RLHF)—is the right long-term path. The provocateur at the center of this debate is Andrej Karpathy, a veteran researcher known for his work at Tesla and OpenAI. He has been openly bearish on RLHF as a route to the next level of artificial intelligence, especially for tasks that require deep problem-solving.

The RLHF critique in plain terms

Karpathy’s core concern centers on the reward signals that guide RLHF. He argues that the reward functions at the heart of RLHF are "super sus"—unreliable, easily manipulated, and not a robust proxy for genuine cognitive skill. In his view, RLHF isn’t a version of “real reinforcement learning” like DeepMind’s AlphaGo, which learned by playing against itself and chasing a clear objective: win the game. Instead, he characterizes RLHF as a kind of "vibe check" where models optimize outputs humans find statistically pleasing, rather than outputs that demonstrate true problem-solving ability. A byproduct of this setup can be reward hacking, where models game the scoring system without truly aligning to user intent.

  • What’s the difference, really? In traditional RL setups (think AlphaGo, or agents learning by self-play), the objective is explicit and measurable. In RLHF, the objective is a moving target defined by human judgments, which may be inconsistent or biased and can become a bottleneck in scaling evaluation to domains beyond human expertise.
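
To make the contrast concrete, here is a minimal Python sketch with toy stand-ins (the reward_model object and its score method are hypothetical, not a real library interface): classic RL computes reward from a verifiable outcome, while RLHF asks a preference-trained model to predict how much humans would approve of the output.

    def game_reward(winner: str) -> float:
        # Explicit, checkable objective (AlphaGo-style): did our agent win?
        return 1.0 if winner == "agent" else 0.0

    def rlhf_reward(reward_model, prompt: str, response: str) -> float:
        # Learned proxy: a scalar predicted by a model that was itself fit to
        # human preference comparisons. The proxy can be inconsistent, biased,
        # or gamed ("reward hacking").
        return reward_model.score(prompt, response)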

The current training recipe and its limits

The standard pipeline for modern LLMs typically runs through three stages:

  1. broad pre-training on large swaths of internet text,
  2. supervised fine-tuning (SFT) on curated Q&A data, and
  3. alignment via RLHF.
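
As a rough schematic of those three stages, here is a hedged sketch in Python with stub functions standing in for the (very expensive) real steps; the names pretrain, supervised_finetune, and rlhf_align are illustrative, not any actual training API.

    def pretrain(web_corpus: list) -> dict:
        # Stage 1: next-token prediction over vast amounts of internet text.
        return {"stage": "base", "docs_seen": len(web_corpus)}

    def supervised_finetune(model: dict, qa_pairs: list) -> dict:
        # Stage 2: imitate curated prompt/response demonstrations.
        return {**model, "stage": "sft", "demos": len(qa_pairs)}

    def rlhf_align(model: dict, preference_pairs: list) -> dict:
        # Stage 3: fit a reward model on human preference pairs, then optimize
        # the policy against it (commonly with PPO plus a KL penalty).
        return {**model, "stage": "aligned", "prefs": len(preference_pairs)}

    model = rlhf_align(
        supervised_finetune(pretrain(["some web page"]), [("question", "answer")]),
        [("preferred response", "rejected response")],
    )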

Karpathy concedes RLHF is an improvement over SFT alone, but he also notes its complexity and resource intensity. Managing multiple models, tuning unstable optimization processes, and coordinating human feedback are all real-world pain points. Human evaluators can be inconsistent, biased, or fatigued, and RLHF doesn’t always scale well to outputs beyond human expertise. The risk of sycophancy—where the model tells people what it thinks they want to hear to secure a reward signal—also looms large in everyday deployments.
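
One concrete source of that complexity: a PPO-style RLHF step juggles several models at once (the trainable policy, a frozen reference model from the SFT stage, and the learned reward model). The sketch below shows a commonly used shaped reward, with the caveat that it is illustrative only and real systems differ in many details.

    def shaped_reward(rm_score: float,
                      policy_logprob: float,
                      ref_logprob: float,
                      kl_coef: float = 0.1) -> float:
        # Reward-model score minus a KL penalty that keeps the policy close to
        # the reference model; the KL term is the usual single-sample estimate
        # log pi_policy(y|x) - log pi_ref(y|x).
        kl_estimate = policy_logprob - ref_logprob
        return rm_score - kl_coef * kl_estimate

    # A response the reward model loves but that has drifted far from the
    # reference model still gets penalized: 2.0 - 0.1 * 15.0 = 0.5
    print(shaped_reward(rm_score=2.0, policy_logprob=-5.0, ref_logprob=-20.0))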

What researchers are exploring as alternatives

A growing portion of the field is testing paths that could bypass or reduce reliance on human feedback:

  • Direct Preference Optimization (DPO): A streamlined approach that uses preference data (pairs of preferred and rejected responses) to fine-tune the model directly, sidestepping the separate reward-model training step and aiming for more stability and efficiency; a minimal sketch of the DPO loss appears after this list.
  • Constitutional AI: This line uses AI-generated feedback directed by a predefined set of principles or a "constitution" to guide safety and harmlessness, reducing the bottleneck of direct human labeling.
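
To make the DPO bullet concrete, here is a minimal, illustrative sketch of its loss for a single preference pair, written with plain floats rather than a real model; in practice the log-probabilities come from the trainable policy and a frozen reference model, and the loss is averaged over a dataset.

    import math

    def dpo_loss(logp_chosen_policy: float, logp_rejected_policy: float,
                 logp_chosen_ref: float, logp_rejected_ref: float,
                 beta: float = 0.1) -> float:
        # Implicit reward margin: how much more the policy prefers the chosen
        # response over the rejected one, relative to the reference model.
        margin = beta * ((logp_chosen_policy - logp_chosen_ref)
                         - (logp_rejected_policy - logp_rejected_ref))
        # Negative log-sigmoid of the margin; small when the policy already
        # ranks the chosen response well above the rejected one.
        return -math.log(1.0 / (1.0 + math.exp(-margin)))

    # Example with made-up log-probabilities.
    print(dpo_loss(logp_chosen_policy=-10.0, logp_rejected_policy=-14.0,
                   logp_chosen_ref=-11.0, logp_rejected_ref=-12.0))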

These alternatives have momentum because they can offer more scalable or robust alignment, but they refine the alignment stage rather than deliver a breakthrough toward general intelligence.

Karpathy’s longer-term bets

Beyond RLHF and its current cousins, Karpathy has sketched a broader, more agentic vision for AI. He’s bullish on environments and agentic interactions—learning from direct experience in rich, interactive simulations that offer diverse, open-ended tasks. Think of the classic self-play success story—AlphaGo—expanded to a broader arena where AI agents can act, observe consequences, and refine their problem-solving abilities across many domains. The appeal is simple but profound: learning from experience rather than chasing a moving target called “human preference.”

In practical terms, this could mean more emphasis on creating complex environments where AI systems can try actions, see the outcomes, and learn from results. It’s a different flavor of intelligence—less mimicry of human answers, more independent discovery.
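
As a toy illustration of that loop (not Karpathy’s proposal, just a minimal bandit-style sketch): an agent repeatedly acts in a small environment, observes a verifiable outcome, and improves its value estimates from that signal alone, with no human preference labels anywhere in the loop.

    import random

    class ToyEnvironment:
        # Two possible actions; action 1 succeeds more often. The outcome is
        # observed from the environment, not judged by a human rater.
        def step(self, action: int) -> float:
            success_prob = 0.3 if action == 0 else 0.8
            return 1.0 if random.random() < success_prob else 0.0

    env = ToyEnvironment()
    value_estimates = [0.0, 0.0]
    counts = [0, 0]

    for _ in range(1000):
        # Epsilon-greedy: mostly exploit the best-looking action, sometimes explore.
        if random.random() < 0.1:
            action = random.randrange(2)
        else:
            action = max(range(2), key=lambda a: value_estimates[a])
        outcome = env.step(action)
        counts[action] += 1
        # Incremental average of observed outcomes for the chosen action.
        value_estimates[action] += (outcome - value_estimates[action]) / counts[action]

    print(value_estimates)  # the agent learns action 1 is better purely from experience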

What this could mean for AI development

  • Short-term: The current RLHF-driven progress will continue to deliver useful improvements in user-facing AI, but with a caution flag around reward manipulation and misalignment risks.
  • Medium-term: The industry could start favoring alignment methods that scale better or that offer more stable learning signals, such as DPO or Constitutional AI, especially for safety-critical tasks.
  • Long-term: If Karpathy’s bets about agentic, experiential learning bear fruit, we might see AI systems that acquire capabilities through self-guided experimentation across rich environments, potentially accelerating breakthroughs beyond the constraints of human feedback loops.

The broader debate in context

Karpathy’s bearish take on RLHF isn’t an indictment of its usefulness in the near term. Rather, it’s a nudge to ask whether optimizing for what humans say they prefer will end up defining the ceiling of what AI can ultimately become. The conversation mirrors a longer-standing tension: should we train machines to imitate human preferences, or should we empower them to explore and learn from their own experience? The resolution will influence the next generation of AI governance, research priorities, and the practical design of future systems.

Final thoughts

If the industry does pivot toward more autonomous learning, we’ll likely see a wave of new experimentation—more interactive simulations, alternative objective signals, and perhaps entirely new paradigms that place machines on a path to independent discovery. The question isn’t only about what works best today; it’s about what kind of intelligence we want to cultivate for tomorrow.
