Unsupervised Learning

Ep 66: Sholto Douglas, Member of Technical Staff at Anthropic, on Claude 4, the Next Phase for AI Coding, and the Path to the AI Coworker

May 22, 2025

Summary

In this episode of Unsupervised Learning, Sholto Douglas, Member of Technical Staff at Anthropic, offers an in-depth discussion of the latest AI model, Claude 4, and its implications for AI coding, research acceleration, and the evolving role of AI as a coworker. Claude 4 demonstrates enhanced capabilities in autonomous coding within large, complex codebases, marking meaningful progress toward agentic AI that can take multi-step actions with limited human input. Coding is emphasized as the leading indicator of AI progress because of its structured nature, abundant data, and clear evaluation metrics, which enable faster mastery and make it a bellwether for general AI capabilities. Despite these gains, agent reliability remains an ongoing challenge: success rates improve incrementally over longer time horizons rather than guaranteeing first-try correctness, making success rate over time horizon a key measure of model maturity.

AI agents currently accelerate research mainly by automating engineering work, while scientific idea generation is seen as an emerging capability that will grow with improved domain-specific feedback loops. The episode discusses verifiability challenges in domains like medicine and law but notes promising advances in domain-specific benchmarks that make AI contributions more reliable beyond coding and ML research. It highlights the necessity of personalizing and fine-tuning models at the company and individual levels to unlock value beyond generic capabilities, exemplified by partnerships such as Anthropic's collaboration with Databricks. A forecast places near-superhuman coding reliability within one to two years, with broader white-collar automation expected by the late 2020s, marking significant shifts in workforce and enterprise operations.

The discussion also covers the dominant trend of scaling large foundation models with adaptive compute strategies to optimize efficiency and capability. Energy consumption and compute capacity emerge as critical long-term constraints that call for government investment and policy attention. Finally, the episode addresses AI evaluation strategies, the interplay between labs and application developers in the AI ecosystem, and the importance of interpretability research for model safety and alignment. Throughout, Douglas underscores AI's accelerating progress and its transformative potential as AI transitions from tool to intelligent, personable collaborator and autonomous researcher.

Key Takeaways

  1. Agent reliability is a central challenge in AI deployment: current models show substantial progress but do not yet guarantee success in one-shot scenarios. Instead, success rates improve incrementally when evaluated over multiple attempts or extended time horizons, reflecting the complexity of real-world tasks and the stochastic nature of AI. This nuanced reliability metric moves away from simplistic notions of correctness, setting more realistic expectations and guiding ongoing engineering efforts (a minimal sketch of the metric follows this list).
  2. Coding is the clearest early indicator of AI model progress thanks to its structured data, well-understood success criteria, and direct correlation with accelerating AI research. Anthropic prioritizes advances in coding capability as a gateway to broader AI competencies and to practical productivity gains in software development and scientific work.
  3. AI agents are already transforming research workflows by autonomously performing engineering tasks and iterating efficiently on complex problems, although fully autonomous scientific breakthroughs remain nascent. Progress is expected to expand as feedback loops refine domain-specific expertise, eventually enabling AI to propose novel scientific ideas.
  4. Verifiability, meaning how clearly an AI task's outcome can be assessed, is a major factor in AI progress: coding and ML research benefit from high verifiability, whereas fields like medicine and law face inherently fuzzy evaluations. Emerging benchmarks that simulate human expert evaluation are opening these domains to AI improvement by translating complex tasks into verifiable problems.
  5. Personalization and fine-tuning of AI models at the individual company or user level are crucial for realizing full productivity and safety benefits beyond broad industry-specific solutions. Deep integration of proprietary data and organizational context tailors AI behavior to unique needs, driving adoption and user satisfaction.
  6. Scaling large foundation models with adaptive compute allocation is the dominant strategy for AI advancement, blurring the distinction between small and large models. This approach dynamically assigns computational resources based on task difficulty, yielding more efficient and capable AI systems.
  7. Near-term projections envisage AI models achieving superhuman reliability in coding within one to two years, enabling robust expert-level programming and the automation or dramatic augmentation of many white-collar jobs by the late 2020s. This timeline reflects confidence in current algorithms and the maturation of evaluation feedback loops.
  8. Rapid AI progress points toward AI researchers operating at a new scale, able to propose experiments without requiring massive robotics or biological datasets; the key will be integrating real-world feedback loops to convert AI innovations into substantial economic impact.
  9. Developing specialized reward models tailored to individual white-collar professions is feasible with limited data, mirroring human learning processes and enabling near-expert performance despite AI's sample inefficiency. This approach supports scalable, domain-specific AI adaptation beyond generic training (a rubric-style sketch follows below).
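
As a rough illustration of the "success rate over time horizon" idea in the first takeaway, here is a minimal, hypothetical Python sketch estimating a pass@k-style statistic: the probability that at least one of k independent attempts at a task succeeds. The task names, per-task success rates, and the run_agent() stub are invented for illustration; they are not from the episode or any Anthropic API.

```python
import random

def run_agent(task: str) -> bool:
    """Stand-in for one autonomous agent attempt; succeeds stochastically."""
    per_task_rate = {"fix-bug": 0.6, "refactor-module": 0.35, "ship-feature": 0.2}
    return random.random() < per_task_rate[task]

def pass_at_k(task: str, k: int, trials: int = 2000) -> float:
    """Fraction of trials in which at least one of k attempts succeeds."""
    wins = sum(any(run_agent(task) for _ in range(k)) for _ in range(trials))
    return wins / trials

# Reliability climbs with the attempt budget even when one-shot rates are low.
for task in ("fix-bug", "refactor-module", "ship-feature"):
    print(task, [round(pass_at_k(task, k), 2) for k in (1, 4, 16)])
```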

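Similarly, a hedged sketch of the rubric idea behind takeaways 4 and 9: a fuzzy task (contract drafting, here) is made verifiable by scoring an answer against weighted criteria, with the grader left pluggable. The rubric items, keywords, and toy grader are hypothetical, purely to make the idea concrete.

```python
from typing import Callable

# (keyword, weight) pairs; each keyword proxies one expert criterion.
CONTRACT_RUBRIC = [
    ("jurisdiction", 0.3),  # cites the governing jurisdiction
    ("termination", 0.3),   # includes a termination clause
    ("liability", 0.4),     # addresses liability terms
]

def rubric_reward(answer: str, rubric: list[tuple[str, float]],
                  grader: Callable[[str, str], bool]) -> float:
    """Weighted fraction of rubric criteria the answer satisfies.

    grader(criterion, answer) stands in for whatever evaluator is available:
    a human expert, a fine-tuned classifier, or an LLM-as-judge call.
    """
    return sum(w for criterion, w in rubric if grader(criterion, answer))

# Toy grader: naive keyword match, just to keep the sketch self-contained.
keyword_grader = lambda criterion, answer: criterion in answer.lower()

draft = ("Governing jurisdiction: Delaware. Either party may invoke the "
         "termination clause with 30 days notice. Liability is capped.")
print(rubric_reward(draft, CONTRACT_RUBRIC, keyword_grader))  # -> 1.0
```
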
Notable Quotes

""I really do think that measuring success rate over time horizon is the right way to think about this, you know, extension of, like, agent capabilities. And I think we're making a hell of a lot of progress. We're not 100% there on reliability. You know, these models don't succeed all the time.""

""I think if we were to basically, like, fall off trendline. So if, let's say, by the middle of next year, you started to see some kind of block on the time horizon with which these models are capable of acting. And I think you should look at that, like, coding is always the leading indicator in AI. So I think, like, you would see that drop off in coding first.""

""Because coding is the – that first step in which you will see AI research itself being accelerated. And so, we care a lot about coding. We care a lot about measuring progress on coding. We think it's the most sort of important leading indicator of model capabilities.""

""By the end of next year, I think we should see, like – it should be very obvious that this is near guaranteed. Even by the end of this year, that's, like – that should be, like, pretty clear. But by the end of next year, you'll have these things going around, doing a lot of things for you in your browsers.""

""One mismatch that I think we might see—and I'm actually also worried about us seeing—is you'll see a huge impact on white-collar work. And whether that looks like just dramatic augmentation, you know, like TBD, but you will see that world change a lot. And we'll need to pull forward the dramatic transformation of things that make our lives a hell of a lot better.""

""But by that time, we'll have, like, you know, millions of AI researchers, like, propose experiments. They don't need such a large scale of robotics or biological data. So AI progress goes really fast. But we need to make sure that we, like, pull in the feedback loop to the real world to actually deliver on, like, meaningfully changing world GDP and this kind of stuff.""

""I think most people in the field currently believe that, like, the pre-training plus RL paradigms, which we've explored so far, are themselves sufficient to reach AGI. This, like, we haven't seen the trend lines, like, bending yet. Like, it works. Like, this sort of combination of things. Whether there are other mountains to climb that could get us there faster, it's entirely possible.""

""The limiting factor in this will be energy compute. Like, when do you think we start to bump up against that? I think there's a great table at the end of situational awareness, which details this, where, like, by the end of the decade, we start to report, like, really, like, dramatic percentages of U.S. energy production. Like, you're over 20, I think maybe 2028, like, 20% of U.S. energy. And so you can't go orders of magnitude more than that without, like, dramatic changes. This is somewhere I think we need to invest more. I think this is one of, like, the important vectors along which governments should act.""

""Because, obviously, yeah, especially as you get into, like, a lot of these different verticals that you might want to improve on, it's, like, it's hard for you guys to figure out, like, what is, like, the specific thing in logistics or, you know, or legal or accounting or whatever it is. And it requires such expertise and taste.""

""So expect to see continual rises in model capability. Like, expect, basically, by the end of this year, the coding agents, one good metric will be the coding agents that are taking their first, like, halting steps today should be very competent. You will probably feel very confident in delegating substantial amounts of work for hours on, like, hours a few minutes. What's going to be your check-in time?""

"And I expect 2025 to feel meaningfully faster. Particularly also because as models get more capable, the set of, like, reward available to them expands in important ways. You know, if you have to give feedback on every single, like, sentence that it outputs, right, this is not very scalable. But if you're able to allow it to do hours of work in such a way that you can just judge, did it complete the thing I wanted? Did it do the right piece of analysis? Did the website work? And were people able to, you know, message on it and this kind of stuff? It means that basically it should be able to climb these, like, rungs of the ladder ever faster, even though the complexity of the tasks is increasing."

"You mentioned earlier there's, like, OpenAdCodex. There's, you know, GoogleJuels. There's, like, all this different stuff. There's all these startups building out. And we're actually launching a GitHub agency. So you'll be able to, like, anywhere on GitHub, you'll be able to say, hey, at Claude, and we'll, you know, spin off and do some work for you. Yeah, so everyone is, like, competing for the hearts and minds of developers."

"One is the, like, main metric that the labs will be judged on is how effectively they are able to convert accelerators and, like, flops and dollars, like, capital into intelligence. Like, that is the most important by far metric. And this is the metric that has sort of distinguished companies like Anthropic, companies like OpenAI and DeepMind from really, like, the rest of the pack, right? It's, like, the models that are trained by these companies are better."

"If you take, like, a single file transformer implementation, like, you know, Carpathies MinGPD or something like this. And ask the model to implement ideas that you see in papers. You will be stunned by how good it is. It is just, like, kind of wild. And then, if you go into, like, some huge, like, transformer code base and ask it, you'll notice that it's actually, it's a little bit harder. Like, the models struggle a bit more there. But they struggle less and less every month. So, that's, like, a good way to presage the future."

"And I think robotics is, like, quite potentially one of the areas where this is true. And I think this is, like, also true of, you know, of many domains, but robotics is, like, this is starkly true because our sort of progress in understanding the world has gone so far ahead of our ability to manipulate it physically."