Unsupervised Learning

Ep 64: GPT 4.1 Lead at OpenAI Michelle Pokrass: RFT Launch, How OpenAI Improves Its Models & the State of AI Agents Today

May 8, 2025

Summary

In this episode, Michelle Pokrass, lead of post-training at OpenAI, discusses the development and launch of GPT-4.1 and the upcoming Reinforcement Fine-Tuning (RFT) offering. A central theme is OpenAI's shift from optimizing models primarily for benchmarks to prioritizing real-world utility and developer experience, addressing pain points like instruction adherence, formatting, and context length. OpenAI's custom 'evals' typically remain relevant for roughly three months before needing updates, reflecting the rapid model improvement cycle. User feedback, often qualitative and ambiguous, guides which aspects of the model to improve, underscoring the importance of a user-centric development approach.

Instruction-following is hard to define and measure because user expectations vary widely, which calls for more sophisticated evaluation and fine-tuning methods. OpenAI offers a range of model sizes (standard, mini, and nano) to balance cost, speed, and performance, enabling broader adoption, especially in cost-sensitive applications. Fine-tuning, particularly RFT, is emphasized as a powerful tool for customizing models to specific use cases and strengthening instruction-following. GPT-4.1's development spanned a multi-month lifecycle with intensive alpha testing and rapid iteration informed by direct user input, exemplifying an agile development process.

The episode also explores how successful AI startups break problem domains down into granular evals to drive targeted improvements, part of a trend toward detailed and explainable model assessment. Modularity in AI systems is championed as an investment that pays off in long-term development agility despite initial complexity. Finally, the discussion touches on the evolving landscape of team composition, advocating for generalist engineers with deep product knowledge over exclusively research-focused AI expertise, a shift that aligns with current industry needs and startup realities.

Key Takeaways

  1. OpenAI's GPT-4.1 focuses intensely on practical utility for developers by addressing common real-world challenges such as instruction-following, formatting, and context window length, moving beyond traditional benchmark optimization.
  2. OpenAI's evaluation metrics, or 'evals,' have a typical shelf life of about three months, reflecting the rapid pace of AI progress that quickly saturates previous benchmarks.
  3. User feedback drives a highly iterative and investigative post-training research process, where vague or anecdotal reports from developers are probed through prompt engineering and experiments to isolate and resolve specific model weaknesses.
  4. Instruction-following remains one of the most challenging aspects of large language models due to its subjective interpretation by diverse users, requiring advanced fine-tuning and multi-faceted evaluation strategies.
  5. OpenAI offers multiple model sizes—including standard, mini, and the newly emphasized nano model—to accommodate a broad range of cost, speed, and performance requirements, thus democratizing AI accessibility.
  6. Reinforcement Fine-Tuning (RFT) enables more precise customization of base models to better meet user-specific instruction-following needs, effectively expanding model applicability and satisfaction (see the grader sketch after this list).
  7. The development process of GPT-4.1 featured a multi-month cycle with a focused alpha testing phase emphasizing rapid iteration and incorporation of detailed user feedback, illustrating an agile, user-driven product development model.
  8. Successful AI startups apply granular and modular evaluation frameworks by dissecting problems into actionable subcomponents, allowing precise tracking of model performance gains and trade-offs.
  9. Adopting modular AI system architectures facilitates faster iteration and experimentation by allowing easy swapping and tuning of individual components despite upfront development costs.
  10. The AI industry trend favors teams with strong generalist engineering skills combined with deep product and domain knowledge over solely relying on specialized AI research expertise for building practical applications.
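Since RFT (item 6 above) is the most technical topic in the episode, here is a minimal, purely conceptual sketch of the kind of grader a reinforcement fine-tuning job optimizes against: a function that scores a model's output against a reference. The task (invoice field extraction), the scoring scheme, and every name in the code are illustrative assumptions, not OpenAI's actual RFT API.

```python
# Conceptual grader sketch: score a model's JSON output against reference fields.
# Everything here (task, field names, partial-credit scheme) is an illustrative
# assumption, not OpenAI's RFT API.
import json


def grade_extraction(model_output: str, reference: dict) -> float:
    """Return a score in [0, 1]: the fraction of reference fields the model got right."""
    try:
        predicted = json.loads(model_output)
    except json.JSONDecodeError:
        return 0.0  # unparseable output earns no reward
    if not isinstance(predicted, dict) or not reference:
        return 0.0
    correct = sum(1 for key, value in reference.items() if predicted.get(key) == value)
    return correct / len(reference)


if __name__ == "__main__":
    reference = {"invoice_id": "INV-42", "total": 1200, "currency": "USD"}
    output = '{"invoice_id": "INV-42", "total": 1200, "currency": "EUR"}'
    print(grade_extraction(output, reference))  # 2 of 3 fields match -> ~0.67
```

Conceptually, a grader like this supplies the reward signal that nudges the model toward a customer's own definition of a correct answer during reinforcement fine-tuning.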

Notable Quotes

"They know their problem really well and actually have evals for the whole problem but can break them down into specific subcomponents. So they can tell me things like, the model got better at picking the right SQL table by this percentage, but it got worse at picking the right columns by this percentage. This level of granularity really helps you tease out what actually is working and what isn't."

"I’m really long generalists. People who understand the product are really scrappy engineers who can do anything. I honestly don't think you'll need that much expertise to combine these models and these solutions in the future."

"It's a real focus right now to make sure we can run our experiments with the fewest number of GPUs and get, you know, you basically want to kick off a job and know when you wake up in the morning that you know if this thing is working or not. Is that just, like, a pure infrastructure problem or, like, you know, like the latter part? Not really. You also need to make sure that kind of the things you're training are at sufficient scale to get signal on what exactly it is you're experimenting with."

"I think synthetic data has just been, like, an incredibly powerful trend. So excited to push this more, but every more powerful model makes it easier to improve our models in the future."

"Deep Research probably most famously is a product that I use all the time, and basically, as I understand it, like, using reinforcement learning like on a tool or set of tools, right, until the model gets really good at using it. How do you imagine that type of approach scaling for agents at large?"

"I mean, the real goal of this model was something that's a joy to use for developers. Often, you know, and we're not the only ones who do this, but sometimes you optimize a model for benchmarks, and it looks really great, and then you actually try to use it, and you stumble over basic things like, oh, it's not following my instructions, or the formatting is weird, or, you know, the context is too short to be useful. And so with this model, we really focused on what have developers been telling us for a while now that they want."

"I will say it's actually more of the opposite problem. They're not coming to us with, like, oh, I have these 100 evals. Please fix all of these. It's more like they're saying, ah, it's kind of weird in this one use case, and then we have to be like, what do you mean by that? And we, like, actually, you know, get some prompts going and figure it out. So I'll say a lot of the legwork has been just, like, talking to users and really pulling out the key insights."

"Yeah, I've really loved seeing a lot of the cool UIs people have been building. So actually, this is something we snuck in near the very end of the model, is, like, much improved UI and coding capabilities. I've also loved seeing people make use of Nano. It's, you know, small and cheap and fast."

"Like, people just have demand at all points in the cost latency curve. I feel like that answer seems to have generally been yes throughout this. You know, you guys are always cutting prices, and it seems to always keep spurring more demand."

"Yeah, I think the shelf life of an eval is, like, three months, unfortunately. Like, progress is so fast. Things are getting saturated so quickly. So, we're still on the hunt, as always."

"I think where we are is that agents work remarkably well in well-scoped domains. So, you know, a case where you have all of the right tools for the model, it's fairly clear what the user is asking for. We see that all of those use cases work really well. But now it's more about bridging the gap to, like, the fuzzy and messy real world."

"We should make it easier for developers to tune, you know, if it's ambiguous, should the model ask the user for more information or should it proceed with assumptions? It's obviously super annoying if the model is always coming back to you and be like, should I do this? Are you sure? Like, can I do this? I think we need more steerability there."

"In many ways, the underlying capabilities of the models, you know, aren't being fully shown just because we haven't connected enough context or tools into the models themselves. And it seems like there's a lot of improvement on just doing that."

"I think the Ader evals are still super useful. But then there's the ones that are just, like, fully saturated and not useful. Basically, you got to, like, use the most out of an eval during its lifespan and then move on and create another one. And so I do it. The three-month shelf life definitely is tough."

"In general, my philosophy is that we should really lean into the G in AGI and try to make one model that's general. And so ideally, I think going forward, we're going to try to simplify the product offering, try to have one model for both use cases, and, you know, simplify the model picker situation in ChatGPT as well."