No Priors: Artificial Intelligence | Technology | Startups

Inside Deep Research with Isa Fulford: Building the Future of AI Agents

Apr 24, 2025

Summary

In this episode of No Priors, Sarah interviews Isa Fulford, a key figure behind OpenAI’s Deep Research, about the development and future trajectory of AI agents. The discussion opens with the inception of Deep Research and its distinguishing capability: performing multi-step research tasks by integrating reasoning with tools such as web browsing and Python execution. Isa explains that the product targets well-defined, read-only information-synthesis tasks rather than transactional ones, aligning with OpenAI’s broader ambition of building an Artificial General Intelligence (AGI) capable of scientific discovery.

A notable segment covers the role of human expert data in training, which supplements synthetic datasets so agents can evaluate information quality and relevance with nuanced reasoning. The episode also highlights significant privacy and security challenges as agents gain access to sensitive personal or corporate data, such as GitHub repositories and passwords. Isa describes emergent autonomous planning, where the agent formulates research strategies without being explicitly trained to do so, signaling advanced problem-solving capabilities.

Another key challenge is latency: Deep Research is slower than typical search because of its complex tool use, and balancing speed with research depth remains an open design trade-off. Isa envisions a future where AI agents unify specialized capabilities for coding, browsing, and reasoning within seamless, assistant-like experiences that let users override or collaborate fluidly. The conversation also touches on ongoing improvements via reinforcement fine-tuning, reducing hallucinations through citation transparency, and building trust through guardrails and confirmations that enable safe autonomous action.
Looking ahead, Deep Research and similar agents aim to compress tasks traditionally requiring days or weeks into hours or days, promising profound productivity gains subject to scaling and safety considerations. The episode closes on insights about agent memory and contextual continuity as crucial for maintaining long-term, compounding research workflows, alongside the importance of expanding the agent’s toolset and private data access for enterprise relevance.

Key Takeaways

  1. AI agents like Deep Research face elevated security and privacy challenges when granted access to sensitive user data such as GitHub repositories, passwords, and confidential documents. Managing these risks requires robust privacy controls, data governance, and ethical design to prevent data breaches or unintended leaks.
  2. Context window limitations in transformer models challenge agents' ability to handle extended, multi-hour tasks without losing track of prior information. Efficient context or memory management techniques are necessary to sustain continuity over long durations.
  3. Deep Research is optimized for answering very specific, well-defined queries by combining base language model knowledge with live access to current online data, outperforming generalist models in precision and timeliness.
  4. Latency remains a practical limitation for Deep Research, as the multi-step reasoning and tool usage involved cause longer response times compared to instantaneous search or simpler assistants. Finding a balance between depth and speed is a key UX challenge.
  5. The future vision for AI agents includes unified systems that combine specialized competencies such as coding, browsing, and general assistance, enabling seamless transitions and user overrides, simulating collaboration with a human coworker.
  6. Reinforcement fine-tuning (RFT) on browsing and interaction tasks has substantially improved Deep Research's reasoning and retrieval capabilities, though some failure modes with inexplicable errors persist.
  7. Deep Research agents exhibit emergent autonomous planning despite no explicit training to do so, formulating research strategies before task execution, which enhances efficiency but requires careful oversight.
  8. Managing safety risks from autonomous agent actions focuses less on traditional error prevention and more on avoiding unintended side effects, such as sending embarrassing emails, necessitating guardrails and confirmation protocols that evolve with user trust.
  9. Reducing hallucinations in AI-generated responses remains a critical challenge; Deep Research models reduce hallucination rates compared to predecessors and employ citation mechanisms to facilitate user verification and trust.
  10. By accelerating tasks that traditionally take days or weeks, such as large-scale research projects or thesis writing, Deep Research and similar AI agents have the potential to revolutionize productivity, contingent on overcoming scalability and safety hurdles.
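The guardrail-and-confirmation pattern from takeaway 8 can be sketched in a few lines of Python. This is a hypothetical illustration of the idea, not OpenAI's implementation; the action names, the `READ_ONLY` set, and the `confirm` callback are all invented:

```python
# Hypothetical sketch of a confirmation guardrail for agent actions.
# Read-only actions run freely; side-effecting ("write") actions need
# user confirmation unless the user has already marked them as trusted.

READ_ONLY = {"search", "open_page", "run_python"}

def execute(action, args, trusted, confirm):
    """Run a read-only action directly; gate side-effecting actions.

    trusted: set of action names the user has pre-approved.
    confirm: callback asking the user to approve this specific call.
    """
    if action in READ_ONLY or action in trusted:
        return f"executed {action}"
    if confirm(action, args):
        trusted.add(action)  # trust accumulates as the user approves
        return f"executed {action}"
    return f"blocked {action}"

# Usage: sending email requires approval the first time only.
trusted = set()
print(execute("search", {"q": "reviews"}, trusted, lambda a, kw: False))
print(execute("send_email", {"to": "x"}, trusted, lambda a, kw: True))
print(execute("send_email", {"to": "y"}, trusted, lambda a, kw: False))
```

This mirrors the flow Isa describes for Operator-style products: confirm every write action at first, then let approvals relax the guardrail over time.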

Notable Quotes

"Deep research is very good when you have a very specific or well-defined query. So maybe not a general overview of a topic, but you're looking for some specific information and you think it would be supplemented by existing research online. Even if we also trained the base model on that information, I think having live access to it is quite useful."

"Deep research is not instantaneous. It's thinking and using tools. Can it be faster? Yeah. I do think there's a good middle ground in between where sometimes you don't want it to do really deep research, but you want it to do more than a search. And I think that we will release things soon that people will be happy about and we'll fill that gap."

"The mental model I have for this is my general ethos is actually I love the people I work with. I prefer to work with fewer people with less management overhead, all things considered, because each person has more context and I have more understanding of them. And so like the universally useful agent is attractive. And you only have to tell it something once and it will remember and then it will have state on everything you're working on. Things like that."

"I think anything that would take, I mean, right now in five or 30 minutes, it can do what human experts say would take many hours. So I guess in an hour, it could do something that would take a human days. In a day, it could, you know, do something that would take a human weeks. Obviously, there'll be a lot of challenges to get it to scale like that. But I think you can imagine it doing a research project that would have taken weeks to complete, or writing a thesis or something like that."

"So I think with deep research, since it can't actually take actions, there aren't the same class of typical agent safety problems you would think of. But I think the fact that the responses are much more comprehensive and take longer means that people will trust them more. So I think maybe hallucinations is a bigger problem. While this model hallucinates less than any model that we've ever released, it is still possible for it to hallucinate, most times because it will infer something incorrectly from one of its sources. So that's part of the reason we have citations, because it's very important that the user is able to check where the information came from. And if it's not correct, they can hopefully figure it out. But yeah, that's definitely one of the biggest model limitations and something that we're always actively working to improve."
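The citation mechanism Isa describes, pairing each claim with the source it came from so readers can verify it, can be sketched roughly as follows. The `add_claim` and `render` helpers are hypothetical illustrations, not Deep Research's actual machinery:

```python
# Hypothetical sketch of citation-backed report assembly: every claim must
# carry the source it was drawn from, so readers can check provenance.

def add_claim(report, claim, source):
    """Append a (claim, source) pair; reject any claim without a source."""
    if not source:
        raise ValueError("uncited claim rejected: " + claim)
    report.append((claim, source))
    return report

def render(report):
    # Render claims with numbered citation markers, then a source list.
    body = " ".join(f"{claim} [{i + 1}]" for i, (claim, _) in enumerate(report))
    refs = "\n".join(f"[{i + 1}] {src}" for i, (_, src) in enumerate(report))
    return body + "\n\n" + refs

report = []
add_claim(report, "Product A is top-rated.", "reddit.com/r/reviews")
add_claim(report, "Product B is cheaper.", "example.com/pricing")
print(render(report))
```

Refusing uncited claims at assembly time is one simple way to make hallucinated statements easier for a reader to catch.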

"Like if you ask it to do a task for you and then in the process it sends an embarrassing email or something like this, you know, that's not a successful completion of the task. So I think that is going to be a much more interesting and difficult safety area that we're starting to tackle. You can tell me if you just don't have a projection here, but do you think people are going to want explicit guardrails? Do you think you can learn a bunch of those characteristics in the model itself? If you've used Operator, I'm sure you have, you have to confirm every write action. I think to start with, that makes a lot of sense. You want to build trust with users. And as the models become more capable, maybe you've seen it successfully do things a few times and you start to trust them more. And so maybe you allow it to, okay, you don't have to ask me every time you send an email to these people, like that's fine. But I do think that as these agents start to roll out, we will definitely want to have guardrails and confirmation just so, you know, while they're not at the end-state capability, we still want to make sure we have a good level of oversight. But I think that they will get so good that we'll just trust them to do things on our behalf."

"This is a new agentic product that OpenAI released in February of this year, which uses reasoning and tools like web browsing to complete multi-step research tasks for you. Today, they're making it free to all U.S. users."

"As to deep research, I think obvious next steps would be to have access to private data, like being able to do research over, you know, any internal documentation or GitHub, whatever it is. There's a golden thread here, because when we first met, you were working on retrieval, and I was like, there cannot be only one person at this company working on retrieval. Everything, all roads lead back to retrieval. So I think that will be really cool. And then eventually taking write actions or calling APIs. And then obviously there are just a lot of things that the model is not perfect at now that we just need to improve. But I think we have a really cool working relationship with the reinforcement learning team. So a lot of teams will contribute data sets to the big runs that they do. So we contribute data sets, and then as they train models, you know, with a ton of compute, it just becomes a better base model for us to continue training from. So the capabilities are compounding."

"I think, you know, uploading a file or something like that and having it do some analysis for you, or do some research and then create a report with numerical analysis, is pretty interesting. I actually haven't tried this. And it's not a browsing task. What makes the model particularly good at this, or what is it capable of? Is it really the multi-step planning and understanding of the task, and producing a report that's cohesive? Yeah, I think also the model that we started fine-tuning from, O3, is just a very capable model. It's trained on many different data sets, including a lot of coding, reasoning, and math tasks. So that inherited capability is pretty strong. And then when you add the browsing on top of that, it's still able to do that analysis. So I think those two together can be quite powerful."

"And I think the stakes are way higher when it has access to your GitHub repositories and your passwords and your private data. So I think that's a really big challenge."

"We really started by grounding the research and what product use cases we actually wanted the final model to be good at. So we literally would write out just a list of things like, I hope the model could find this list of products for me and rank them by these reviews from Reddit or something like that. Or I want it to be able to write a literature review on this topic."

"I think the overall goal for OpenAI is to create an AGI that can make new scientific discoveries. And we kind of felt that a prerequisite to that is to be able to synthesize information. You know, if you can't write a literature review, you're not going to be able to write a new scientific paper. So it felt very in line with the company's broader goals."

"So right now we just have the browsing tool, which is a text-based browser, but it can see embedded images and open PDFs. And then also it has access to a Python tool. So it can do analysis and calculations and plot graphs and things like that. But you can imagine in future versions, we'll just expand the tool set."
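A toolset like the one described, a text-based browser plus a Python tool, implies a dispatch loop between tools. Here is a minimal sketch under invented names; a real agent would have the model choose each tool call and react to observations rather than follow a fixed plan:

```python
# Minimal sketch of an agent loop dispatching between two tools, in the
# spirit of the browsing + Python setup described above. All names are
# hypothetical stand-ins, not Deep Research's actual interface.

def browse(url):
    # Stand-in for a text-based browser fetch returning page text.
    return f"text of {url}"

def run_python(code):
    # Stand-in for sandboxed Python execution; here it only evaluates
    # simple expressions with builtins disabled.
    return eval(code, {"__builtins__": {}}, {})

TOOLS = {"browse": browse, "python": run_python}

def run_agent(steps):
    """Execute a plan of (tool, argument) steps, collecting observations."""
    observations = []
    for tool, arg in steps:
        observations.append(TOOLS[tool](arg))
    return observations

# A toy "research" plan: fetch a page, then compute a statistic.
plan = [("browse", "https://example.com/report"), ("python", "2 + 2")]
print(run_agent(plan))
```

Expanding the toolset, as Isa suggests future versions will, amounts to registering more entries in the dispatch table.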

"I think if you have a very specific task that you think is so different to anything that the model was likely trained on and you try it a bunch of times yourself and you've tried a lot of different prompts and it's just really not good at it... I think that is a good time to try reinforcement fine-tuning."