AI incidents, audits, and the limits of benchmarks

Feb 13, 2026

AI Product Management User Experience Business Startups

Summary

The episode examines the gap between research benchmarks and real-world AI safety, drawing on Sean McGregor’s work with the AI Incident Database and the AI Verification & Evaluation Research Institute. It emphasizes that practical AI is defined by systems that produce real-world consequences, and that benchmarks and lab tests often fail to predict brittle failures in deployed systems. The conversation covers sourcing and classifying incidents, challenges of voluntary reporting versus potential mandatory reporting, and the scale trade-offs of indexing many small harms versus focusing on high-impact events. The hosts also discuss the role of third-party audits, lessons from red-teaming (e.g., DEF CON exercises), and the need for new evaluation approaches for general-purpose models and composed systems.

Key Takeaways

1. Focus safety work on deployed systems with real-world consequences rather than academic demos.
2. Benchmarks and lab tests are necessary for research but insufficient as deployment guarantees.
3. Incident collection at scale enables pattern discovery but faces sourcing limits that regulation could address.
4. General-purpose frontier models make exhaustive verification infeasible, requiring new verification and high-level guarantees.
5. Third-party audits are increasingly essential to build trust and validate vendor claims.
6. Red-teaming and security-style testing reveal exploitable integration and handoff failures that benchmarks miss.

Notable Quotes

"I focused on reinforcement learning as applied to wildfire suppression policy... I had a very strong sense of the power of the technology ... but also the brittleness of it."

"And so we've, in that project, collected more than 5,000 human annotated reports of AI incidents."

"Those are collected across more than 1,000 discrete incident records at this point."

"We can prove existence. We can prove it's happening."

"This, this basically broke the safety frame."

"Cause the answer is going to be no."

"The world is hard."

"Evaluating the benchmarks are useful is you're basically checking those receipts."
