Practical AI: Machine Learning, Data Science, LLM

AI incidents, audits, and the limits of benchmarks

Feb 13, 2026

Summary

The episode explores how AI is transitioning from research to consequential real-world deployment, focusing on incident reporting, auditing, and the limits of benchmarks. Sean McGregor describes the AI Incident Database—its scale, harm-based definition of incidents, and sourcing challenges—and argues that collected incidents create learnable datasets akin to aviation or medical adverse-event reporting. The guests examine how general-purpose LLMs (e.g., GPT-like models) break traditional safety assumptions, making exhaustive verification infeasible and increasing the need for domain-specific pilots, red-teaming, and meta-evaluation of benchmarks. They also discuss practical governance questions: voluntary versus mandatory reporting, the utility and limits of benchmarks and leaderboards, and the growing role of third-party audits to validate vendor claims.

Key Takeaways

  • A curated incident corpus is essential: the AI Incident Database provides scale and structure for learning from failures.
  • Define incidents by harm to prioritize actionable reporting and remediation.
  • Benchmarks and leaderboards are insufficient without meta-evaluation and domain validation.
  • General-purpose LLMs make exhaustive safety verification impractical — run pilots and domain-specific tests instead.
  • Independent third-party audits increase trust by validating evidence rather than accepting vendor claims.
  • Sourcing and coverage remain hard — there’s a trade-off between cataloging many small harms and focusing on high-impact systemic incidents.

Notable Quotes

"And so we've, in that project, collected more than 5,000 human annotated reports of AI incidents."

"Those are collected across more than 1,000 discrete incident records at this point."

"effectively an event that a harm has taken place. That's an incident."

"Like if you make a billion people slightly more depressed, non-zero number of people probably have died as a result of that."

"This basically broke the safety frame."

"Cause the answer is going to be no."

"The world is hard. The real world is real hard."

"Evaluating [whether] the benchmarks are useful is [that] you're basically checking those receipts."