AI + a16z

Evals, Feedback Loops, and the Engineering That Makes AI Work

Feb 17, 2026

Summary

The episode focuses on where engineering effort matters most in AI products versus where brute-force compute and data dominate. Martin Casado and Ankur Goyal argue that production engineering—evals, feedback loops, and integration quality—often matters more to product success than using the newest or largest foundation model. They discuss how open-source and Chinese models drive very high token volumes but low dollar-weighted spend because of delivery, reliability, and integration gaps. The conversation also contrasts approaches for agent interfaces, showing structured, typed access (e.g., SQL) outperforms unconstrained 'computer' access (bash/Unix) in many production tasks. Finally, they frame evals as the scientific method applied to non-deterministic software and suggest a shift to engineering wins once brute-force gains taper or funding normalizes.
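The gap between token volume and dollar-weighted spend is easy to see with a little arithmetic. The sketch below uses invented figures (the model names, token counts, and per-token prices are all assumptions for illustration, not numbers from the episode) and only shows how the two shares can diverge. It does not model the delivery, reliability, or integration reasons the speakers give.

```python
# Hypothetical usage figures, invented purely for illustration.
usage = {
    "open_model": {"tokens": 900_000_000, "usd_per_million_tokens": 0.20},
    "frontier_model": {"tokens": 100_000_000, "usd_per_million_tokens": 10.00},
}


def shares(usage):
    """Return each model's share of total tokens and of total dollars."""
    spend = {
        name: u["tokens"] / 1_000_000 * u["usd_per_million_tokens"]
        for name, u in usage.items()
    }
    total_tokens = sum(u["tokens"] for u in usage.values())
    total_spend = sum(spend.values())
    return {
        name: {
            "token_share": usage[name]["tokens"] / total_tokens,
            "dollar_share": spend[name] / total_spend,
        }
        for name in usage
    }


result = shares(usage)
for name, s in result.items():
    print(f"{name}: {s['token_share']:.0%} of tokens, {s['dollar_share']:.0%} of dollars")
```

Under these assumed numbers, the cheaper model carries most of the tokens while accounting for a small fraction of spend, which is the shape of the pattern described above.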

Key Takeaways

  1. Engineering around models (evals, feedback loops, production harnesses) is often more important than the absolute best foundation model.
  2. Evals should be treated like the scientific method and serve as declarative product specifications.
  3. High token usage of open-source/Chinese models doesn't translate to high dollar-weighted spend due to integration and delivery friction.
  4. Structured, typed interfaces (e.g., SQL + typed schemas) outperform unconstrained 'computer-like' access (bash/Unix) for many agent tasks.
  5. Whether engineering or brute-force compute dominates future progress is largely a capital-flow question.
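One way to read "evals as declarative product specifications" is that each eval case states what acceptable output looks like, and a harness measures how often the system meets it. The following is a minimal sketch under that reading, not the speakers' tooling: `EvalCase`, `fake_model`, and `run_evals` are hypothetical names, and `fake_model` is a deterministic stand-in for a real, non-deterministic model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # declares what counts as acceptable output


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call (deterministic here)."""
    canned = {
        "Refund policy?": "Refunds are available within 30 days.",
        "Cancel my plan": "Sure, I have deleted your account.",
    }
    return canned.get(prompt, "")


CASES = [
    EvalCase("mentions_refund_window", "Refund policy?",
             lambda out: "30 days" in out),
    EvalCase("does_not_delete_account", "Cancel my plan",
             lambda out: "deleted" not in out.lower()),
]


def run_evals(model, cases, trials=3):
    """Run each case several times and report its pass rate.

    Repeated trials matter for a real model, whose output can vary per call.
    """
    results = {}
    for case in cases:
        passes = sum(case.check(model(case.prompt)) for _ in range(trials))
        results[case.name] = passes / trials
    return results


scores = run_evals(fake_model, CASES)
for name, rate in scores.items():
    print(f"{name}: pass rate {rate:.2f}")
```

Here the second case fails on every trial, which is the kind of hypothesis-and-measurement feedback loop the scientific-method framing suggests: change the prompt or harness, rerun, and compare pass rates.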

Notable Quotes

"The companies shipping AI products that actually work aren't using the smartest models. They're the ones with the best engineering around the models."

"Evals are like the scientific method applied to software engineering with non-deterministic systems like AI systems."

"In terms of number of tokens across our customer base, usage of the Chinese models is very high... Dollar-weighted, it's low."

"SQL is more accurate, it's more efficient, it's more token efficient, it's faster."
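To make the structured-access idea concrete, here is a small sketch of what a typed, query-only tool for an agent might look like, using Python's standard `sqlite3` module. The `orders` schema, its data, and the `run_agent_query` helper are assumptions made up for illustration, and the `SELECT`-prefix check is a deliberately crude guard rather than a real security boundary.

```python
import sqlite3

# Hypothetical typed schema and sample rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer TEXT NOT NULL,
    amount_cents INTEGER NOT NULL CHECK (amount_cents >= 0)
);
INSERT INTO orders (customer, amount_cents) VALUES
    ('acme', 1200), ('acme', 800), ('globex', 500);
""")


def run_agent_query(conn, sql, params=()):
    """Run one SELECT statement on behalf of an agent and return its rows."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    # Connection.execute runs a single statement, so stacked statements fail.
    return conn.execute(sql, params).fetchall()


rows = run_agent_query(
    conn,
    "SELECT customer, SUM(amount_cents) FROM orders "
    "GROUP BY customer ORDER BY customer",
)
print(rows)
```

The agent gets one narrow, schema-described entry point instead of a general shell, so its output is a short query and the result is a small set of typed rows rather than free-form command output.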