AI + a16z

Evals, Feedback Loops, and the Engineering That Makes AI Work

Feb 17, 2026

Summary

The episode examines where engineering effort matters most in AI products and where brute-force compute and data dominate. Martin Casado and Ankur Goyal argue that production engineering (evals, feedback loops, and integration quality) often matters more to product success than using the newest or largest foundation model. They discuss why open-source and Chinese models account for very high token volumes but low dollar-weighted spend, which they attribute to delivery, reliability, and integration gaps. The conversation also compares agent interfaces, with their benchmark showing that structured, typed access (e.g., SQL) outperforms unconstrained 'computer' access (bash/Unix) in many production tasks. Finally, they frame evals as the scientific method applied to non-deterministic software and suggest that engineering will become the main source of gains once brute-force improvements taper or funding normalizes.

Key Takeaways

  1. Engineering around models (evals, feedback loops, production harnesses) is often more important than the absolute best foundation model.
  2. Evals should be treated like the scientific method and serve as declarative product specifications.
  3. High token usage of open-source/Chinese models doesn't translate to high dollar-weighted spend due to integration and delivery friction.
  4. Structured, typed interfaces (e.g., SQL + typed schemas) outperform unconstrained 'computer-like' access (bash/Unix) for many agent tasks.
  5. Whether engineering or brute-force compute dominates future progress is largely a capital-flow question.

Notable Quotes

"The companies shipping AI products that actually work aren't using the smartest models. They're the ones with the best engineering around the models."

"Evals are like the scientific method applied to software engineering with non-deterministic systems like AI systems."

"In terms of number of tokens across our customer base usage of the Chinese models is very high... Dollar weighted, it's low."

"SQL is more accurate, it's more efficient, it's more token efficient, it's faster."

Episode questions

What are evals and why do they matter for AI products?

Evals are hypothesis-driven tests that simulate inputs and measure outputs (quantitatively and qualitatively) for non-deterministic systems; they act like a scientific method for AI and serve as a declarative product specification for iteration and production feedback loops.
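
To make the idea concrete, here is a minimal sketch of an eval loop in Python. It is not the tooling discussed in the episode. The `fake_model` function, the two test cases, the 0.6 threshold, and all other names are hypothetical stand-ins chosen for illustration. Because the system under test is non-deterministic, each case is sampled many times and scored as a pass rate rather than a single pass/fail.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str    # simulated input
    expected: str  # the output the product spec calls for


def fake_model(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a non-deterministic model call."""
    answers = {"2+2": "4", "capital of France": "Paris"}
    # Simulate non-determinism: return a wrong answer about 20% of the time.
    return answers.get(prompt, "") if rng.random() < 0.8 else "unsure"


def exact_match(output: str, expected: str) -> float:
    """Quantitative scorer: 1.0 on an exact match, otherwise 0.0."""
    return 1.0 if output.strip() == expected else 0.0


def run_eval(
    cases: list[EvalCase],
    model: Callable[[str, random.Random], str],
    scorer: Callable[[str, str], float],
    trials: int = 50,
    seed: int = 0,
) -> dict[str, float]:
    """Sample each case `trials` times and return its mean score."""
    rng = random.Random(seed)  # seeded so the eval run itself is repeatable
    results = {}
    for case in cases:
        scores = [scorer(model(case.prompt, rng), case.expected) for _ in range(trials)]
        results[case.prompt] = sum(scores) / trials
    return results


# The case list doubles as a declarative spec: what should come out for each input.
CASES = [EvalCase("2+2", "4"), EvalCase("capital of France", "Paris")]
THRESHOLD = 0.6  # hypothetical acceptance bar

scores = run_eval(CASES, fake_model, exact_match)
passed = all(score >= THRESHOLD for score in scores.values())
```

In a feedback loop, failing production inputs would be appended to `CASES`, so the spec grows as the product encounters new behavior.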

Why is dollar-weighted usage of Chinese/open-source models low despite high token volumes?

Because many providers of open-source models have worse delivery (rate limits, APIs), more integration friction, and higher error rates; customers often accept higher cost in exchange for predictability and uptime, preferring commercial providers despite the cheaper token economics of open-source models.
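
The gap between the two metrics is simple arithmetic, sketched below in Python. Every number here is invented for illustration; the episode gives no token counts or prices. The sketch only shows how a model family can hold most of the token volume and a small share of spend. It does not model the delivery and reliability factors described above.

```python
# (model family, tokens processed, USD per million tokens) -- all values hypothetical
USAGE = [
    ("open_weights", 800_000_000, 0.25),
    ("commercial", 200_000_000, 10.00),
]


def usage_shares(usage):
    """Return {family: (token_share, dollar_share)} for the given usage rows."""
    total_tokens = sum(tokens for _, tokens, _ in usage)
    spend = {name: tokens / 1_000_000 * price for name, tokens, price in usage}
    total_spend = sum(spend.values())
    return {
        name: (tokens / total_tokens, spend[name] / total_spend)
        for name, tokens, _ in usage
    }


shares = usage_shares(USAGE)
token_share, dollar_share = shares["open_weights"]
# With these made-up inputs: 80% of tokens, but 200 / 2200 (about 9%) of dollars.
```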

When will engineering (not brute-force compute) become the dominant limiter of progress?

Ankur suggests it's a capital-flow question: as long as frontier labs can raise massive funds, brute force continues; once marginal model improvements slow or funding normalizes, engineering efficiency (data pipelines, inference optimizations) will become the key lever.

Which interface is better for agents: giving a 'computer' (bash) or structured access (SQL/types)?

Their benchmark shows structured access (SQL + typed schemas) is more accurate, token-efficient and faster for many production tasks; constraining the environment with CS fundamentals yields more reliable results than brute-force bash approaches.
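
As a rough illustration of what "structured, typed access" can look like, the Python sketch below gives an agent two tools over a SQLite database: one that reports column names and declared types, and one that runs read-only queries. It does not reproduce the benchmark from the episode. The `orders` table, its columns, and the function names are all hypothetical; the only library used is Python's standard `sqlite3` module.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a small in-memory database with a typed, constrained schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        " id INTEGER PRIMARY KEY,"
        " customer TEXT NOT NULL,"
        " amount_cents INTEGER NOT NULL CHECK (amount_cents >= 0))"
    )
    conn.executemany(
        "INSERT INTO orders (customer, amount_cents) VALUES (?, ?)",
        [("acme", 1200), ("acme", 800), ("globex", 500)],
    )
    conn.commit()
    # Constrain the environment: after this, the connection rejects writes.
    conn.execute("PRAGMA query_only = ON")
    return conn


def describe_schema(conn: sqlite3.Connection, table: str) -> list[tuple[str, str]]:
    """Tool 1: return (column name, declared type) pairs for the agent's prompt."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [(name, col_type) for _, name, col_type, *_ in rows]


def run_query(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> list[dict]:
    """Tool 2: run a query and return rows as dicts keyed by column name."""
    cur = conn.execute(sql, params)
    columns = [d[0] for d in cur.description]
    return [dict(zip(columns, row)) for row in cur.fetchall()]


conn = make_db()
schema = describe_schema(conn, "orders")
totals = run_query(
    conn,
    "SELECT customer, SUM(amount_cents) AS total_cents "
    "FROM orders GROUP BY customer ORDER BY customer",
)
```

The agent sees a small, declared surface (three typed columns, one query tool) instead of an open shell, which is one plausible reading of "constraining the environment with CS fundamentals."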