Last Week in AI

#235 - Opus 4.6, GPT-5.3-codex, Seedance 2.0, GLM-5

Feb 16, 2026

Summary

The episode surveys major AI product and research news, emphasizing a wave of capability and infrastructure advances across large models, generative media, and open-weight competitors. The hosts highlight Anthropic's Opus 4.6, with a 1M-token context window and agent-team features; OpenAI's GPT-5.3 Codex, plus a low-latency Codex Spark variant running on Cerebras hardware; and Google's Gemini 3 Deep Think, which posts large benchmark gains amid sparse safety documentation. Significant progress in generative media is also covered: ByteDance's Seedance 2.0 and Seedream 5.0, Alibaba's Qwen Image 2.0, and xAI's Grok Imagine API push text- and image-to-video realism and multi-input prompting. The episode also discusses ecosystem dynamics: open and hybrid releases (GLM-5, Qwen3 Coder Next, DeepSeek), adapter efficiency (Tiny LoRA), reinforcement-style world-model learning for agents, and the security and evaluation challenges that accompany rapid rollout.

Key Takeaways

  • Very large context windows are becoming a practical differentiator for long-form reasoning and multi-scene inputs.
  • Latency-optimized model variants shift deployment trade-offs toward interactive use cases.
  • Generative video and multi-input media models are rapidly improving and have implications beyond content creation.
  • Small, targeted adapters can capture most reasoning gains with minimal parameters.
  • Evaluation, safety documentation, and heterogeneous eval setups complicate direct model comparisons.
  • Training agents via diverse, random interactions to learn world models improves sample efficiency for downstream RL.

Notable Quotes

"Opus 4.6 notably has a 1 million token context window as opposed to the 200,000 token [window] of previous models."

"They say it's at more than 1000 tokens per second which is at least a 5x speed up... typical thing is 100 to 200 tokens per second."

"On ARC-AGI-2 Gemini 3 Deep Think got to 84.6% pass rate that's compared to 68.8% from Opus 4.6."

"Roughly speaking, there are 13 parameters that they can use to capture 90% of the performance on reasoning tasks that you get from a fully fine-tuned reasoning model."
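The low-rank adapter idea behind claims like this can be sketched as follows. This is a minimal illustration of the general LoRA-style construction, not the episode's or Tiny LoRA's actual method; the matrix sizes, rank, and NumPy implementation are all assumptions chosen for clarity:

```python
import numpy as np

# LoRA-style adapter sketch: a frozen weight W gets a small trainable
# low-rank update B @ A, so only r * (d_in + d_out) parameters are
# trained instead of d_in * d_out. All sizes here are illustrative.

rng = np.random.default_rng(0)
d_in, d_out, r = 512, 512, 4   # hypothetical layer sizes and rank

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection (zero init)

def adapted_forward(x):
    """Forward pass with the adapter: y = W x + B (A x)."""
    return W @ x + B @ (A @ x)

x = rng.standard_normal(d_in)
# With B initialized to zero, the adapter starts as a no-op update:
assert np.allclose(adapted_forward(x), W @ x)

full_params = W.size
adapter_params = A.size + B.size
print(f"adapter params: {adapter_params} vs full: {full_params} "
      f"({adapter_params / full_params:.1%})")
```

At rank 4 the adapter trains roughly 1.6% of the layer's parameters; the episode's quote describes a far more extreme version of this same parameter-efficiency trade-off.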

"The cool thing is you actually don't need good trajectories to learn good world models — you just need diversity in your interactions and even a terrible policy ... still creates thousands of valid state transitions."
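The claim in this quote can be illustrated with a toy experiment: a random ("terrible") policy still produces valid state transitions, which are enough to fit a dynamics model. The linear environment, dimensions, and least-squares world model below are illustrative assumptions, not the method discussed in the episode:

```python
import numpy as np

# Toy sketch: collect transitions with a purely random policy in a
# hypothetical linear environment s' = M s + N a + noise, then fit a
# world model [M | N] by least squares on those transitions.

rng = np.random.default_rng(1)
ds, da, steps = 4, 2, 5000
M = rng.standard_normal((ds, ds)) * 0.25   # true (unknown) dynamics
N = rng.standard_normal((ds, da)) * 0.25

s = rng.standard_normal(ds)
X, Y = [], []
for _ in range(steps):
    a = rng.standard_normal(da)            # random policy: no skill needed
    s_next = M @ s + N @ a + rng.standard_normal(ds) * 0.01
    X.append(np.concatenate([s, a]))       # (state, action) features
    Y.append(s_next)
    s = s_next

# Least-squares world model over the random-policy transitions.
W_hat, *_ = np.linalg.lstsq(np.array(X), np.array(Y), rcond=None)
M_hat, N_hat = W_hat.T[:, :ds], W_hat.T[:, ds:]
print("max dynamics error:", np.abs(M_hat - M).max())
```

Because the random actions diversify the visited (state, action) pairs, the recovered dynamics closely match the true ones, which is the sample-efficiency point the quote is making.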

"Opus 4.6 did a really good job; it achieved an average balance of over eight thousand dollars [on vending-bench]."

"Hot mess is the opposite of systematic misalignment."