Beyond Vibe Testing: Smarter Eval for Agentic AI
Inference Time Tactics

2025-09-08
In this episode of Inference Time Tactics, Rob, Cooper, and Byron explore Salesforce’s CRMArena-Pro benchmark and what it reveals about the limits of enterprise AI agents. They share why benchmark scores often fail to carry over to production, how inference-time tactics like best-of-N can improve reliability, and what NeuroMetric is building to make eval easier, from an ITC Test Engine to a drag-and-drop interface for rapid visualization and experimentation.

We talked about:

  • Why Salesforce’s CRMArena-Pro benchmark highlights the gap between lab benchmarks and real-world agent reliabi...