Inference Time Tactics

Beyond Vibe Testing: Smarter Eval for Agentic AI

2025-09-08

In this episode of Inference Time Tactics, Rob, Cooper, and Byron explore Salesforce’s CRMArena-Pro benchmark and what it reveals about the limits of enterprise AI agents. They share why benchmark scores often fail in production, how inference-time tactics like best-of-N can improve reliability, and what NeuroMetric is building to make eval easier—from an ITC Test Engine to a drag-and-drop interface for rapid visualization and experimentation.

We talked about:

Why Salesforce’s CRMArena-Pro benchmark highlights the gap between lab benchmarks and real-world agent reliability.
How leading models perform inconsistently across single-turn and multi-turn enterprise tasks.
Why benchmark scores are weak predictors of operational success in production.
The role of inference-time tactics in reducing variance and improving stability.
NeuroMetric’s new platform: ITC Test Engine and drag-and-drop interface for experimentation.
Challenges in building agentic systems, from database integration to managing multi-prompt complexity.
Why large language models’ stochastic nature conflicts with business demands for reliability.
Latency, cost, and rate limits as major bottlenecks in scaling agentic workflows.
The limits of “vibe testing” and why rigorous evaluation frameworks are essential.
How Google’s Stacks tool speeds up evaluation with LLM-as-judge, and why it still falls short for enterprise needs.
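A rough sketch of the best-of-N tactic mentioned above, for readers who want to see the idea in code. This is not the hosts' or NeuroMetric's implementation. Every name here (best_of_n, make_noisy_agent, exact_match_score, success_rate) is hypothetical, the "agent" is a simulated coin flip rather than a real model call, and the scorer is assumed to be a perfect verifier. Real scorers such as an LLM-as-judge are imperfect, so real gains are usually smaller.

```python
import random
from typing import Callable, Tuple


def best_of_n(generate: Callable[[], str],
              score: Callable[[str], float],
              n: int) -> Tuple[str, float]:
    """Draw n candidates and return the highest-scoring one with its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate() for _ in range(n)]
    scored = [(score(c), c) for c in candidates]
    # max() keeps the first candidate when scores tie.
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def make_noisy_agent(rng: random.Random, p_correct: float) -> Callable[[], str]:
    """Hypothetical stand-in for a stochastic agent: right with probability p_correct."""
    def generate() -> str:
        return "correct" if rng.random() < p_correct else "wrong"
    return generate


def exact_match_score(answer: str) -> float:
    """Assumed perfect verifier: 1.0 for the right answer, 0.0 otherwise."""
    return 1.0 if answer == "correct" else 0.0


def success_rate(n: int, trials: int, p_correct: float, seed: int = 0) -> float:
    """Fraction of trials in which best-of-n returns the right answer."""
    rng = random.Random(seed)
    agent = make_noisy_agent(rng, p_correct)
    wins = 0
    for _ in range(trials):
        answer, _ = best_of_n(agent, exact_match_score, n)
        wins += answer == "correct"
    return wins / trials


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(n, success_rate(n, trials=2000, p_correct=0.4))
```

Under these assumptions the success rate approaches 1 - (1 - p)^n, so a simulated agent that is right 40% of the time per attempt should succeed roughly 92% of the time with n = 5. The trade-off is n times the calls, which is where the latency, cost, and rate-limit bottlenecks listed above come in.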

Resources Mentioned:

CRMArena-Pro from Salesforce:

https://www.salesforce.com/blog/crmarena-pro/

Connect with Neurometric:
Website: https://www.neurometric.ai/

Substack: https://neurometric.substack.com/

X: https://x.com/neurometric/

Bluesky: https://bsky.app/profile/neurometric.bsky.social

Hosts:

Rob May

https://x.com/robmay

https://www.linkedin.com/in/robmay

Calvin Cooper

https://x.com/cooper_nyc_

https://www.linkedin.com/in/coopernyc

Guest:

Byron Galbraith

https://x.com/bgalbraith

https://www.linkedin.com/in/byrongalbraith
