In this episode, we’re lucky to be joined by Alexandre Sallinen and Tony O’Halloran from the Laboratory for Intelligent Global Health & Humanitarian Response Technologies to discuss how large language models are evaluated, including their Massive Open Online Validation & Evaluation (MOOVE) initiative.
0:25 - Technical wrap: What are agents?
13:20 - What are benchmarks?
18:20 - Automated evaluation
20:10 - Benchmarks
37:45 - Human feedback
44:50 - LLM as judge
Read more about the projects we discuss here:
Meditron
Listen to the LiGHTCAST, including their excellent recent outline of the HealthBench paper
More details in the show notes on our website.
Episodes | Bluesky | info@medicalattention.ai