Morgan Cheatham’s Post

Morgan Cheatham

Vice President at Bessemer Venture Partners | bio, healthcare, AI | MD Candidate

As the discourse shifts from models to compound AI systems / agents, we need better AI benchmarks to evaluate multi-modal and multi-step task performance, especially in healthcare and life sciences.

When we wrote the first paper demonstrating ChatGPT's performance on the USMLE, we chose the US Medical Licensing Exam as a benchmark for accessibility, speed, and ease. It was never intended to represent AI model performance on real-world clinical tasks. Yet today, I still see many research teams and startups using benchmarks (like the USMLE) that are ill-suited to assessing the true clinical or scientific performance and utility of the models they are developing for real-world contexts.

Benchmark development may be seen as a "less sexy" area of research, but it is of paramount importance. Years after the rise of the transformer, we still lack adequate benchmarks for many single-step tasks in biomedicine. With compound AI systems (i.e., architectures that integrate multiple AI models to perform complex tasks) emerging, we need new benchmarks for agentic behaviors. I'd argue that developing an agent with novel capabilities without at least proposing a companion benchmark (if an industry standard does not yet exist) may hinder the adoption of that agent, especially for high-stakes workflows.

Designing more benchmarks that capture or simulate real-world clinical and scientific workflows will help us mitigate the major discrepancies observed between in silico and in vivo performance and better support safe and effective deployment of AI in biomedicine. There are already brilliant people focused here, and we need more. DMs are open if you're researching or working in this area of multi-step/multi-modal benchmarking in healthcare and life sciences!

#healthcare #ai #artificialintelligence #generativeai
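To make the multi-step idea concrete, here is a minimal sketch of how a benchmark harness might score an agent across the sequential steps of a clinical task while carrying forward the dialogue history. Everything here (task content, field names, the toy rule-based agent) is hypothetical and purely illustrative, not drawn from any existing benchmark.

```python
# Purely illustrative sketch of a multi-step benchmark item and its scoring.
# All task content and names are hypothetical, not from an existing benchmark.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Step:
    prompt: str                  # input presented to the agent at this step
    accepted_answers: List[str]  # any of these counts as correct

@dataclass
class MultiStepTask:
    task_id: str
    steps: List[Step] = field(default_factory=list)

def score_task(agent: Callable[[List[str], str], str], task: MultiStepTask) -> float:
    """Run the agent through each step, carrying the dialogue history,
    and return the fraction of steps answered acceptably."""
    history: List[str] = []
    correct = 0
    for step in task.steps:
        answer = agent(history, step.prompt)
        history.extend([step.prompt, answer])
        if answer.strip().lower() in {a.lower() for a in step.accepted_answers}:
            correct += 1
    return correct / len(task.steps) if task.steps else 0.0

if __name__ == "__main__":
    # Toy task: workup ordering then interpretation, with a trivial rule-based "agent".
    task = MultiStepTask(
        task_id="chest-pain-workup-001",
        steps=[
            Step("55M with crushing chest pain. First test to order?", ["ecg", "ekg"]),
            Step("ECG shows ST elevation in II, III, aVF. Territory?", ["inferior"]),
        ],
    )
    demo_agent = lambda history, prompt: "ECG" if "order" in prompt else "inferior"
    print(f"{task.task_id}: step accuracy = {score_task(demo_agent, task):.2f}")
```

A real harness would need far richer answer matching than string comparison (structured outputs, clinician rubrics), multi-modal inputs, and credit assignment across steps, but the step-wise structure is the part most single-turn benchmarks are missing.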

On top of tying general benchmarks more closely to specific healthcare tasks, the other important step is local validation of those benchmarks within specific healthcare organizations. Benchmarks of the desired outputs need to be broadly comparable across the general model and type of data inputs, but also specifically comparable for that model as deployed on top of a specific healthcare data source or sources. The variability across and within healthcare data modalities creates a need for site-specific performance validation prior to deployment.
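As a rough sketch of what such a site-specific validation gate could look like in practice (the threshold, data shapes, and function names below are assumptions for illustration, not any organization's actual process):

```python
# Hypothetical site-level validation gate: before deployment, compare the
# model's published benchmark accuracy against its accuracy on a locally
# curated validation set. The tolerated drop is an assumed placeholder.
from typing import List, Tuple

def accuracy(predictions: List[str], labels: List[str]) -> float:
    assert len(predictions) == len(labels) and labels
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def passes_local_validation(
    benchmark_accuracy: float,
    local_pairs: List[Tuple[str, str]],  # (model prediction, site-specific gold label)
    max_drop: float = 0.05,              # tolerated drop vs. the published benchmark
) -> bool:
    preds, labels = zip(*local_pairs)
    local_acc = accuracy(list(preds), list(labels))
    print(f"benchmark accuracy={benchmark_accuracy:.2f}  local accuracy={local_acc:.2f}")
    return benchmark_accuracy - local_acc <= max_drop

if __name__ == "__main__":
    # Toy example: 0.90 on the public benchmark, but only 0.75 on local data.
    local = [
        ("sepsis", "sepsis"),
        ("no sepsis", "sepsis"),
        ("sepsis", "sepsis"),
        ("no sepsis", "no sepsis"),
    ]
    print("deploy?", passes_local_validation(0.90, local))
```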

https://youtu.be/Eb0Ga5igBuQ?si=YHOMJe3UMxYx5FS8 Great to see folks having an open dialogue on benchmarks given the recent advent of the Hugging Face Open Medical LLM Leaderboard :).

Andrew Hines

CEO at Canvas Medical

2mo

Absolutely right. But before we can hope to propose suitable benchmarks for real-world workflows, we need far more complete and fine-grained observability on the effects of agentic interventions on care team behavior across time. Health care IT historically has not done well measuring the simplest of things like questionnaire responses. The bar for instrumentation goes way, way up for complex transformer-based systems. This 10-year-old study (h/t Mark Friedberg) is a great example showing how dramatic the impact of software can be on clinician behavior, and how thoughtful the measurement needs to be in order to draw conclusions about safety and effectiveness. https://jamanetwork.com/journals/jama/fullarticle/2488307

Andy Lee

Co-founder & CBO @ Vincere Biosciences | Conquering Age-Related Decline with AI-Driven Drug Discovery

2mo

Benchmarks are like biomarkers: everyone agrees they're needed, but nobody wants to fund developing them.

Spencer Dorn

Vice Chair & Professor of Medicine, UNC | Balanced healthcare perspectives

2mo

I agree, Morgan. Showing an AI performs well on the USMLE is a nice party trick, but it does not help us assess real-world utility. Evaluation is one of those unglamorous yet critically important areas to get right. We must define the dimensions that matter, assign metrics that reflect those dimensions, and determine what constitutes meaningful differences.
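A toy sketch of that "dimensions, metrics, meaningful differences" framing (the dimensions and thresholds below are made-up placeholders, not recommendations):

```python
# Illustrative only: named evaluation dimensions, the metric chosen for each,
# and a minimum difference treated as meaningful. All values are placeholders.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Dimension:
    name: str
    metric: str                  # which metric reflects this dimension
    min_meaningful_diff: float   # smallest difference treated as meaningful

DIMENSIONS: List[Dimension] = [
    Dimension("diagnostic accuracy", "top-1 accuracy", 0.03),
    Dimension("documentation quality", "rubric score (0-1)", 0.05),
    Dimension("time burden", "minutes per encounter (lower is better)", 1.0),
]

def compare(baseline: Dict[str, float], candidate: Dict[str, float]) -> None:
    """Report, per dimension, whether the candidate differs meaningfully from baseline."""
    for d in DIMENSIONS:
        diff = candidate[d.name] - baseline[d.name]
        verdict = "meaningful" if abs(diff) >= d.min_meaningful_diff else "within noise"
        print(f"{d.name:>24}: change in {d.metric} = {diff:+.2f} ({verdict})")

if __name__ == "__main__":
    compare(
        baseline={"diagnostic accuracy": 0.78, "documentation quality": 0.62, "time burden": 11.0},
        candidate={"diagnostic accuracy": 0.80, "documentation quality": 0.70, "time burden": 9.5},
    )
```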

This is why Artisight generates its own HIPAA-compliant training and validation data at each client site, based on actual healthcare scenarios observed in actual hospital environments. You are spot on, Morgan Cheatham.

Absolutely spot-on! 🧬

Graham Walker, MD

AI/Tech Innovation @ TPMG | Medical AI & Informatics Strategy | MDCalc Creator

2mo

I have an article (not a post; it's a nerdy deep dive) I'm writing on this exact topic. The space is an absolute MESS, and I'm even embarrassed it took me this long to look into it. I'd ignorantly assumed there were adults in the room making sure someone was paying attention.

Intriguing insights, truly. Consider leveraging multi-tiered testing, like A/B/C/D/E/F/G, to surface more nuanced views of AI performance across varied clinical scenarios, enriching benchmark quality and applicability.


Innovation in AI benchmarks is vital for bridging the gap between theory and practice, much like Socrates highlighted the importance of questioning for deeper understanding 🌟 #innovation #AIprogress
