Morgan Cheatham’s Post

Morgan Cheatham

Vice President at Bessemer Venture Partners | bio, healthcare, AI | MD Candidate

As the discourse shifts from models to compound AI systems / agents, we need better AI benchmarks to evaluate multi-modal and multi-step task performance, especially in healthcare and life sciences.

When we wrote the first paper demonstrating ChatGPT's performance on the USMLE, we chose the US Medical Licensing Exam as a benchmark for accessibility, speed, and ease. It was never intended to represent AI model performance on real-world clinical tasks. Yet today, I still see many research teams and startups using benchmarks (like the USMLE) that are ill-suited to assessing the true clinical or scientific performance and utility of the models they are developing for real-world contexts.

Benchmark development may be seen as a "less sexy" area of research, but it is of paramount importance. Years after the rise of the transformer, we still lack adequate benchmarks for many single-step tasks in biomedicine. With compound AI systems (i.e., architectures that integrate multiple AI models to perform complex tasks) emerging, we need new benchmarks for agentic behaviors. I'd argue that developing an agent with novel capabilities without at least proposing a companion benchmark (if an industry standard does not yet exist) may hinder the adoption of that agent, especially for high-stakes workflows.

Designing more benchmarks that capture or simulate real-world clinical and scientific workflows will help us mitigate the major discrepancies observed between in silico and in vivo performance and better support safe and effective deployment of AI in biomedicine. There are already brilliant people focused here, and we need more. DMs are open if you're researching or working in this area of multi-step/multi-modal benchmarking in healthcare and life sciences!

#healthcare #ai #artificialintelligence #generativeai
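To make the multi-step idea concrete, here is a minimal sketch of how a benchmark harness might score an agent across the sequential steps of a clinical task while carrying forward the dialogue history. Everything here (task content, field names, the toy rule-based agent) is hypothetical and purely illustrative, not drawn from any existing benchmark.

```python
# Purely illustrative sketch of a multi-step benchmark item and its scoring.
# All task content and names are hypothetical, not from an existing benchmark.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Step:
    prompt: str                  # input presented to the agent at this step
    accepted_answers: List[str]  # any of these counts as correct

@dataclass
class MultiStepTask:
    task_id: str
    steps: List[Step] = field(default_factory=list)

def score_task(agent: Callable[[List[str], str], str], task: MultiStepTask) -> float:
    """Run the agent through each step, carrying the dialogue history,
    and return the fraction of steps answered acceptably."""
    history: List[str] = []
    correct = 0
    for step in task.steps:
        answer = agent(history, step.prompt)
        history.extend([step.prompt, answer])
        if answer.strip().lower() in {a.lower() for a in step.accepted_answers}:
            correct += 1
    return correct / len(task.steps) if task.steps else 0.0

if __name__ == "__main__":
    # Toy task: workup ordering then interpretation, with a trivial rule-based "agent".
    task = MultiStepTask(
        task_id="chest-pain-workup-001",
        steps=[
            Step("55M with crushing chest pain. First test to order?", ["ecg", "ekg"]),
            Step("ECG shows ST elevation in II, III, aVF. Territory?", ["inferior"]),
        ],
    )
    demo_agent = lambda history, prompt: "ECG" if "order" in prompt else "inferior"
    print(f"{task.task_id}: step accuracy = {score_task(demo_agent, task):.2f}")
```

A real harness would need far richer answer matching than string comparison (structured outputs, clinician rubrics), multi-modal inputs, and credit assignment across steps, but the step-wise structure is the part most single-turn benchmarks are missing.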

On top of tying general benchmarks more closely to specific healthcare tasks, the other important step is local validation of those benchmarks within specific healthcare organizations. Benchmarks of the desired outputs need to be broadly comparable across the general model and type of data inputs, but also specifically comparable for that model as deployed on top of a specific healthcare data source or sources. The variability across and within healthcare data modalities creates a need for site-specific performance validation prior to deployment.
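As a rough sketch of what such a site-specific validation gate could look like in practice (the threshold, data shapes, and function names below are assumptions for illustration, not any organization's actual process):

```python
# Hypothetical site-level validation gate: before deployment, compare the
# model's published benchmark accuracy against its accuracy on a locally
# curated validation set. The tolerated drop is an assumed placeholder.
from typing import List, Tuple

def accuracy(predictions: List[str], labels: List[str]) -> float:
    assert len(predictions) == len(labels) and labels
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def passes_local_validation(
    benchmark_accuracy: float,
    local_pairs: List[Tuple[str, str]],  # (model prediction, site-specific gold label)
    max_drop: float = 0.05,              # tolerated drop vs. the published benchmark
) -> bool:
    preds, labels = zip(*local_pairs)
    local_acc = accuracy(list(preds), list(labels))
    print(f"benchmark accuracy={benchmark_accuracy:.2f}  local accuracy={local_acc:.2f}")
    return benchmark_accuracy - local_acc <= max_drop

if __name__ == "__main__":
    # Toy example: 0.90 on the public benchmark, but only 0.75 on local data.
    local = [
        ("sepsis", "sepsis"),
        ("no sepsis", "sepsis"),
        ("sepsis", "sepsis"),
        ("no sepsis", "no sepsis"),
    ]
    print("deploy?", passes_local_validation(0.90, local))
```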

https://youtu.be/Eb0Ga5igBuQ?si=YHOMJe3UMxYx5FS8 Great to see folks having an open dialogue on benchmarks given the recent advent of the Hugging Face Open Medical LLM Leaderboard :).

Andrew Hines

CEO at Canvas Medical

2mo

Absolutely right. But before we can hope to propose suitable benchmarks for real-world workflows, we need far more complete and fine-grained observability on the effects of agentic interventions on care team behavior across time. Health care IT historically has not done well measuring the simplest of things like questionnaire responses. The bar for instrumentation goes way, way up for complex transformer-based systems. This 10-year-old study (h/t Mark Friedberg) is a great example showing how dramatic the impact of software can be on clinician behavior, and how thoughtful the measurement needs to be in order to draw conclusions about safety and effectiveness. https://jamanetwork.com/journals/jama/fullarticle/2488307

Andy Lee

Co-founder & CBO @ Vincere Biosciences | Conquering Age-Related Decline with AI-Driven Drug Discovery

2mo

Benchmarks are like biomarkers: everyone agrees they're needed, but nobody wants to fund developing them.

Spencer Dorn

Vice Chair & Professor of Medicine, UNC | Balanced healthcare perspectives

2mo

I agree, Morgan. Showing an AI performs well on the USMLE is a nice party trick, but it does not help us assess real-world utility. Evaluation is one of those unglamorous yet critically important areas to get right. We must define the dimensions that matter, assign metrics that reflect those dimensions, and determine what constitutes meaningful differences.
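A toy sketch of that "dimensions, metrics, meaningful differences" framing (the dimensions and thresholds below are made-up placeholders, not recommendations):

```python
# Illustrative only: named evaluation dimensions, the metric chosen for each,
# and a minimum difference treated as meaningful. All values are placeholders.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Dimension:
    name: str
    metric: str                  # which metric reflects this dimension
    min_meaningful_diff: float   # smallest difference treated as meaningful

DIMENSIONS: List[Dimension] = [
    Dimension("diagnostic accuracy", "top-1 accuracy", 0.03),
    Dimension("documentation quality", "rubric score (0-1)", 0.05),
    Dimension("time burden", "minutes per encounter (lower is better)", 1.0),
]

def compare(baseline: Dict[str, float], candidate: Dict[str, float]) -> None:
    """Report, per dimension, whether the candidate differs meaningfully from baseline."""
    for d in DIMENSIONS:
        diff = candidate[d.name] - baseline[d.name]
        verdict = "meaningful" if abs(diff) >= d.min_meaningful_diff else "within noise"
        print(f"{d.name:>24}: change in {d.metric} = {diff:+.2f} ({verdict})")

if __name__ == "__main__":
    compare(
        baseline={"diagnostic accuracy": 0.78, "documentation quality": 0.62, "time burden": 11.0},
        candidate={"diagnostic accuracy": 0.80, "documentation quality": 0.70, "time burden": 9.5},
    )
```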

This is why Artisight generates its own HIPAA-compliant training and validation data at each client site, based on actual healthcare scenarios observed in actual hospital environments. You are spot on, Morgan Cheatham.

Absolutely spot-on! 🧬

Graham Walker, MD

AI/Tech Innovation @ TPMG | Medical AI & Informatics Strategy | MDCalc Creator

2mo

I have an article (not a post; it's a nerdy deep dive) I'm writing on this exact topic. The space is an absolute MESS, and I'm even embarrassed it took me this long to look into it. I'd ignorantly assumed there were adults in the room making sure someone was paying attention.

Intriguing insights, truly. Consider leveraging multi-tiered testing, like A/B/C/D/E/F/G, to surface more nuanced views of AI performance across varied clinical scenarios, enriching benchmark quality and applicability.


Innovation in AI benchmarks is vital for bridging the gap between theory and practice, much like Socrates highlighted the importance of questioning for deeper understanding 🌟 #innovation #AIprogress
