Hugging Face releases a benchmark for testing generative AI for healthcare tasks

April 19, 2024

Generative AI models are increasingly being brought into healthcare settings, in some cases perhaps prematurely. Early adopters believe the models will unlock greater efficiency while revealing insights that would otherwise be missed. Critics, meanwhile, point out that these models have flaws and biases that could contribute to worse health outcomes.

But is there a quantitative way to know how helpful or harmful a model might be when tasked with things like summarizing patient records or answering health-related questions?

Hugging Face, the AI startup, proposes an answer in a newly released benchmark test called Open Medical-LLM. Developed in collaboration with researchers from the nonprofit Open Life Science AI and the Natural Language Processing Group at the University of Edinburgh, Open Medical-LLM aims to standardize the evaluation of generative AI models' performance across a range of medicine-related tasks.

New: Open Medical LLM Leaderboard! 🩺

In basic chatbots, errors are a nuisance.
In medical LLMs, mistakes can have life-threatening consequences 🩸

It’s therefore vital to benchmark and track progress in medical LLMs before considering deployment.

Blog: https://t.co/pddLtkmhsz

— Clémentine Fourrier 🍊 (@clefourrier) April 18, 2024

Open Medical-LLM is not a benchmark per se, but rather a compilation of existing test sets (MedQA, PubMedQA, MedMCQA and others) designed to probe models for general medical knowledge and related fields such as anatomy, pharmacology, genetics and clinical practice. The benchmark contains multiple-choice and open-ended questions that require medical reasoning and understanding, and draws on material including U.S. and Indian medical licensing exams and college biology test question banks.

“[Open Medical-LLM] enables researchers and practitioners to identify the strengths and weaknesses of different approaches, drive further advances in the field, and ultimately contribute to better patient care and outcomes,” Hugging Face wrote in a blog post.

Image credit: Hugging Face

Hugging Face positions the benchmark as a “robust assessment” of generative AI models in healthcare. However, some medical experts cautioned on social media against putting too much stock in Open Medical-LLM, lest it lead to ill-informed deployments.

It's a great step forward to see these comparisons directly, but it's important that we also remember how big the gap is between the artificial environment of answering medical questions and actual clinical practice! Not to mention the idiosyncratic risks that these metrics cannot capture.

— Liam McCoy, MD MSc (@LiamGMcCoy) April 18, 2024

Hugging Face researcher Clémentine Fourrier, co-author of the blog post, agreed.

“These leaderboards should only be used as a first approximation of which [generative AI model] to explore for a given use case, but then a deeper testing phase is always needed to examine the model's limitations and relevance in real-world conditions,” Fourrier replied.

It's reminiscent of Google's experience when it tried to bring an AI screening tool for diabetic retinopathy to healthcare systems in Thailand.

Google developed a deep learning system that scanned images of the eye and looked for evidence of retinopathy, a leading cause of vision loss. But despite high theoretical accuracy, the tool proved impractical in real-world testing, frustrating both patients and caregivers with inconsistent results and a general lack of harmony with local practices.

Notably, of the 139 AI-related medical devices the U.S. Food and Drug Administration has approved to date, none uses generative AI. It is exceptionally difficult to test how a generative AI tool's performance in the lab will translate to hospitals and outpatient clinics and, perhaps more importantly, how the outcomes might trend over time.

That is not to say Open Medical-LLM isn't useful or informative. If nothing else, the results leaderboard serves as a reminder of how models actually fare on basic health questions. But neither Open Medical-LLM nor any other benchmark is a substitute for carefully thought-out real-world testing.
