Healthcare AI Algorithm Validation: Challenges and Solutions

In the ever-evolving landscape of artificial intelligence (AI) in healthcare, validation has emerged as a critical factor differentiating AI products from one another. Validation of an AI algorithm refers to assessing its efficacy on a sample dataset after the initial training process. The use of biased training and validation data that is not representative of a hospital’s patient population has been a long-standing issue and can degrade the downstream performance and reliability of these applications. A recent executive order and guidance document from the Biden administration highlight validation as a key component of the push for federal AI regulation. In this blog post, we delve into the challenges hospitals face in testing and validating AI applications and what AI companies can offer to improve the process.

The Challenge of Data Bias in AI Training and Validation:

Vendors often train and validate their algorithms on a homogeneous data sample, which yields excellent performance when the algorithm is used on a similar dataset but subpar performance on a diverse one. This lack of diversity in training and validation data has been a significant concern, with some algorithms optimized only for specific demographics, resulting in significant bias and poor real-world performance. The FDA, which has historically allowed validation on the same dataset used for training, has contributed to this problem; however, there has been a recent push to mandate the use of diverse datasets.
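
To make this failure mode concrete, below is a minimal sketch in Python, with hypothetical field names rather than any vendor’s actual tooling, of the kind of subgroup-stratified check that surfaces such bias: an algorithm can look strong in aggregate while quietly underperforming on a demographic that was underrepresented in its training data.

```python
from collections import defaultdict

def sensitivity_by_subgroup(cases):
    """Per-subgroup sensitivity (true-positive rate) for a binary classifier.

    Each case is a dict with hypothetical keys: 'group' (demographic or
    site label), 'label' (finding truly present?), 'prediction' (AI call).
    """
    tp = defaultdict(int)   # true positives per subgroup
    pos = defaultdict(int)  # ground-truth positives per subgroup
    for case in cases:
        if case["label"]:
            pos[case["group"]] += 1
            if case["prediction"]:
                tp[case["group"]] += 1
    return {g: tp[g] / pos[g] for g in pos}

# Toy data: aggregate sensitivity is 0.75, which looks acceptable, yet
# stratifying reveals one site performing far worse than the other.
cases = [
    {"group": "site_A", "label": True, "prediction": True},
    {"group": "site_A", "label": True, "prediction": True},
    {"group": "site_B", "label": True, "prediction": True},
    {"group": "site_B", "label": True, "prediction": False},
]
print(sensitivity_by_subgroup(cases))  # {'site_A': 1.0, 'site_B': 0.5}
```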

Hospitals’ Struggle with Validation:

Hospitals face a significant challenge in determining which AI applications to trust. Peer review and literature review have been the traditional methods of assessing AI algorithm quality, but they come with limitations. Some hospitals resort to using their own patient data to validate an AI’s performance, but obtaining labeled, clean, and relevant data is a considerable challenge. The manual effort required, such as collecting patient information and running algorithms on a limited dataset, is far from ideal and consumes substantial hospital resources.

Competitors’ Approaches and Limitations:

Several competitors in the market offer solutions for AI validation, employing model-operations analytics layers to visualize algorithm performance graphically. However, many of these solutions fall short of a comprehensive validation process. Visualization tools and analytics dashboards may offer insights, but they often lack the depth required to establish ground truth and truly evaluate whether an AI algorithm is performing at a high level. This leaves hospitals questioning the accuracy of the results and the reliability of the AI applications.

Ferrum’s Advantage:

Ferrum stands out in the competitive landscape by introducing a groundbreaking approach to AI validation. Unlike competitors that rely on manual processes or analytics layers alone, Ferrum pairs model operations with GPT, an advanced language model, to provide an automated assessment of ground truth on every scan the algorithm processes. This integration allows Ferrum to automatically validate AI applications against the hospital’s own patient data, eliminating the need for extensive manual work and hospital resources.

The Ferrum system establishes a ground-truth diagnosis by analyzing the finalized radiology report with a GPT language model and comparing its findings with the AI’s predictions. Discrepancies between the language classifier and the visual classifier are flagged and collected for review by Ferrum’s in-house radiologists. This automated process enables Ferrum to give hospitals insights into the performance of AI tools across demographics and to proactively identify deficiencies. Because every study in a hospital can be analyzed, the result is a more robust validation process and a real-time opportunity for post-deployment monitoring and reassessment.
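
As a rough illustration of this comparison loop, here is a minimal Python sketch. All names (`Study`, `llm_extract_finding`, `flag_discrepancies`) are hypothetical, and the keyword check merely stands in for the GPT-based report classifier, whose actual interface Ferrum has not published:

```python
from dataclasses import dataclass

@dataclass
class Study:
    study_id: str
    report_text: str   # finalized radiology report
    ai_positive: bool  # visual classifier's prediction for the finding

def llm_extract_finding(report_text: str) -> bool:
    """Stand-in for the GPT language classifier. In production an LLM would
    read the finalized report and state whether the radiologist documented
    the finding; a toy keyword check substitutes here for illustration."""
    text = report_text.lower()
    return "pulmonary embolism" in text and "no pulmonary embolism" not in text

def flag_discrepancies(studies):
    """Queue for radiologist review every study where the report-derived
    ground truth disagrees with the visual classifier's prediction."""
    return [s.study_id for s in studies
            if llm_extract_finding(s.report_text) != s.ai_positive]

studies = [
    Study("1", "Findings: acute pulmonary embolism, right lower lobe.", True),
    Study("2", "Impression: no pulmonary embolism identified.", True),
]
print(flag_discrepancies(studies))  # ['2'] -- the false positive is flagged
```

The design point is that the language classifier supplies an independent label for every study, so disagreement, rather than manual chart review, determines which cases reach a human reader.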

Vendor Neutrality and Continuous Monitoring:

Ferrum maintains vendor neutrality, a flexibility that allows the platform to identify the best individualized solutions for different hospitals and enhances the overall reliability of the validation process. It also allows Ferrum to continually monitor its tools without vendor-specific bias, giving hospitals confidence and trust in the validation results. Current regulations prevent “cross-pollination” of algorithm validation data across different health systems; however, given the diverse populations that each hospital serves, siloing validation data within a health system may actually improve accuracy and reliability.

Conclusion:

In the rapidly advancing field of AI in healthcare, validation is increasingly important to ensure that AI algorithms deliver a clinical benefit. Ferrum’s innovative approach to validation sets a new standard. By combining model operations with GPT, Ferrum not only addresses the limitations of current validation methods but also streamlines the process for hospitals. As the healthcare industry increasingly relies on AI applications, Ferrum’s vendor neutrality and commitment to accurate, reliable validation results mark a significant leap forward in ensuring the quality and effectiveness of AI in healthcare.

Varun Danda

Varun Danda, MD, is an integrated interventional radiology resident at the Icahn School of Medicine at Mount Sinai, NY. He has a passion for medical technology, focusing on artificial intelligence, augmented reality, and medical devices in radiology.
