The Rocky Road of Standardization in AI Healthcare: Making Signal from Noise

Screening AI tools have shown incredible promise in cancer detection, with the potential to reduce radiologist workload and improve patient outcomes. However, the lack of clear quality metrics and standards hinders their widespread adoption and undermines confidence in their recommendations. For AI to be considered an evidence-based “gold standard” alongside the expert radiologist’s opinion, we have to look toward standardization of AI quality. In this week’s Domain Knowledge, we explore the need for quality standardization, what it means, and how it can best be established in the ever-changing field of healthcare AI. We will hear from Leon Bergen, a professor at UCSD and an expert in natural language processing.

Despite the need for standards, quality means something different to every institution and vendor.

Here’s how Leon Bergen suggests quality can be defined for your institution:

  • Use-case scenarios: Understanding the specific use-case scenarios for the AI tool is essential in defining the required diagnostic accuracy or quality standards. For example, the cost of errors differs between a triage or screening mammography tool and a second-reader (decision support) tool. Higher accuracy thresholds must be met depending on the AI tool’s role in the decision-making process.

    AI system performance may also differ across use cases and sites. The IBM Watson oncology tool, deployed at MD Anderson and MSKCC, was found to perform inconsistently when extracting information from medical records, depending on the exact task.

  • Cost-benefit analysis: AI tools involve significant expense, and a group decision is needed based on the resources available to a given institution. It is important to first look at baseline rates of both false positives, which can lead to unnecessary use of resources such as biopsies for patients with BI-RADS 2 lesions, and false negatives, which are missed or overlooked findings and may be far more serious.

    Comparing these existing error rates to the improvements the AI model can bring is a good way to begin assessing whether the cost of using AI is worthwhile. Institutions must set their own acceptable level of improvement, taking into account their financial resources, budget limits, and the level of accuracy needed for their specific situation. This approach ensures the most efficient use of resources and the best possible outcomes from AI deployment.

  • Continuous process improvement: By observing inter-site and intra-site variations over time, we can discern patterns that might otherwise be obscured in a standalone snapshot of data. These could include changes in diagnostic accuracy, varying effects of certain procedures, or fluctuations in resource use. Recognizing these trends allows an institution to understand the overall trajectory of its operations and facilitates more informed decision-making. Continuous monitoring also serves as an early warning system, enabling preventative action. For instance, an increase in false positives or false negatives over time could signal a need to recalibrate equipment or revise procedures. This approach saves resources and ensures the continued delivery of high-quality care.
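The cost-benefit comparison described above can be sketched as a simple expected-cost calculation. All of the rates, per-error costs, and the per-exam license fee below are illustrative assumptions for the sake of the example, not real figures:

```python
def expected_cost_per_1000(fp_rate, fn_rate, fp_cost, fn_cost, per_exam_cost=0.0):
    """Expected error-related cost per 1,000 exams, given per-exam error rates,
    per-error costs, and an optional fixed per-exam cost (e.g., an AI license fee)."""
    return 1000 * (fp_rate * fp_cost + fn_rate * fn_cost + per_exam_cost)

# Illustrative numbers only: baseline reading vs. AI-assisted reading.
baseline = expected_cost_per_1000(fp_rate=0.070, fn_rate=0.010,
                                  fp_cost=800, fn_cost=20000)
with_ai = expected_cost_per_1000(fp_rate=0.050, fn_rate=0.008,
                                 fp_cost=800, fn_cost=20000,
                                 per_exam_cost=5.0)  # hypothetical per-exam fee

savings = baseline - with_ai  # roughly $51,000 per 1,000 exams under these assumptions
```

An institution would substitute its own baseline error rates, vendor pricing, and downstream costs; the point is only that the comparison is a straightforward calculation once those inputs are agreed upon.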

How can standardization be initiated?

Standardization already exists in medical records for items like diagnoses and exams, in the form of ICD and CPT codes. However, standardization must be extended to all electronic medical record (EMR) terminology, which in turn standardizes the input data. AI tools excel at analyzing big data to surface trends and patterns, i.e., highlighting root causes and identifying unseen factors. If patient demographics, data collection, and other terminology in patient medical records or imaging exams (metadata, DICOM tags) are standardized, the redundant work of getting the AI up to speed on site-specific terms is reduced, and far more data can be analyzed across institutions and patient populations, fostering collaborative networks and large databases.
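A minimal sketch of what this terminology standardization looks like in practice: mapping each site's local exam descriptions onto one shared vocabulary before records are pooled. The mappings below are made-up examples, not a real terminology service:

```python
# Illustrative map from site-specific exam descriptions to a shared vocabulary.
STANDARD_TERMS = {
    "mammo": "screening mammography",
    "screening mammogram": "screening mammography",
    "scr mammography": "screening mammography",
    "us breast": "breast ultrasound",
    "breast u/s": "breast ultrasound",
}

def normalize_exam(raw: str) -> str:
    """Map a site-specific exam description to a standard term; pass through
    anything not in the map (a real system would flag these for review)."""
    key = raw.strip().lower()
    return STANDARD_TERMS.get(key, key)

# Records from three sites now share one vocabulary and can be pooled.
records = ["Mammo", "screening mammogram", "Breast U/S"]
normalized = [normalize_exam(r) for r in records]
```

Real deployments would normalize against established vocabularies (e.g., CPT codes or standard DICOM tags) rather than a hand-built dictionary, but the principle is the same: one canonical term per concept, applied before the data reaches the AI.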

AI vendors should work toward incorporating data from varied patient populations in diverse geographical regions during the training phase, and toward setting up cloud collaboration for patient datasets. AI’s ability to delineate population-level indicators is a key strength and should be fully leveraged.

What does this mean for the future of AI in healthcare?

Standardization of healthcare in collaboration with AI can provide equitable care to diverse patient populations, overcoming barriers such as access and resources/funding. 

Regulations help protect patient privacy, establish ethical guidelines, and safeguard against potential biases or discrimination. AI software currently falls under FDA regulation, but no imaging-specific regulatory bodies or laws monitor AI software.

Core steps to setting up regulations:

  • Funding- It is important to evaluate the funding available to integrate AI tools in the workspace. 
  • Patient consent to data use for all AI algorithm development purposes- As always, stringent privacy regulations (HIPAA) in healthcare may be limiting the potential of AI. Obtaining explicit patient consent for “digital trials” using AI models may help in acquiring relevant protected health information (PHI) in order to better train models. 
  • Regulations must enforce total transparency and upfront communication between patients and providers using AI tools. 
  • Comparison of performance across institutions- Collaborative networks with de-identified data or generated datasets can overcome the inter-institutional differences in AI tool performance. 
  • Standardization of reporting outcomes/performance metrics- Regulations must guard against “false advertising” driven by reporting bias, in which institutions report superior performance for their institution-specific AI tool without contextualizing it against comparable performance at other institutions.
  • A minimum diversity threshold could be set for training data, e.g., requiring that the proportions of each gender, ethnicity, and race in the dataset approximate their actual percentages in the general local or target population. This ensures that developed tools meet minimum standards for their training dataset, which in turn improves performance when they are deployed in a similar test population.
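The proposed diversity threshold is a simple proportional check. The group names, proportions, and 10-percentage-point tolerance below are illustrative assumptions; a real policy would define these values explicitly:

```python
def meets_diversity_threshold(training_props, population_props, tolerance=0.10):
    """Return True if each demographic group's share of the training data is
    within `tolerance` (absolute) of its share in the target population.
    Groups absent from the training data count as a share of 0."""
    return all(
        abs(training_props.get(group, 0.0) - pop_share) <= tolerance
        for group, pop_share in population_props.items()
    )

# Hypothetical target-population and training-data proportions.
population = {"group_a": 0.60, "group_b": 0.30, "group_c": 0.10}
training = {"group_a": 0.55, "group_b": 0.35, "group_c": 0.10}

ok = meets_diversity_threshold(training, population)  # passes a 10-point tolerance
```

A regulator could tighten the tolerance per group, or require proportions relative to the disease-affected population rather than the general one; the check itself stays mechanical either way.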

Quality standard guidelines being set down by the relevant regulatory bodies should serve as a jumping-off point but not the endpoint. Institutional and local regulatory bodies should work in collaboration with their AI vendors to integrate these regulations into their workflow and systems. “Ask not what AI can do for you, but what you can do for AI.” 

Here are some questions to start a discussion with your AI vendor about setting up quality standards for your institution:

  • How do you validate and test your AI algorithms to ensure consistent performance across different patient populations and sites?
  • How do you update your models and validate these updates?
  • How do you deploy updates to end users?
  • How do you seek and receive feedback from end users to ensure that your AI tool meets their broad needs and integrates into their workflow?
  • How do you address data privacy and security concerns in accordance with current applicable regulations?

Siddhi Hegde

Research Fellow and aspiring radiologist exploring new technologies in patient care.
