Avoiding the Next Healthcare AI Winter

November 2021 Investor Newsletter

Healthcare Friends –

Today, let’s talk about the Achilles’ heel of healthcare AI — quality standards. Artificial intelligence dramatically alters the way clinicians work, but quality standards for healthcare AI are confusing and disorganized at best, despite billions of dollars having been invested over the past decade. Two out of three health system AI initiatives fail — usually due to underperformance and excessive algorithm noise.

Underperformance stems from limited validation. An algorithm needs to perform well on the data set it was trained on, and that performance must be generalizable across patient populations. Validation on the training data set establishes the baseline, but validation on the health system’s specific patient population is often missed. 

94% of published AI algorithms are not properly validated.

Algorithms that don’t perform well in the real world harm patients, and the noise from false positives causes clinicians to pull away from the entire technology category. When this happens too often, even responsibly developed and validated solutions fall under the same suspicion, which has a chilling effect on the field.

So as clinicians and health systems grow impatient and disenchanted with AI, how do we avoid the next healthcare AI winter?

Key takeaways:

  • The aftermath of non-validated AI is messy and puts patients and clinicians at risk in unforeseen ways
  • Current AI validation best practices fall short
  • Healthcare AI adoption requires an updated set of metrics

Reactions to Epic

A STAT investigation of an Epic algorithm intended to predict sepsis in seriously ill patients revealed that it routinely triggered false alarms and failed to identify the condition in advance.

Here are the highlights from the article:

  • Researchers at the University of Michigan Medical School tested the algorithm in a retrospective study of over 27,000 Michigan Medicine patients who underwent a total of 38,500 hospitalizations.
  • The study found that Epic’s tool identified only 7% of the sepsis cases that were missed by a physician, while failing to flag 67% of patients with sepsis (a high false-negative rate).
  • Epic has paid some health systems as much as $1 million to adopt their algorithms.

But it’s not entirely Epic’s fault. As far as many providers are concerned, if you don’t have AI, you aren’t relevant. The promises made by AI companies battling for awareness in a crowded space can be misleading and extreme. Healthcare’s ongoing AI Gold Rush has pressured vendors to churn out algorithms left and right, without sufficient objective review of those algorithms’ performance. The lack of standardized quality measures for healthcare AI doesn’t help.

Non-validated healthcare AI is also worsening healthcare inequality. A recent Science paper found that a UnitedHealth AI algorithm was prioritizing care for healthier white patients over sicker black patients because it was trained to predict healthcare costs rather than illness. The study found that black patients generate lower medical expenses because of reduced access to adequate medical care, regardless of how sick they are (measured by the number of chronic conditions).

Current Validation Standards

Leading healthcare AI vendors continuously produce algorithms that learn from massive datasets and help clinicians with diagnosis and peer review. These are the metrics algorithm developers currently focus on:

  • The confusion matrix — balancing the true-positive and false-positive rates. Precision, the percentage of AI flags that are truly positive, is a common concern of busy clinicians. An algorithm that flags many areas to raise its true-positive rate (also known as recall) at the expense of its false-positive rate wastes clinicians’ time as they sort through endless, mostly normal results.

  • AUROC — the area under the receiver operating characteristic (ROC) curve, which plots the true-positive rate as a function of the false-positive rate. In AI validation, the higher the AUROC, the better the model distinguishes between positive and negative cases.

  • FDA clearance — A STAT News investigation of 161 AI products cleared by the FDA found that only 7 publicly disclosed the racial composition of their testing populations, and only 13 provided a gender breakdown. Algorithm developers are currently not required to document how the AI was developed or that its performance was validated on an independent dataset.

    The FDA is working to establish guidelines for developers to provide clear descriptions of training datasets, the relevance of inputs, the logic employed, and the intended role of the output.

  • Peer-reviewed research and published studies — most reputable when completed independently of the vendor. However, studies that simply verify that an algorithm works as intended struggle to get published. Also, large vendors (like Epic) often have gag orders in their commercial contracts.
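
To make the first two metrics above concrete, here is a minimal Python sketch. The labels, scores, and the 0.5 flagging threshold are made-up illustration values, not data from any real algorithm, and the function names are hypothetical. AUROC is computed here in its pairwise form: the chance that a randomly chosen positive case scores higher than a randomly chosen negative one.

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Tally the four confusion-matrix cells for binary labels (1 = positive)."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, flag in zip(y_true, y_pred):
        if truth == 1 and flag == 1:
            counts["tp"] += 1
        elif truth == 0 and flag == 1:
            counts["fp"] += 1
        elif truth == 1 and flag == 0:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def ratio(numerator: float, denominator: float) -> float:
    """Divide safely, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def precision(c: Dict[str, int]) -> float:
    """Share of AI flags that are truly positive."""
    return ratio(c["tp"], c["tp"] + c["fp"])


def recall(c: Dict[str, int]) -> float:
    """True-positive rate: share of real cases the AI flagged."""
    return ratio(c["tp"], c["tp"] + c["fn"])


def false_positive_rate(c: Dict[str, int]) -> float:
    """Share of normal cases the AI flagged anyway."""
    return ratio(c["fp"], c["fp"] + c["tn"])


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Chance a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical cases: 3 with the condition, 7 without.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.3, 0.8, 0.4, 0.2, 0.2, 0.1, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed flagging threshold

counts = confusion_counts(y_true, y_pred)
print(counts)
print("precision:", round(precision(counts), 3))
print("recall:", round(recall(counts), 3))
print("false-positive rate:", round(false_positive_rate(counts), 3))
print("AUROC:", round(auroc(y_true, scores), 3))
```

Precision, recall, and the false-positive rate all depend on the chosen threshold, while AUROC summarizes ranking quality across every threshold. That is one reason a strong AUROC alone says little about how many false alarms clinicians will actually see.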

While these metrics are highly technical and can verify an algorithm’s accuracy on the dataset it was built on, they hold little water when it comes to the algorithm’s performance in real-world patient care. For example, it is the norm, not the exception, for a health system’s patient population to differ significantly from the population the algorithm was trained on.

One particularly infamous example of this is IBM Watson’s oncology AI system, which was trained at Memorial Sloan Kettering and then deployed at MD Anderson without considering the significant differences between the two institutions’ oncology practice patterns. You can learn more about the resulting disaster in a multi-part 2017 STAT exposé that reads very much like the recent 2021 Epic sepsis coverage.
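
One simple way to see why population differences matter is disease prevalence. The sketch below uses the standard relationship between sensitivity, specificity, prevalence, and precision (positive predictive value). All numbers are hypothetical, and the sketch optimistically assumes sensitivity and specificity stay the same at the new site, which real deployments often do not achieve.

```python
def positive_predictive_value(sensitivity: float, specificity: float,
                              prevalence: float) -> float:
    """Share of flagged patients who truly have the condition."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    flagged = true_positives + false_positives
    return true_positives / flagged if flagged else 0.0


# Hypothetical algorithm: 90% sensitivity and 90% specificity at both sites.
ppv_development = positive_predictive_value(0.9, 0.9, prevalence=0.20)
ppv_local = positive_predictive_value(0.9, 0.9, prevalence=0.02)

print("precision in development cohort:", round(ppv_development, 3))
print("precision in local population:", round(ppv_local, 3))
```

With these assumed inputs, roughly 7 in 10 flags are correct in the development cohort but fewer than 2 in 10 are correct locally, even though nothing about the algorithm itself changed.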

What Metrics Should Be Included

Healthcare already faces more pain points than most industries in adopting machine learning (ML) solutions. Between data security and lack of awareness of available tools, the last thing a health system needs is to realize that the algorithms they finally implemented are wasting clinician time with inaccurate results.

Avoiding that outcome requires metrics that measure the success of algorithm deployment, not just development. So what validation metrics should health systems also be considering?

  • Workflow Impact — How long does it take for providers to use the tool? How easy is it for clinicians to dismiss false positives, confirm true positives, and switch to the AI widget from their current workflow? 

  • Integration and Maintenance — What are the costs of implementing and maintaining this solution? How does it scale? What are the security and privacy considerations?

  • Durability of User Adoption — What percentage of providers are actually going to use this? How many will still be using the algorithm in 6 to 12 months?

    Before they started working with us, our health system partners had deployed dozens of different AI solutions in aggregate. Between 50% and 70% of those deployments did not deliver on their promised results and were not renewed.

  • Clinical and Business ROI — Looking beyond current vague metrics like “physician peace of mind”, what is the actual impact this has on our physician productivity? Were there significant changes in care that resulted? How much downstream revenue was generated?

  • Dataset Quality — How different is the training dataset from the patient population at hand? A recent study on the Geographic Distribution of US Cohorts used to train deep learning algorithms found that most of the training data of over 70 publications came from only 3 states: California, New York, and Massachusetts. If an algorithm is highly complex and built from a limited dataset, it likely won’t carry enough value to justify the costs.

  • Local Validation — How does the algorithm’s performance on my local patient population compare to the published performance?

  • Bias and Drift — Does the algorithm perform equally well across gender, race, scanner type, and clinicians? Does it keep performing at the same level over time?
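
The Local Validation and Bias and Drift questions above can be checked with very little code once local ground truth is available. The following Python sketch is one possible approach, not a description of any vendor's or Ferrum's method. The records, the group names, the published recall of 0.80, and the tolerance of 0.10 are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Each record is (group, true label, AI flag), with 1 meaning positive.
Record = Tuple[str, int, int]


def recall_by_group(records: Iterable[Record]) -> Dict[str, float]:
    """Compute recall separately for every group that has a positive case."""
    detected: Dict[str, int] = defaultdict(int)
    positives: Dict[str, int] = defaultdict(int)
    for group, truth, flag in records:
        if truth == 1:
            positives[group] += 1
            if flag == 1:
                detected[group] += 1
    return {group: detected[group] / positives[group] for group in positives}


def flag_shortfalls(local_recall: Dict[str, float], published_recall: float,
                    tolerance: float) -> List[str]:
    """List groups whose local recall trails the published figure by more than the tolerance."""
    return sorted(group for group, value in local_recall.items()
                  if value < published_recall - tolerance)


# Hypothetical local cases reviewed by clinicians, grouped by scanner type.
records: List[Record] = [
    ("scanner_A", 1, 1), ("scanner_A", 1, 1), ("scanner_A", 1, 1),
    ("scanner_A", 1, 0), ("scanner_A", 0, 0), ("scanner_A", 0, 1),
    ("scanner_B", 1, 1), ("scanner_B", 1, 1), ("scanner_B", 1, 0),
    ("scanner_B", 1, 0), ("scanner_B", 0, 0),
]

local = recall_by_group(records)
shortfalls = flag_shortfalls(local, published_recall=0.80, tolerance=0.10)
print(local)
print("groups needing review:", shortfalls)
```

The same grouping works for gender, race, or reading clinician. Using the calendar month as the group turns it into a basic drift check, since a falling recall from month to month shows performance degrading over time.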

In the absence of industry, academic, or regulatory rigor in the space, health systems are entirely responsible for verifying that AI is effective for their patient population, a responsibility they readily acknowledge they are woefully unprepared to manage. One notable effort in the space (aside from Ferrum, of course) is the American College of Radiology’s AI-LAB platform for ongoing, local evaluation of commercial AI models.

Developing an AI implementation strategy has now become a universal component of health system IT viability. However, without first resolving this fundamental trust issue that divides health systems and AI vendors, I can’t think of a path forward for meaningful AI use in mainstream healthcare.

Until next time,

Pelu Tran
CEO, Ferrum Health
