Docs are ROCs: a simple fix for a “methodologically indefensible” practice in medical AI studies

By LUKE OAKDEN-RAYNER

Anyone who has read my blog or tweets before has probably seen that I have issues with some of the common methods used to analyse the performance of medical machine learning models. In particular, the most commonly reported metrics we use (sensitivity, specificity, F1, accuracy and so on) all systematically underestimate human performance in head-to-head comparisons against AI models. This makes AI look better than it is, and may be partially responsible for the “implementation gap” that everyone is so concerned about.

I’ve just posted a preprint on arXiv titled “Docs are ROCs: A simple off-the-shelf approach for estimating average human performance in diagnostic studies”, which provides what I think is a solid solution to this problem, and I thought I would explain it in some detail here.

Disclaimer: not peer reviewed, content subject to change

A (con)vexing problem

When we compare machine learning models to humans, we have a bit of a problem. Which humans?

In medical tasks, we typically take the doctor who currently does the task (for example, a radiologist identifying cancer on a CT scan) as a proxy for the standard of clinical practice. But doctors aren’t a monolithic group who all give the same answers. Inter-reader variability typically ranges from 15% to 50%, depending on the task. Thus, we usually take as many doctors as we can find and then try to summarise their performance (this is called a mu...
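To make the shape of the problem concrete, here is a minimal sketch in Python, using entirely synthetic data (the case labels, model scores and reader error rates below are made-up placeholders, not anything from the paper). It shows the usual head-to-head setup: the AI model produces a full ROC curve, while each doctor contributes only a single sensitivity/specificity operating point in ROC space, and the question is how to summarise those scattered points fairly.

# A minimal sketch with synthetic data: each "doctor" gives binary calls on
# the same cases, while the AI model outputs a continuous score.
import numpy as np
from sklearn.metrics import roc_curve, confusion_matrix
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Hypothetical ground truth for 200 cases (1 = disease present)
y_true = rng.integers(0, 2, size=200)

# Hypothetical model output: a noisy continuous score
model_scores = y_true + rng.normal(0, 0.8, size=200)

# Hypothetical readers: each doctor flips a different fraction of labels,
# mimicking inter-reader variability
readers = []
for error_rate in (0.15, 0.25, 0.35):
    flips = rng.random(200) < error_rate
    readers.append(np.where(flips, 1 - y_true, y_true))

# The model gets a full ROC curve...
fpr, tpr, _ = roc_curve(y_true, model_scores)
plt.plot(fpr, tpr, label="AI model (ROC curve)")

# ...while each doctor is a single operating point in ROC space
for i, calls in enumerate(readers, start=1):
    tn, fp, fn, tp = confusion_matrix(y_true, calls).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    plt.plot(1 - specificity, sensitivity, "o", label=f"Doctor {i}")

plt.xlabel("1 - specificity")
plt.ylabel("Sensitivity")
plt.legend()
plt.show()

In a plot like this, each doctor sits at a different point in ROC space because each operates at a different trade-off between sensitivity and specificity; naively averaging their sensitivities and specificities separately tends to produce a summary point that sits inside the curve the doctors actually trace out, which is the kind of underestimation the preprint is about.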