However, it’s hyped, yes. Here’s an opinion piece from Scientific American:
And why does the research say that reported accuracy decreases with increasing data set size? Ideally, the held-out data are never seen by the scientists until the model is completed and fixed. However, scientists may peek at the data, sometimes unintentionally, and modify the model until it yields a high accuracy, a phenomenon known as data leakage. By using the held-out data to modify their model and then to test it, the researchers are virtually guaranteeing the system will correctly predict the held-out data, leading to inflated estimates of the model’s true accuracy. Instead, they need to use new data sets for testing, to see if the model is actually learning and can look at something fairly unfamiliar to come up with the right diagnosis.
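To make the leakage mechanism in that excerpt concrete, here is a minimal sketch in Python. It is a toy simulation, not anything taken from the article: the data are pure noise, each "model" is just a threshold on one random feature, and every name and parameter (`make_dataset`, `leakage_demo`, the sample sizes) is an assumption made for illustration. The point is that choosing a model by its score on the held-out set tends to inflate that score even when there is nothing to learn, while accuracy on genuinely new data should stay close to chance.

```python
import random


def make_dataset(n_samples, n_features, rng):
    """Pure-noise data: labels are independent of every feature."""
    X = [[rng.random() for _ in range(n_features)] for _ in range(n_samples)]
    y = [rng.randint(0, 1) for _ in range(n_samples)]
    return X, y


def accuracy(feature, X, y):
    """Accuracy of the toy 'model' that predicts 1 when X[i][feature] > 0.5."""
    hits = sum((row[feature] > 0.5) == bool(label) for row, label in zip(X, y))
    return hits / len(y)


def leakage_demo(n_held_out=50, n_fresh=5000, n_candidates=200, seed=0):
    """Pick the best of many useless models by peeking at the held-out set."""
    rng = random.Random(seed)
    X_held, y_held = make_dataset(n_held_out, n_candidates, rng)
    X_new, y_new = make_dataset(n_fresh, n_candidates, rng)

    # The leak: the held-out set is used to choose the model...
    best = max(range(n_candidates), key=lambda j: accuracy(j, X_held, y_held))

    # ...and then the same held-out set is used to report its accuracy.
    leaked = accuracy(best, X_held, y_held)

    # Honest estimate: data the selection step never touched.
    fresh = accuracy(best, X_new, y_new)
    return leaked, fresh


leaked_acc, fresh_acc = leakage_demo()
print(f"accuracy on the peeked-at held-out set: {leaked_acc:.2f}")
print(f"accuracy on fresh data:                 {fresh_acc:.2f}")
```

The smaller the held-out set and the more candidate models are tried against it, the larger the gap tends to be, which is one plausible reading of why reported accuracy drops as data sets grow.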
The authors offer no real explanation of why, if a problem like “data leakage” exists, it is not reported or avoided. Is the sole purpose of any publication to demonstrate “overoptimistic projections”, then?
The reproducibility crisis in medicine (and in the biomedical literature) is a well-known problem. Suppose a specific algorithm offers an x% benefit over no deployment at all, but requires disproportionate computing resources to demonstrate that benefit. Would you invest in that line of research (or deploy it commercially)? Would you want to “reproduce” the entire experiment to “prove the claims”? It is not practically feasible.
I am including a comprehensive review from NIST along these lines:
I found this instructive in the document:
The authors do have a valid point to make:
We can prevent these issues by being more rigorous about how we validate models and how results are reported in the literature. After determining that development of an AI model is ethical for a particular application, the first question an algorithm designer should ask is “Do we have enough data to model a complex construct like human health?” If the answer is yes, then scientists should spend more time on reliable evaluation of models and less time trying to squeeze every ounce of “accuracy” out of a model. Reliable validation of models begins with ensuring we have representative data. The most challenging problem in AI model development is the design of the training and test data itself. While consumer AI companies opportunistically harvest data, clinical AI models require more care because of the high stakes. Algorithm designers should routinely question the size and composition of the data used to train a model to make sure they are representative of the range of a condition’s presentation and of users’ demographics. All datasets are imperfect in some ways. Researchers should aim to understand the limitations of the data used to train and evaluate models and the implications of these limitations on model performance.
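The advice above to question the composition of the data can be turned into a simple check. The following is only a sketch under stated assumptions: the function name `composition_gaps`, the age bands, the reference shares, and the 5-point tolerance are all hypothetical and invented for illustration, not drawn from the article or from any real data set.

```python
from collections import Counter


def composition_gaps(groups, reference_shares, tolerance=0.05):
    """Return groups whose share of the data differs from the reference
    population share by more than `tolerance` (absolute difference)."""
    counts = Counter(groups)
    total = len(groups)
    gaps = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total
        if abs(observed - expected) > tolerance:
            gaps[group] = (observed, expected)
    return gaps


# Made-up example: a training set that skews young compared with the
# (equally made-up) population the model is meant to serve.
sample_groups = ["18-39"] * 70 + ["40-64"] * 25 + ["65+"] * 5
reference = {"18-39": 0.35, "40-64": 0.40, "65+": 0.25}

gaps = composition_gaps(sample_groups, reference)
for group, (observed, expected) in gaps.items():
    print(f"{group}: {observed:.0%} of data vs {expected:.0%} expected")
```

A check like this only flags imbalance on the attributes you thought to record; it says nothing about how the condition presents within each group, which the excerpt also asks designers to consider.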
Here’s another intriguing link (from the main link) around the “usefulness of clinical research”:
Practicing doctors and other health care professionals will be familiar with how little of what they find in medical journals is useful. The term “clinical research” is meant to cover all types of investigation that address questions on the treatment, prevention, diagnosis/screening, or prognosis of disease or enhancement and maintenance of health. Experimental intervention studies (clinical trials) are the major design intended to answer such questions, but observational studies may also offer relevant evidence. “Useful clinical research” means that it can lead to a favorable change in decision making (when changes in benefits, harms, cost, and any other impact are considered) either by itself or when integrated with other studies and evidence in systematic reviews, meta-analyses, decision analyses, and guidelines.
- Blue-sky research cannot be easily judged on the basis of practical impact, but clinical research is different and should be useful. It should make a difference for health and disease outcomes or should be undertaken with that as a realistic prospect.
- Many of the features that make clinical research useful can be identified, including those relating to problem base, context placement, information gain, pragmatism, patient centeredness, value for money, feasibility, and transparency.
- Many studies, even in the major general medical journals, do not satisfy these features, and very few studies satisfy most or all of them. Most clinical research therefore fails to be useful not because of its findings but because of its design.
- The forces driving the production and dissemination of nonuseful clinical research are largely identifiable and modifiable.
- Reform is needed. Altering our approach could easily produce more clinical research that is useful, at the same or even at a massively reduced cost.
This requires a deep dive and a careful understanding of the nuances. AI in medicine is definitely “hyped”, and any model needs rigorous validation before production deployment.