I was alerted to this fantastic tweet over the weekend:
Let’s unpack what Wikipedia has to say as an introduction. I agree it is not a “reliable source”, but it helps to get to the “meat of the argument”.
Noise: A Flaw in Human Judgment was authored by psychologist and Nobel Prize in Economics laureate Daniel Kahneman, management consultant and professor Olivier Sibony, and law professor and Holberg Prize laureate Cass Sunstein. They write that ‘noise’ in human judgment presents itself in several forms: disagreement between judges, disagreement within judges, and even in judgments made only once by a person or group, since such a judgment can be viewed as only one possible outcome in a cloud of possible judgments that the judge in question could have arrived at. (Note that the term “judge” in the book denotes any person making an assessment of some kind.)
The reasons given by the authors for why noise arises include cognitive biases, differences in skill, differences in ‘taste’ (preferences) and emotional reactions, mood in the moment, level of fatigue, and group dynamics. The authors consider noise in predictions and evaluations but not in thought processes such as habits and unconscious decisions.
Having written about AI and its impact on healthcare for over three years now, I have seen enough of the “hype machine” to recognise its patterns. Consider the persistent problems in screening mammography, where it is easy to single out the “error rates” and the ensuing issues with “false negatives”. This was encapsulated in the paper:
There is a range of at least 40% among US radiologists in their screening sensitivity. There is a range of at least 45% in the rates at which women without breast cancer are recommended for biopsy. As indicated by receiver operating characteristic curve areas, the ability of radiologists to detect cancer mammograms varies by as much as 11%. Our findings indicate that there is wide variability in the accuracy of mammogram interpretation in the population of US radiologists. Current accreditation programs that certify the technical quality of radiographic equipment and images but not the accuracy of the interpretation given to mammograms may not be sufficient to help mammography fully realize its potential to reduce breast cancer mortality.(Arch Intern Med. 1996;156:209-213)
These are serious issues, but are they, by any stretch of imagination, an “excuse” to replace radiologists with AI or ML? Consider what that would take. First, you’d need a reliable imaging dataset. Then a matched patient set (with confirmed diagnoses), in the thousands, to train the algorithms. Define a reasonable set of algorithms. Identify resources (which can cost thousands of dollars) to train them and harness the computing power. Flag the “suspicious lesions” for review. Define parameters for “legal responsibility”. And remember that most physicians already act altruistically in the interests of patients (and themselves).
It is okay to suggest a biopsy to rule out any disease or suspicious lesion. There are other factors in histopathological interpretation: the “sample density” of the FNAC; IHC controls and interpretation using the correct titration of the antibodies. Then there are issues with “wire-localisation” and subsequent “excision” if needed, with preparing the histopathological blocks, slicing and dicing them, and subjecting them to microscopic review; and it depends on the pathologists’ determination to scan numerous slides. All this before the case goes out for the actual reporting. The “accuracy” depends on a set of people working in the same domain. The linked abstract doesn’t mention anything about the radiologists performing ONLY mammograms, pathologists performing only IHC, or surgeons performing only FNAC of the “suspicious lesions”.
This is just to highlight the complexity of healthcare delivery in a single domain. We are stuck in an endless loop of arguments about the delineation of Clinical Target Volumes (CTVs), which haven’t changed much beyond the “2 cm margin all around while sparing the OARs”. Yes, that is the rule of thumb, and there is no yardstick for “overtreatment”, while everyone winces at under-treating the area and causing local failures (despite modulation). You cannot wish away statistical failures while minimising them. One of my “failed projects” dealt with defining the “true margins” for glioblastomas, but I became aware of the limitations of the proposal almost immediately: a complete lack of reliable datasets, and a horde of other issues (funding being one of them!).
So, coming back to the book. I found a reasonable synopsis of their earlier work in HBR, published in 2016, and the book appears to be an expansion of that:
Professionals in many organizations are assigned arbitrarily to cases: appraisers in credit-rating agencies, physicians in emergency rooms, underwriters of loans and insurance, and others. Organizations expect consistency from these professionals: Identical cases should be treated similarly, if not identically. The problem is that humans are unreliable decision makers; their judgments are strongly influenced by irrelevant factors, such as their current mood, the time since their last meal, and the weather. We call the chance variability of judgments noise. It is an invisible tax on the bottom line of many companies.
In contrast, medical professionals, loan officers, project managers, judges, and executives all make judgment calls, which are guided by informal experience and general principles rather than by rigid rules. And if they don’t reach precisely the same answer that every other person in their role would, that’s acceptable; this is what we mean when we say that a decision is “a matter of judgment.”
None of the authors comes from a medical background, so I’ll let that pass. It is difficult, if not impossible, to pass sweeping judgements on the complexities of delivering healthcare, where failures are magnified out of proportion (“ICUs are the killers!”), without understanding the tacit work of those units, which accept the most critically ill humans. The authors have merely passed opinionated judgements here, and those would be wise to ignore.
Again, from the HBR piece:
When pathologists made two assessments of the severity of biopsy results, the correlation between their ratings was only .61 (out of a perfect 1.0), indicating that they made inconsistent diagnoses quite frequently. Judgments made by different people are even more likely to diverge. Research has confirmed that in many tasks, experts’ decisions are highly variable: valuing stocks, appraising real estate, sentencing criminals, evaluating job performance, auditing financial statements, and more. The unavoidable conclusion is that professionals often make decisions that deviate significantly from those of their peers, from their own prior decisions, and from rules that they themselves claim to follow.
They define “noise” as follows:
The useless variability that we call noise is a different type of error. To appreciate the distinction, think of your bathroom scale. We would say that the scale is biased if its readings are generally either too high or too low. If your weight appears to depend on where you happen to place your feet, the scale is noisy. A scale that consistently underestimates true weight by exactly four pounds is seriously biased but free of noise. A scale that gives two different readings when you step on it twice is noisy. Many errors of measurement arise from a combination of bias and noise.
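The scale analogy maps onto a standard statistical decomposition: the average error across repeated measurements is bias, while the spread of the readings around their own mean is noise. A small sketch with made-up weights (the four-pound offset echoes the book’s example; the numbers are otherwise invented):

```python
import statistics

def bias_and_noise(readings, true_value):
    """Split measurement error into bias (systematic offset of the
    mean) and noise (spread of readings around their own mean)."""
    mean_reading = statistics.mean(readings)
    bias = mean_reading - true_value      # consistent over/under-estimate
    noise = statistics.pstdev(readings)   # variability across repeats
    return bias, noise

true_weight = 150.0
biased_scale = [146.0, 146.0, 146.0, 146.0]  # always 4 lb low: biased, no noise
noisy_scale = [148.0, 153.0, 147.5, 151.5]   # scattered around truth: noisy, no bias

print(bias_and_noise(biased_scale, true_weight))  # (-4.0, 0.0)
print(bias_and_noise(noisy_scale, true_weight))
```

The biased scale is perfectly predictable (and therefore correctable); the noisy one is not, which is exactly the asymmetry the authors lean on.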
As always, there is a critique (which is better than the book itself). Noise, it appears, is another term for “subjectivity”.
They offer several killer examples: first, the alarming variation in sentencing by judges in the US for similar cases. This is not the kind caused by bias—such as harsher punishment if you’re poor or black—it’s variation across the board. “Justice is a lottery,” they say. Or take the two psychiatrists who independently reviewed 426 patients in hospital to decide which mental illness they had and agreed only half the time.
But it’s worth taking a moment to ask why we should think of variation in judgment as “noise.” In this case, what exactly is the “signal”? Their answer is that it is we who get in the way of unvarnished truth. You can tell this, they say, because we disagree and, since we disagree, we can’t all be right. That means where there’s variation there’s error. Our varied judgments must be resting on all sorts of wrong-headed, often irrelevant or random influences—noise, in other words—which the research they quote suggests is very much the case. Those dripping taps and neighbours from hell are generated in our own heads.
The critique is excellent, poking holes in the “Nobel-Prize-winning author” framing, which is banal marketing. There may be a cultural tendency to give weight to titles, but it is important to separate the wheat from the chaff. Here’s another compelling argument from the critique:
But with all this qualification, tackling noise becomes rather harder than some of the authors’ more sweeping statements suggest: if studies that purport to identify sources of noise are unreliable, the diagnostic problem is evidently not straightforward; if subjectivity is often inevitable and there is no true signal, we’ll struggle to apply the book’s ideas to many areas of life where judgment varies; if error is asymmetrical, variation might be a lesser evil than convergence. Maybe even the example of conflicting diagnoses of mental illness doesn’t have a right resolution—given what they refer to as our “objective ignorance”—and so forcing uniformity on psychiatrists could be harmful to patients if that uniformity turns out to be uniformly wrong.
One reason I steer clear of “abstractive” concepts (and writing) is the subjectivity involved in the “reasoning arguments” made therein. It may be fascinating for individuals to “argue from the shadows and dissect the nuances”, but it contributes nothing towards making processes more “streamlined”. Seen in another context, “guidelines replacing human judgement” is another methodology of “conformity” that does not benefit all patients; no trial (or guideline) can account for all the patients all the time. There is no consensus, for example, on the “oligometastatic state”, yet we advocate “radical stereotactic therapies” while ignoring the fundamental aspect of palliation (pain and symptom control) and the simple historical precedents which “worked” (and continue to work). I call this the “sexification” of oncology care: flawed judgement born of too much “conformity of care”. Human pathos (and judgement) are required to understand the impact of cancer on families (emotional, psychological and financial distress) and, as a clinician, to guide them in their best interests.
I’d conclude with the critique of the book:
Simpler but less catchy than noise, objective ignorance has a lot to be said for it as the underrated reason so much human judgement and decision-making is poor and hard to improve. The somewhat obvious point that we vary wildly in our judgments because often we don’t know—and maybe can’t know—the reality is quite enough to make the case for more humility about our ability to get our judgments right. The further point that much of the reasoning we invoke to support these judgments is biased and random is largely the subject of Kahneman’s previous work. All that he has really done by calling this “noise” is to point out how our individual cognitive foibles add up to a collective mess. But often the problem is not that our foibles put us in the way of the true signal, it’s that there isn’t one—and so the foibles fill the gap.
It is true that we need to reduce variability, but calling “subjectivity” “noise” is rebranding another aspect of “statistics”. I don’t claim expertise in either; this is an opinionated take.