One of the main concerns with Artificial Intelligence (AI) in healthcare is the potential for racial and gender bias. This happens because AI-based algorithms are built on data that mirror the unequal societies in which we live. But algorithms are only tools. Depending on how they are made and used, they can introduce problematic bias but also offer ways to reduce healthcare disparities.
For example, health insurers use algorithms to predict who will get sick in the following months or years. These predictions are used to provide targeted groups with more care, in the hope of preventing the expected worsening of health of these individuals and of reducing their medical costs.
Algorithms like this are being applied to more than half the population of the United States. A 2019 study(1) by independent researchers found one of them to be deeply flawed but also explained how this had happened. The researchers then helped fix the problem.
In this case, the (unnamed) healthcare system ran a proprietary predictive algorithm over its data three times a year to generate a list of high-risk patients. The small fraction (3%) of patients with the highest risk scores were fast-tracked into the high-risk care management program, which provided extra help with their health. Patients below that cut-off but above the 55th percentile were referred to their doctor, who decided whether to enrol them. The rest were screened out and were not offered access to the program. The algorithm was race blind: race was not one of its inputs.
What does bias look like?
It turned out that if hypothetical patients with the same score were followed over the year after the algorithm made its prediction, black patients did far worse than white patients. They had much more deterioration of chronic conditions that caused them to end up in emergency departments and be admitted to the hospital. They had, for example, worse blood pressure, kidney function, haemoglobin A1c (a measure of diabetes control) and anaemia. Everything was worse in black patients compared to counterparts with the same scores who were white, even though the algorithm was viewing them the same way and gating access to high-risk care management in the same way.
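The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual code: the field names (`risk_score`, `group`, `chronic_conditions`), the bin width, and the patient records are all made up for the example. It groups patients into risk-score bins and compares the average realised health outcome of the two groups within each bin. For an unbiased score, the per-bin gaps should be close to zero.

```python
from collections import defaultdict
from statistics import mean


def outcome_by_score_bin(patients, bin_width=10):
    """Mean realised outcome for every (score bin, group) pair.

    `patients` is a list of dicts with the hypothetical keys
    'risk_score' (0-100), 'group' and 'chronic_conditions'.
    """
    buckets = defaultdict(list)
    for p in patients:
        score_bin = int(p["risk_score"] // bin_width) * bin_width
        buckets[(score_bin, p["group"])].append(p["chronic_conditions"])
    return {key: mean(values) for key, values in buckets.items()}


def calibration_gaps(patients, group_a, group_b, bin_width=10):
    """Per-bin difference in mean outcome (group_a minus group_b).

    Bins that do not contain both groups are skipped.
    """
    means = outcome_by_score_bin(patients, bin_width)
    gaps = {}
    for score_bin in sorted({b for b, _ in means}):
        key_a, key_b = (score_bin, group_a), (score_bin, group_b)
        if key_a in means and key_b in means:
            gaps[score_bin] = means[key_a] - means[key_b]
    return gaps


# Made-up records, for illustration only.
patients = [
    {"risk_score": 72, "group": "black", "chronic_conditions": 5},
    {"risk_score": 75, "group": "black", "chronic_conditions": 6},
    {"risk_score": 71, "group": "white", "chronic_conditions": 4},
    {"risk_score": 78, "group": "white", "chronic_conditions": 3},
    {"risk_score": 31, "group": "black", "chronic_conditions": 2},
    {"risk_score": 35, "group": "white", "chronic_conditions": 1},
]

gaps = calibration_gaps(patients, "black", "white")
for score_bin, gap in gaps.items():
    print(f"score bin {score_bin}: gap of {gap} chronic conditions")
```

A positive gap in a bin means that, at the same risk score, the first group is sicker than the second, which is the pattern the researchers found.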
The algorithm thus failed the test of bias. Why did it fail and how did this bias occur?
It failed because of how the algorithm was built. In the data set, there was no variable or label for “sick”. The variable chosen by the people who built the algorithm, and others like it(2), was total medical expenditure. They looked ahead to see how much the healthcare system went on to spend on each individual and used this as a proxy for how much care that person needed. Measured by cost, the algorithm performed well; the subsequent costs for white people and black people both tracked well with the predictions.
The problem is that not everybody who needs healthcare gets healthcare and there are certain groups of people for whom this is particularly true. In the US, those people are disproportionately non-white and socioeconomically disadvantaged. How much care they need is what should have been measured. It was not the algorithm’s fault that costs were lower for black patients than for white patients at the same level of health. The algorithm was mirroring bias in society because of its design. The intention was better care of patients, but the algorithm was set up to predict cost.
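A toy example may make the effect of the label choice concrete. The sketch below is illustrative only: the patient records and field names are invented, and ranking on observed values stands in for ranking on a model's predictions. It selects the top two patients once by cost and once by a direct measure of health need.

```python
def top_k(patients, label, k):
    """IDs of the k patients ranked highest on the chosen label."""
    ranked = sorted(patients, key=lambda p: p[label], reverse=True)
    return [p["id"] for p in ranked[:k]]


# Made-up records. B and D are sick but have received little care,
# so their costs are low despite high need.
patients = [
    {"id": "A", "cost": 9000, "chronic_conditions": 3},
    {"id": "B", "cost": 4000, "chronic_conditions": 6},
    {"id": "C", "cost": 7000, "chronic_conditions": 2},
    {"id": "D", "cost": 3000, "chronic_conditions": 5},
]

by_cost = top_k(patients, "cost", 2)
by_need = top_k(patients, "chronic_conditions", 2)

print("selected when the label is cost:", by_cost)
print("selected when the label is need:", by_need)
```

In this made-up data the two lists do not overlap at all. The ranking procedure is identical in both cases; only the label differs, and with it who is offered extra care.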
Researchers who conducted the study wondered what to do with their results.
Cost as a shorthand for healthcare need was, and is, a widespread health policy practice, but it turned out that nobody had thought carefully enough about the problem. The researchers reached out to the company whose risk prediction system they had studied, offering to help change the way its algorithm was built. The result was an improved version of the algorithm that predicts outcomes based more on health than on cost. This approach reduced the number of excess active chronic conditions in black patients, conditional on risk score, by 84%.
Even if you care only about reducing costs, you still wouldn’t predict costs and target all those people.
There are patients with high-cost conditions whose outcomes are difficult to change, even with knowledge of likely high future cost, for example those with metastatic cancer. People having total knee replacements cost a lot of money in that year but do not continue to have high costs. In these examples, predicting total cost may not help because the real priority is preventable cost, not total cost: for example, costs arising from emergency department visits that could have been handled over the phone, or from hospitalisations that could have been prevented through medication adjustments.
Failure to formulate the problem clearly
Problem formulation is often an afterthought when algorithms are built. One reason is that it’s difficult to get all the required data and prepare it. By the time this is painstakingly completed, the designers may be keen to just pick a variable and predict it because months have elapsed, and the project has fallen behind schedule.
1. The researchers in this study were able to obtain risk scores and other patient level data, then work with the manufacturer on the problem they had unearthed. This is an unusual circumstance and one reason the paper gained a lot of attention from health systems, technology companies and others interested in evaluating their algorithms for bias.
2. The paper provides a useful lesson for other well-intentioned medical machine learning researchers. Although population-based risk prediction has been around for decades, AI is accelerating and expanding these practices, and it’s important to build a better understanding of how bias gets into algorithms.
3. Algorithms, including those using machine learning techniques, are already in widespread use and already touch hundreds of millions of lives. The problem of bias exists now and, if unaddressed, may become more severe over time.
4. Positive attitudes around the goals and possibilities of these approaches make collaboration and solutions possible and the work more satisfying for all involved.
Are all disparities due to socioeconomic factors?
Another fascinating study(3) by some of the same researchers used machine learning to show that longstanding racial disparities in pain might be better explained by differences that are not psychological, social or socio-economic.
It started with a bet
David Cutler, a health economist, was presenting research on pain associated with knee osteoarthritis, a common cause of pain. The data showed striking disparities in pain across race and socioeconomic classes. It was already known that black patients have more pain than white patients and also that the difference, the gap in pain, persists even when controlled for severity of disease. The finding that some people experience much more pain than others is a kind of injustice not captured in public health statistics.
Radiologists use the Kellgren Lawrence scale to grade arthritis. They look at all the compartments of the knee, assign each a grade from 0 to 4, and sum the grades. If you have two patients, one black and one white, each graded exactly the same, the black patient is likely to report more pain. Cutler’s interpretation, the standard explanation of this anomaly, is that there must be psychosomatic and psychiatric stressors, differences in medical care, things going on “outside the knee” that cause the gap, because the extent of damage in the knee is held constant. Cutler’s colleague, Ziad Obermeyer, disagreed with this view and laid a bet that these findings are due to something “in the knee” itself.
Building a new algorithm
Using data on pain and race, Kellgren Lawrence grades from the study that Cutler had been using, and actual knee x-ray images, the researchers trained an algorithm to look at the x-rays and predict whether a patient would say their knee hurts or not. Most papers in the literature describe training their algorithms to read a knee x-ray replicating what a doctor would say. The input is the x-ray and the output is the Kellgren Lawrence grade.
Obermeyer’s hypothesis was that radiologists could be missing things in the knee that cause pain. Machine learning algorithms need not replicate the limitations of human knowledge and the errors built into it but can find and use new signals.
The researchers therefore trained their algorithm to listen to the patient instead of replicating what the doctor would have said and thus provide an alternative measure, an alternative set of facts, about the knee. This AI algorithmic pain score captures the predicted amount of pain from the pixels of the knee images.
Who won the bet?
As stated, there’s a pain gap between black and white patients, with black patients experiencing more pain. Controlling for the radiologist’s interpretation of the x-ray accounts for only 9% of that gap. But controlling for the new AI algorithm’s interpretation of the same x-ray accounted for 43% of the gap, or 4.7× more of the racial disparity in pain, with similar results for lower-income and less-educated patients. The AI-based algorithm did a much better job of explaining pain overall, and a particularly good job of explaining the pain that radiologists miss and that black patients report, some of which can be linked to pixels in the x-ray image of the knee.
1. The researchers concluded that much of the pain that underserved patients report stems from factors within the knee not reflected in standard radiographic measures of severity.
2. The deep learning model does a better job of explaining pain variation than the human model. There are other factors responsible for pain that cannot be linked to certain image findings, but the deep learning model provides tools for looking at x-rays that help provide the best explanation of pain possible.
3. The algorithm’s ability to reduce unexplained disparities came from the racial and socioeconomic diversity of the training set.
4. Because the algorithm-based severity measures better capture underserved patients’ pain, and because severity measures influence treatment decisions, algorithmic predictions could potentially reduce disparities in access to treatments like knee joint replacement surgery.
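To make the idea of a severity measure “accounting for” a share of the pain gap concrete, here is a minimal sketch. It rests on several assumptions: the records, field names (`kl`, `alg`, `pain`) and values are invented, and simple stratification (comparing patients only within the same severity level) stands in for the regression adjustment a real analysis would use. The share explained is the fraction by which the raw gap shrinks once severity is held constant.

```python
from collections import defaultdict
from statistics import mean


def raw_gap(rows, a, b):
    """Difference in mean pain between group a and group b."""
    pain_a = mean(r["pain"] for r in rows if r["group"] == a)
    pain_b = mean(r["pain"] for r in rows if r["group"] == b)
    return pain_a - pain_b


def adjusted_gap(rows, a, b, severity_key):
    """Size-weighted average of the within-stratum gaps.

    Strata are defined by the severity measure; strata that do not
    contain both groups are skipped.
    """
    strata = defaultdict(list)
    for r in rows:
        strata[r[severity_key]].append(r)
    total, weight = 0.0, 0
    for members in strata.values():
        groups = {r["group"] for r in members}
        if a in groups and b in groups:
            total += raw_gap(members, a, b) * len(members)
            weight += len(members)
    return total / weight


def share_explained(rows, a, b, severity_key):
    """Fraction of the raw gap that disappears after adjustment."""
    gap = raw_gap(rows, a, b)
    return (gap - adjusted_gap(rows, a, b, severity_key)) / gap


# Made-up records: 'kl' is a radiologist grade, 'alg' a binned
# algorithmic severity score, 'pain' a patient-reported score.
rows = [
    {"group": "black", "kl": 3, "alg": 3, "pain": 7},
    {"group": "black", "kl": 3, "alg": 3, "pain": 7},
    {"group": "black", "kl": 2, "alg": 3, "pain": 7},
    {"group": "black", "kl": 2, "alg": 2, "pain": 5},
    {"group": "white", "kl": 3, "alg": 3, "pain": 6},
    {"group": "white", "kl": 2, "alg": 2, "pain": 4},
    {"group": "white", "kl": 2, "alg": 2, "pain": 4},
    {"group": "white", "kl": 2, "alg": 3, "pain": 6},
]

share_kl = share_explained(rows, "black", "white", "kl")
share_alg = share_explained(rows, "black", "white", "alg")
print(f"share of gap explained by radiologist grade: {share_kl:.0%}")
print(f"share of gap explained by algorithmic score: {share_alg:.0%}")
```

In this toy data the algorithmic score explains a larger share of the gap than the radiologist grade, mirroring the direction, though not the magnitude, of the published result.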
The duality of AI and algorithms
These two studies illustrate the dual aspect that all technologies have. They can be harmful or beneficial depending on the context in which they’re used and the data they’re trained on. There is a range of possibilities for positive or negative effects with AI and medical machine learning. They’re just tools, and tools are neither good nor bad. It’s about how we build and apply them.
1. Obermeyer, Z., Powers, B., Vogeli, C. & Mullainathan, S. Dissecting racial bias in an algorithm used to manage the health of populations. Science 366 (6464): 447–453 (2019).
2. Accuracy of Claims-Based Risk Scoring Models | SOA. https://www.soa.org/resources/research-reports/2016/2016-accuracy-claims-based-risk-scoring-models/.
3. Pierson, E., Cutler, D. M., Leskovec, J., Mullainathan, S. & Obermeyer, Z. An algorithmic approach to reducing unexplained pain disparities in underserved populations. Nat. Med. 27, 136–140 (2021).