Black-Box Medicine: Legal and Ethical Issues

SPEAKER 1: Welcome, everybody. Usually in January, I commend
people for making it out despite the winter weather. But it feels a little more
like March or April out there. So you get no bonus
points for attending. But nevertheless, this is
an immensely exciting topic. I’m unsurprised that
the room is full. Although if people
are looking for seats, there’s still a few
up in the front. Black-box medicine
has enormous potential to shape health
care for the better by improving and aiding
many medical tasks. What we’re talking about
in black-box medicine are machine learning algorithms
and artificial intelligence, which examine a variety of
different health care data sources, including genomic
sequencing, patient records, and the result of
diagnostic tests to make predictions and
recommendations about care. An algorithm is called black-box either because it is based on opaque machine learning techniques or because the relationships it draws are too complex for explicit understanding. Now that elicits a knee-jerk reaction in most people, I think. Certainly it does in me, and probably in you as well. And so one of the big questions confronting us when we look at the use of this tool is whether we dislike the fact that we can't check the machines' work because there is something inherently wrong or problematic about that, or whether it's just discomfort with a new technology that we will get over. Certainly the stakes are
high for figuring out black-box medicine
and other applications of artificial intelligence
in health care. We’ve made significant
progress in areas such as radiology and oncology. And tech can’t solve everything. But it can help extend
our limited resources to make better use of
health care spending. So for example, IBM
Watson for Oncology is a machine
learning system that intends to help
clinicians quickly identify essential information
in patients’ medical records and explore treatment
options for 13 different types of cancer. And this is good because we
only have limited human hours. And so the machines can make
our physicians or practitioners more effective. But there are serious
concerns about incorporating AI into health care,
which our speakers will address and unpack. Some of them are that the
algorithms are really only as good as the data
that they are fed. And our data sources, like
most things in our society, reflect some inherent biases. So recently I was talking
to somebody who is building AI for breast cancer detection. And she was telling me about
a popular algorithm that is in use, that in
African-American women is actually worse than
random at predicting what the ideal treatment should be. When she asked the creator why it was worse than random in this population, he said, well, this algorithm was created for white British women, not for African-American women. It's a different population. It's different data. So the algorithm
really would need to be retrained on this data. It can also reflect
other inherent biases, such as ones around disability. How do we tell the algorithm
to value people who are differently
abled, when in society we sometimes struggle
to do that ourselves? Then there’s the question
of how great the data is. Watson for Oncology famously has been trained on synthetic data, which has been very controversial. Some people have suggested
that the results of Watson For Oncology are not
optimal or should not even be used because the algorithm
was trained on synthetic data. These questions are only going
to get even more challenging as we move away from the
initial, easy applications, such as radiology and oncology
into more and more difficult applications. There’s also the question
of how clinicians should incorporate AI and black-box
medicine into their practice? Right now we
consider these tools to really be decision aids. They don’t tell physicians
what they should do, but they suggest,
hey, maybe you want to give this patient
a CT scan instead of an ultrasound or vice versa. But very quickly, we're
going to get to the point where these tools are
not going to be decision aids but decision makers. How do we train
physicians to weigh these tools appropriately? What do we do when the
AI recommends something different from what a physician
with years of experience would do? And how do we train
the next generation of physicians who are going to
come of age with these tools and techniques firmly
embedded into their practice? Before I introduce
our speakers, I want to say if today’s
topic is interesting to you, then you should
check out our project on Precision Medicine,
Artificial Intelligence, and the Law that we host here
at the Petrie-Flom Center. PMAIL, as we like to call
it, is a comparative analysis of the laws, regulations, and incentives around black-box medicine and, more broadly, artificial intelligence in health care, both in the US and in
the European Union. We’re trying to answer
some of the questions that I flagged and other
questions, such as exactly where does the
liability lie when we have broadly implemented AI? So please keep your eyes
open for further things coming out of PMAIL as well
as other events and publications, and reach out if you find
this particularly interesting. Now, in my initial
presentation, I have flagged a whole
bunch of interesting but difficult to
answer questions. And as moderator, I have
the luxury of not needing to answer a single question. Instead, the two people
who will hopefully tackle these
questions and provide some answers, or
at least suggest how we should be thinking about
answering these questions, have joined me on the dais. So first up will be
Professor Glenn Cohen. Glenn, I have not memorized
your bio, although at this point I probably should have. But he is the James A. Attwood
and Leslie Williams Professor of Law at Harvard Law School,
as well as the Faculty Director of the Petrie-Flom Center. He is the author of
more than 100 articles and chapters and his
award winning work has appeared in a
variety of venues: legal, including the Stanford, Cornell, and Southern California Law Reviews; medical, including The New England Journal of Medicine and JAMA; bioethics, including the American Journal of Bioethics and the Hastings Center Report; scientific, such as Cell and Nature; and public health, such as the American Journal of Public Health. He is also the author,
co-author, and editor or co-editor of 12 books. Joining him is
Ziad Obermeyer, a physician and researcher
who works at the intersection of machine learning and health. His research seeks to
understand and improve decision making in public policy
and clinical medicine and drive innovations in
health care research. His work has been published
in Science, The New England Journal of Medicine,
JAMA, The British Medical Journal, and Health Affairs. He is the recipient of
an Early Independence Award from the Office of the Director of the National Institutes of Health and
the Young Investigator Award from the Society for
Academic Emergency Medicine. His research is supported
by the National Institutes of Health, the
Robert Wood Johnson Foundation, the World Bank,
and the Laura and John Arnold Foundation. So please join me in
welcoming our two speakers. [APPLAUSE] GLENN COHEN: I’m going to start
with a joke I’ve told before, which is there’s so
much hype in this area. People talk about
big data machine learning the way
adolescent boys talk about sex in the locker room. Everybody says they’re doing
it and they say that they’re doing it all the time. But the truth of the
matter is, many of them don’t quite
understand what it is. If they are doing
it, they’re doing it not very well, and
certainly not in a way that both parties
to the interaction enjoy or benefit from. So we want to try to
cut beyond the hype. But to start, I’m not
going to be promoting any unlabeled, unapproved
uses of drugs, devices, et cetera, et cetera. So I thought we should
start with a little bit of vocabulary. Big data has become the
ubiquitous watchword of our generation. What does it actually mean? It’s thought to have at
least three Vs to it. The first is volume,
vast amounts of data. The second is variety,
significant heterogeneity in the data. So genetic results,
patient records, maybe even shopping habits. The third is
velocity, the ability to process the information
at an incredibly rapid speed. And the fourth, which is sometimes there and sometimes not because it's aspirational, is value. It's going to add value to
the way we do health care. Predictive analytics
is something we want to do with big data. It’s the use of
electronic algorithms that forecast future
events in real time. And it's intersecting with the law in myriad ways, from how votes are counted and how voter rolls are revisited to how taxpayers are targeted for auditing. But we're mostly interested
today in the health care implications. Much of the care we have today
in America and around the world remains relatively
untracked and underanalyzed. We don’t know whether most
of the treatments we give are effective. Many have proved
to be ineffective. We don’t know what works
and what doesn’t work. Big data can help. We can leverage patient
records, patient data, including data outside of
the health care encounter with the idea of
improving patient care. This can be used in terms of
improving actual techniques we use in terms of
pharmacovigilance, in terms of trying
to figure out what products should be
taken off the market, dot, dot, dot, dot, dot. Machine learning is a subset
of artificial intelligence. It’s a method of
data analysis that attempts to find
structure or a pattern within a data set without
human intervention. And we usually classify
machine learning algorithms into supervised and
unsupervised algorithms. In supervised
learning, the system is presented with example inputs
along with the desired outputs. And the system tries to
derive a general rule that maps one onto the other. So for example, the
identification of diagnoses by symptoms. In contrast, in unsupervised
machine learning, no desired outputs are
given and the system is left to find patterns independently. For example, clustering
and anomaly detection. One more distinction
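The supervised/unsupervised contrast can be made concrete with a small sketch. This is a purely illustrative toy in Python, not any real clinical system: the symptom vectors, diagnosis labels, nearest-neighbour rule, and one-dimensional k-means are all invented for the example.

```python
# Supervised: example inputs paired with desired outputs
# (hypothetical symptom vectors -> diagnosis labels).
train = [((1, 0, 1), "flu"), ((0, 1, 0), "cold"), ((1, 1, 1), "flu")]

def predict(symptoms):
    # 1-nearest-neighbour: copy the label of the closest training example,
    # i.e. the "general rule" is derived from input/output pairs.
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(train, key=lambda ex: dist(ex[0], symptoms))[1]

# Unsupervised: no labels at all; the system finds structure on its own.
def cluster(points, k=2, iters=10):
    # Plain 1-D k-means: alternate assigning points to the nearest center
    # and moving each center to the mean of its assigned points.
    centers = points[:k]
    for _ in range(iters):
        groups = {c: [] for c in centers}
        for p in points:
            groups[min(centers, key=lambda c: abs(c - p))].append(p)
        centers = [sum(g) / len(g) for g in groups.values() if g]
    return sorted(centers)

print(predict((1, 0, 1)))                       # labeled examples drove this answer
print(cluster([1.0, 1.1, 0.9, 8.0, 8.2, 7.9]))  # two groups emerge without labels
```

The point of the toy is only the shape of the two problems: the first function needs the answers during training, the second never sees any.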
vocabulary wise to raise before we jump into the
legal and ethical issues, and that is the
notion of a centaur. You might say, wait a
second, that’s classics, that’s Greek mythology. Why are you talking
about centaurs? You’ll remember that the centaur
is a hybrid of human and horse. The future of AI
is unlikely to be systems running
autonomously, but instead the partnering of
human ingenuity and human knowledge
using AI systems, much like the centaurs. Because it turns out, as we
know from examples from chess and Go and the like, while
AI systems often beat humans, humans plus AI often beat AI alone. So the goal we're striving
towards for the most part are these kinds of centaurs. OK. Let me start with
some sunshine, which is current and near
future applications. These are just a
couple of examples. In 2017, FDA approved Arterys’s
Medical Imaging Cloud AI as the first machine learning
application that could be used in a clinical setting. Originally approved
for cardiac images, but it’s also gotten FDA
clearance for oncology imaging. And the company’s new
clinical offerings include lung AI and liver
AI to help radiologists quickly measure and track
tumors or potential cancer through MRI and CT scans. Research from Stanford
published in Nature, I think that was
a couple of years ago on AI that was
trained to classify images of skin
lesions, recognizing over 2,000 diseases. Another example up here is
IDx-DR, the first AI diagnostic that provides a screening decision that can be used without the oversight of a clinician to interpret the image or result. It's an AI-based device. It got FDA approval
in April 2018 to detect the eye disease
diabetic retinopathy in diabetic patients. Recently, researchers from Moorfields Eye Hospital in London, together with collaborators at DeepMind, published in Nature Medicine a study showing that an AI system could read complex eye scans and make referral recommendations comprising more than 50 diagnoses. And the system was
trained on 14,884 scans and showed a
success rate of 94%. In reality, the future, I think, can be divided into two columns, two
ways of dividing the world, although I’m always cognizant
of that Tom Robbins quote that there are two types of
people in the world, those who think that there
are two types and those who know better. So I don’t want to suggest
that this is exhaustive. But I do want to suggest
this is perhaps helpful. One is by function. Where do we see AI being used? Imaging has been the
low hanging fruit. We have tons of data. It’s easily machine readable. But in the future we’ll
see more prognostics, predicting what health needs
will be for particular patients in the future. Diagnostics, determining what is
wrong with a particular patient who has symptoms, and treatment,
using AI to suggest which among a set of possible
treatment possibilities will be most successful. But a different and
complementary way of thinking about this is
related to the purpose. Why are we turning to AI? One of the most
important contributions will be the democratization
of expertise. Everybody could get access
to a good dermatologist through the AI, rather than
waiting for a specialist referral, or perhaps in
a developing country not having specialists available to
large swaths of the population. They might be able
to automate drudgery. Who wants to do their
billing as a dermatologist? Let’s free up the time
the dermatologist spends on that by having the AI handle
the drudgery part of the work. Optimize resources. Choose who among a
patient population sees the dermatologist
with priority. Do triage. And finally pushing
frontiers, improving how dermatologists
perform their jobs, computer aided dermatology. Now most of the public
discussion of AI focuses on that last
category, pushing frontiers. But most, or at least the
short and middle term value proposition, is actually going
to be in the other categories. So let me just make this
tangible with two use cases very briefly. One, based loosely on
Watson for Oncology is a decision of what
chemotherapeutic to apply. So imagine you’re trying to
decide which of a set of cancer chemotherapeutics to apply. You look at your patient's EHR, and built into that
is a module that helps you make some decisions. It uses not only the
information from the EHR, not only genetic sequencing of
the patient and of the tumor, but it also links up to consumer
data about that patient’s shopping habits,
location, age, demography, and compares it to the
same records for millions of other patients who have been
de-identified in the system. That is where we are rapidly heading. Perhaps a slightly
more far flung but also ethically interesting
question would be, instead imagine you’re
trying to determine who to admit to the ICU. You could do a workup. It would take you several hours. It would have relatively
limited inter-rater reliability but you could do that. Imagine instead you had a system
that could do it continuously every seven minutes
for every patient in the hospital who is
eligible for the ICU to help you make immediate
decisions constantly being updated. Now this one is interesting
because the AI is not only telling us which of several
therapeutics to give, but also helping us decide how
to prioritize between patients. This is where we want to go. OK. So that’s the sunshine. Now I’m an ethicist and lawyer. Let me bring you to the clouds. This is my master slide of all
the ethical and legal issues you’re going to face
in trying to do this. And you know, this is
a little bit like me. I’m looking at Ziad for a
moment because he actually does this stuff. This is like a philosopher
talking to their mechanic about how they should think
about the engine, right? The zen of the engine, right? So I agree with that. That’s totally fine. He’ll tell you how
they actually do it. But I’m going to tell you how
we should conceptualize this from a legal and
ethical perspective. Phase one. Where do you get the data from? How do you acquire the data? Among the issues, do
patients like you and I have to consent to the
use of our EHR data for these purposes? Is that enough to sign a front
door consent that none of us read when we enter the
hospital for the first time or does there have to be
a more specific consent? Is it enough to consent
for all purposes or does there have to be
details about the purposes they want to put the data to? Do I have to be
reconsented for that? Is notice enough? How representative
is the data set that’s going to be involved? Does it represent
only the patients that go to the Brigham
and Women’s Hospital or does it represent patients
from all over America? Ziad is going to spend
considerable time talking about the problem of
bias here, and the extent to which statistical corrections
can be made on the back end. If they can’t be,
how do we incentivize better, more robust, more
diverse data collection? What role for patients in
the governance of this data? Should patients be empowered
to make these decisions, if not at an individual level,
at a collective level? Is the right way to think
about this like a union or maybe like a trust, like
a will with fiduciary duties? OK you’ve got your data. Now you’re building and
validating your model. How do you know your model
is good enough to actually be used on real patients? What are the standards of validation that we put in place? Are we going to have risk classification, with more rigorous, robust validation for higher-risk uses of algorithms versus lower-risk ones? And who's going to be
doing that classification? Here we have a real tension
between transparency on the one hand and trade
secrets on the other. In an ideal world, as
patients and as regulators, we want the data, we
want the training data and also the algorithms to
be as transparent as possible so that anybody can
go under the hood. But in a world like
in the US and the EU, where the algorithms are not
themselves patent protected, most companies will
turn to trade secrecy as an intellectual
property strategy to protect their investment. Trade secrecy is very
difficult to reconcile with accountability,
auditing, and transparency. Can we do this? Are we to look to
third-party auditing under contractual arrangements? Is that the solution here? And what kind of agency is going
to be empowered to do this? FDA has traditionally been
very allergic to software. It’s something they don’t
touch and very allergic to the practice of medicine. It’s something they
say they’re not about. Is this going to force
an agency like FDA to get into those areas and
will they be successful? Phase three. You’ve got a validated
algorithm based on patient data. You want to test it
in the real world. Do you as a patient
have a right to be told that your care is being
informed by, or maybe partially directed by, an AI? Is that part of what we
expect in the informed consent process? Can patients opt out? Does it matter
whether the opting out is about a resource
allocation question as opposed to a
treatment recommendation? What about when the AI is
actually directing the way the hospital is staffed? How often to send a physician
versus physician’s aide around? How quickly to process
certain kinds of test results? Are those the kinds of
things that we think patients have a right to know? Most patients currently
know zero about how those decisions are being made. Is it different because
an AI is involved or are we guilty of a kind
of AI exceptionalism here? We’ll discuss in a
moment the questions about liability, who pays when something goes wrong, as it will. There are also difficult questions about model implementation versus model design, questions about choice architecture. So for example, how many alerts and how many overrides are necessary, and how do we configure the system? Once again, what regulator
or combination of regulators is best suited to
this kind of work? Phase four. You’ve got a model. It actually works
in real patients and is showing improvements. Now you want to engage
in broad dissemination. But there’s a problem. This might cost money. Will hospitals pay for this? Will insurers pay for this? Will the government
pay for this? Is there an obligation
if these models are based on the
data of patients all across America to have the
model be available to patients all across America? And how do you achieve that? Are there obligations
towards graduated licensing? So I want to just delve
into one or two more of these in greater depth and
then I’m going to wrap up. The first is the question
about safety and effectiveness. As Carmella alluded to, Memorial Sloan Kettering, and Watson for Oncology more specifically, I should say here, has given some people some cause for concern, based on the way it was implemented and this question of synthetic controls. And something I think we should
talk about during the Q&A if Ziad is not going to talk
about it in his own slides, is the question about
whether using synthetic cases makes sense here as opposed to
real cases as training data. But this example
shows us that there is a real possibility that
something is going to go wrong. When something goes
wrong, who pays for it? There’s a few possible
model designs here. We’ve got malpractice. That’s what we use
currently with doctors. But in fact, the
current liability regime says if you are a physician who relies on a computer decision aid and the computer decision aid has the error, you nonetheless pay for it. You pay the bill. Is that problematic in a world
where physicians not only rely on these AI medical
devices but can’t understand them
because they are based largely on black-box models? Should we be thinking about moving to something like a vaccine compensation fund, which pays out regardless of fault? Or should we have a
system of preclearance by an agency like
FDA, the idea being once you achieve that
kind of clearance, you are immunized for liability? How do we allocate
liability between the makers of algorithms, the hospitals
that purchase them, the physicians who rely on
them, and the insurers who pay for them? OK. And this is my last
slide, privacy. In the US as opposed
to Europe, we have relied on a system
of privacy protection that is custodian- and sector-specific. HIPAA, which many health professionals think is actually a four-letter word, is our main statute that
protects health privacy here. But it protects covered entities
and those business associates and only certain kinds of data. We are rapidly
getting to a future where most of the
data that’s going to allow us to make inferences
about your health and my health is not above the waterline. It’s not the stuff generated
by your visits to the doctor, but all the stuff below the
waterline, your purchasing habits at Target, your Fitbit
results, your social media, how you spend your time,
your Google searches, right? Turns out there’s some
interesting studies showing that Instagram
filter choices are excellent predictors
of depression, for example. In a world where, it
turns out, our statutes and our whole scheme are focused
on what’s above the waterline, what happens to all that
data below the waterline and how do we regulate it? So with that, I’m
going to turn it over to my friend, Dr. Obermeyer. And thank you very much. And thank you to the funders. [APPLAUSE] ZIAD OBERMEYER: Thanks
so much for having me. I want to start
off with a thought experiment for you guys. So imagine that the FDA has
approved a new technology. Imagine that this technology
is made by a large multibillion dollar for profit company. And doctors all around the
country, after it’s approved, just start using it overnight. And we don’t really
think they have much of a clue how it works. But it’s starting to affect the
lives of millions of people. And we actually have
no idea, whether we’re doctors or policymakers or
patients, how this thing works. And I think that’s a very
concerning scenario to most people with good reason. But I’ll tell you one odd thing
about this thought experiment: this thought experiment
describes many, many drugs that are in use today. So we have very little idea
how most antidepressants work, how statins might work
to prevent heart disease, how Metformin works
to reduce the risk of diabetic complications. It also, for those of you who
took Tylenol earlier today, describes how we
think about Tylenol, which is that we literally
have no idea how it works. And if you start looking
at all of medicine through the eyes of like, oh,
do we really understand this? When you go get an MRI of
your knee, how well versed do you think your doctor
is in quantum physics? Do you think they
really understand the Lorentz forces and the spin
dynamics of different tissue? No. And I think if you again,
follow this line of reasoning, this sounds completely insane. This sounds like
a crazy situation that all around us
in our medical system we're just surrounded
by black-boxes. So it does sound crazy. And I think what I’ll
try to convince you of in the next few minutes
is that it’s not so crazy. And that, in fact,
we do not need to open these
black-boxes to make a very profitable use of them
in society and in medicine. The key point is
that we can actually, without opening the box,
look at the inputs to the box and look at the
outputs of the box. And for drugs, we do
this all the time, even for drugs whose
mechanisms we don’t understand, we can put them through
randomized trials and look at the health
benefits that they do or do not generate. What would that look
like for algorithms? I think a good starting
point would also be randomized
trials of algorithms compared to usual care. It’s not a bad start. But since that’s not being
done, and since many algorithms are actually out of the
box already and being used in society, there
are some other methods that we can bring to bear. And those are generally
categorized as audits of algorithmic outputs. And so what I want to do
for the next few minutes is just take you through
a concrete example of one of these things. Because I think it’s
sometimes easier to engage with some
concrete examples than with abstract ones. And so we’re going to consider
an algorithm that’s actually in wide use today that’s
affecting tens of millions of patients around the
world and the health care decisions that are being
made for them as we speak. And I’ll just make
a broader point that Glenn mentioned a
lot of areas in medicine where people are very
optimistic about what algorithms are going to be doing. In a lot of health
care delivery settings, in accountable care
organizations, and Medicare Advantage plans,
algorithms are already being used at scale today. And I think this is very
unlike the promise of AI. This is actually happening now. And we have very little
understanding of how these things are working. So that’s why there is some
urgency in figuring out how we deal with these kinds of
algorithms that are at scale, live, being used today. So the particular algorithm
I’ll tell you about is an algorithm that’s
going to be used, and being used, to predict
which patients are going to have health needs in the future. The reason we want to do this is
because a lot of health systems have invested heavily in what
are called care coordination programs. So these are a little
bit like almost VIP programs for patients that
have a lot of health needs. So you get access to a
specially trained provider, usually a nurse
practitioner that’s on call for you and your needs. You have help in filling your
prescription medications, in getting an appointment
to see your doctor. And the idea is that
if we can deliver to those very high risk,
high need patients, care when they need it, before
their health needs deteriorate, not only will we deliver
a better quality care, but we’ll also
bend the cost curve and start reducing all of the
in hospital emergency care needs that these patients
generate when their health needs aren’t met in real time. OK. So one key part of
the program, which you might have
already guessed is that these are very high
touch, expensive programs. So this is having a human, a trained human, on call for you whenever you need it; these things are expensive and
you can’t do it for everyone. So you need to target these
programs to the people who are going to need them the most. Makes sense. And this is a massive commercial
market for algorithms. Pretty much every big health
care analytics company, IBM Watson, et cetera,
et cetera, et cetera, has a product in this space
that’s being sold and used by accountable care
organizations and other health care systems today. So what we do is we
obtain these predictions from a particular
commercial algorithm. It’s the largest
algorithm in use today, but this is really
representative of how the entire industry does this. And we get those data for a
large primary care population. It’s about 45,000
patients, fairly diverse. And what we get for
that algorithm is we get the inputs that
go into the algorithm and then we get the
algorithmic predictions that come out of it. And then we’re able
to do work looking at how those algorithmic
outputs link to real outcomes because we have those data
linked to electronic health record data from
the health system where these predictions
are going to be used. So this is just some
summary statistics about the patient population. I’ve broken it out by race. So for white and black
patients separately, because there’s one thing that
we found very striking when we started evaluating how this
algorithm worked in practice. And I’m going to show
you data on the number of chronic medical
conditions that patients have in this population. So here’s what it looks like. So, on the x-axis, and I
apologize for the graphs. Yeah, I’m going to tell
you about the axes. But you can kind of tune out
if that’s not your thing. On the x-axis is the
algorithms prediction, ordered from left to right
in order of increasing risk. On the y-axis is the number
of chronic medical conditions patients have at a given level
of algorithmic prediction. And the two lines are
showing you white patients versus black patients. And what you can see is
that at any given threshold of algorithm predictions, black
patients have far worse health than white patients. So that's interesting. Because how are these predictions being used? Well, as I told you, they're being used to guide decisions about who gets into these
programs that give you extra help and extra medications
and extra appointments with your primary
care physicians. So now you can imagine
what’s happening. At a given level of
risk, white patients are healthier than
black patients. But they’re being
treated the same way for the purposes of
getting into this program that we think is very helpful. How consequential
is that difference? You can perform a simulation
exercise where you essentially swap patients in and
out of the program until white and black patients
have the same level of risk, like the lowest risk patient
has the same level of health. And if you simulate
addressing that inequality, you would actually double the
fraction of black patients who are automatically
enrolled in this program. So this is a fairly
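The swap exercise just described can be sketched in a few lines of Python. Everything here is hypothetical: eight invented patients whose risk scores systematically understate black patients' health needs, a made-up program capacity of four, and a comparison of automatic enrollment when you rank by the algorithm's score versus by actual health need. The numbers are constructed to mirror the pattern the speaker describes, not drawn from the study.

```python
# Toy data (hypothetical): at equal chronic-condition counts, black
# patients receive systematically lower risk scores.
patients = (
    [{"race": "white", "conditions": c, "score": float(c)} for c in (2, 4, 6, 8)]
    + [{"race": "black", "conditions": c, "score": c / 2 - 0.1} for c in (2, 4, 6, 8)]
)

def enrolled_black_fraction(patients, key, capacity):
    """Fraction of black patients among the top-`capacity` patients ranked by `key`."""
    top = sorted(patients, key=key, reverse=True)[:capacity]
    return sum(p["race"] == "black" for p in top) / capacity

# Rank by the algorithm's score, then counterfactually by actual health need.
by_score = enrolled_black_fraction(patients, lambda p: p["score"], capacity=4)
by_need = enrolled_black_fraction(patients, lambda p: p["conditions"], capacity=4)
print(by_score, by_need)
```

With these invented numbers, re-ranking by chronic conditions doubles the enrolled black fraction, which is the flavor of result described in the talk.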
consequential level of bias that has the potential
to impact real decisions in a very significant way. So I’ve been showing
you these data, which is how many chronic medical
conditions the patients have at any given level
of algorithmic risk. You can basically do this
for any measure of health. And we have a lot of
different measures of health from the electronic
health record data. This is blood pressure. So if you look at black versus white patients, these differences are, if you scale them by mortality risk or heart attack risk, big, big differences. Hemoglobin A1C as a measure
of diabetic control, blacks have much higher levels. Lung function is worse. Red blood cell counts are lower,
suggesting more severe anemia. Kidney function is worse. So pretty much
any health measure that you look at, at a given level
of the algorithmic risk score, blacks have far
worse health than whites, with the consequences that
you can imagine in terms of getting into this program. So the obvious question
is, of course, where’s the algorithm going wrong? How is this bias creeping in? So there are lots
of explanations. Glen cited a few in terms
of the training population and not extending to
other diverse populations. I’m going to tell you about
a very different mechanism. So one clue to what’s
going wrong here is where the algorithm
is going right. So, here’s another graph. This is again algorithmic
risk scores on the x-axis. But what I’m showing
you on the y-axis here is the total medical
expenditures of a given patient in the next fiscal year. And you can see here,
there is zero bias. So the algorithm is
predicting cost perfectly. And it’s doing so perfectly
for both blacks and whites. And so here’s the problem
with predicting costs. Black and white
patients don’t have the same relationship between
health and health costs. So you can look at
a given level of health, and this holds for any measure of health;
now health is on the x-axis. So now, if you draw
a vertical line, you’re looking at patients
with the same level of health. You can see that the white
patients have higher costs than the black patients, about
40% higher costs on average. This is because, of course,
black and white patients have very different access to care. This is a commercially and
Medicare insured population, but nonetheless of
course, there are so many ways in which black
patients on average have less access to the health
care system, less ability to take time off from work,
to get to their appointments, et cetera, et cetera. So what that means is that
at any given level of health, blacks are costing less. So if your algorithm is in the
business of predicting costs, you’re going to build in all of
the disparities of our health care system that
affect black patients into your algorithmic
predictions. So that’s the summary. The algorithm underpredicts
risk for black patients and that leads to healthier whites
getting into this program ahead of sicker blacks. Of course the proximal
cause seems clear. It’s the choice to predict
cost rather than predicting some measure of health. But what’s the distal cause? Why was that choice made? So there’s one story that I
think is on everyone’s minds, which is incentives. So let’s just follow
the money and figure out why this decision was made. Of course, if you’re a
hospital, if you’re an insurer, and if you’re the
analytics company that is selling a product to
those hospitals and insurers, you care a lot about cost. But society cares about health. And so those are very
different objectives. The choice of cost
makes a lot of sense if you’re an insurer or
a health care provider. But for the rest of
society, we actually care a lot about
health, not just cost. And so you could view this
as just another instance of profit distorting health
care, which is something that we see a lot of. So that’s one story. Here’s another
possible story that I want to at least have
you guys consider, which is that we actually
don’t really know what we’re doing when we build algorithms. And we’re in the process
of figuring that out. So under this story,
you can actually go back and think about
all of the discussions about health care policy
that we often have. We often slip into using
health care costs as a proxy for health. That makes sense because people
who are sick generate costs. So you know, there’s a
certain face validity to that. But it’s also convenient. Because it’s hard
to measure health. You saw how many different
measures of health I had to put up to
convince you that people were sicker or healthier. The same problem
comes when you’re trying to train an algorithm
to predict something. It’s a lot easier to
use costs than try to come up with a very detailed
multi-dimensional measure of health. And so under this
story, cost was chosen because it’s convenient
and because we didn’t really appreciate that that
choice was going to generate this enormous bias. So why does that
distinction matter? Well here’s one reason
why it might matter. After we saw these
results, we decided to take what to many might
seem like an obvious course of action, which was, we just
contacted the company that manufactures this algorithm. And we just said, hey,
we found this problem. Did you guys know about this? And their answer
was, oh, my god, no. We did not know about this. They were incredibly responsive. They put us in touch
with their research team that’s building the algorithm. And they overall just
wanted to work with us as academic researchers
to understand this problem and to fix it. And so this has led
to a collaboration with the manufacturer
of this algorithm. They have replicated our
results on the data set that they used to
build the algorithm and found the same
thing that we found in our hospital-based
population. And together with them,
we’re developing methods to adjust those
predictions and to build a new method of predicting
that’s free of bias. And just as a
preliminary result, if you look at one
measure of aggregate bias, so at a given level of health
across the whole health distribution, how many
excess chronic illnesses do black patients have
compared with white patients? Under the original
algorithm predictions in their national population,
there are about 50,000 of these. And when we’ve adjusted the
algorithm in collaboration with them, there are
now only about 8,000. So that’s an 84%
reduction in bias. And that was not by
getting a new population or doing anything very fancy
with a statistical correction of the bias. We simply helped them to
predict measures of health, rather than measures of
cost in their same data set. And so I’ll wrap up by
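The label swap just described, same data and same modeling pipeline but predicting a health measure instead of cost, can also be illustrated with a toy sketch. Again, all numbers are hypothetical and invented for illustration, not the manufacturer’s data; the 1.4 multiplier simply encodes the roughly 40% access-driven cost gap mentioned earlier.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical data: a latent illness burden, and costs that encode the
# access gap described in the talk (at equal need, white patients
# generate roughly 40% higher costs). Illustrative only.
black = rng.random(n) < 0.12
need = rng.gamma(shape=2.0, scale=2.0, size=n)
cost = need * np.where(black, 1.0, 1.4)

def flagged_need_gap(label, top_frac=0.03):
    """Flag the top `top_frac` of patients by `label`; return the mean
    illness burden of flagged Black patients minus flagged white patients.
    A large positive gap means Black patients must be sicker to be flagged."""
    flagged = label >= np.quantile(label, 1.0 - top_frac)
    return float(need[flagged & black].mean() - need[flagged & ~black].mean())

gap_cost_label = flagged_need_gap(cost)    # label = cost: sizable gap
gap_health_label = flagged_need_gap(need)  # label = health: gap shrinks
```

With cost as the label, flagged Black patients must be far sicker than flagged white patients; re-labeling to the health measure drives that gap toward zero, which is the direction of the bias reduction the collaboration reported.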
saying that algorithms will join many, many, many
other black-boxes in medicine. And that’s OK. It doesn’t have to
be a problem as long as we follow the playbook that
we already have for evaluating black-boxes in medicine. And that’s to look
very carefully at the outputs
from that black-box without necessarily
needing to open them. And I think that
kind of scrutiny, yes, will reveal serious
problems, including but not limited to bias. But these problems can be fixed. And I think that
kind of approach will let us benefit from
all of the great things that, as Glenn
mentioned, algorithms can bring to us in
our health care system and beyond while avoiding
some of these downside risks. Thanks. [APPLAUSE] SPEAKER 1: Great, so in
the remaining time we have, which is somewhere
between 15 and 20 minutes, we are going to take
questions from the audience. If you are interested
in asking a question of either of our speakers
or both of our speakers, we have a microphone in the
center aisle; please line up. But in the meantime I am
going to exercise moderator’s prerogative to ask the first
question of our speakers, which is, Ziad, in your
presentation, you kind of go through how you caught
a pretty important bias in an algorithm that
was already being used to direct care for individuals. From both of your
perspectives, what sort of regulations
or process or norms do we need to put in
place so that we can catch these mistakes and
assumptions? Because, Ziad, you can’t be
everywhere evaluating every single
algorithm, I assume. ZIAD OBERMEYER: Yes, it’s
pretty clear to anyone who’s worked with me that
relying on me to do anything is not a good policy solution. I actually think
Glenn is better placed to comment on this
than I am, but I think taking our cue from how
we approach the same problem from the point of view
of pharmaceuticals, it’s very clear to me that we
need more randomized trials of algorithms. Because I think,
as Glenn hinted at, when IBM is making up
synthetic data to tell us that their algorithms
are working well, it’s not the kind of thing
we’d accept in any other part of health care, right? If a pharmaceutical
company tells us, oh, yeah, we have
a statistical model and our drugs are working great. That’s not a substitute
for doing the trial. Nor should it be for algorithms. At the same time, I think
we also, from drugs, have a precedent for what
I think they call post-market surveillance, but
essentially once the drug is out there and being used, our
responsibility to evaluate it doesn’t end. And I think that’s
the genre of things that we
are trying to do. But yes, it’s pretty clear
that that shouldn’t just be left to the initiative
of academic researchers. GLENN COHEN: So I largely agree. But I want to add an
asterisk or a caution or a concern, which
is the following. I think there’s both
culturally and pragmatically a lack of fit between FDA and
Silicon Valley to some extent, just in terms of the
frame for reviewing the idea that these
algorithms will form part of a learning
health care system, such that they will– we want them to learn as we
get more patient experience. We want to feed more data. We want to alter the
algorithm as we go. So even if you’re able to do
the initial trial at approval, it would be strange to
freeze the algorithm based on the data at approval
when new data, or stuff that Ziad learns,
shows us that it’s having this effect in
terms of race or terms of other unexpected variables. And then the question
becomes, will you go for reapproval
and at what point do the time frame and the
costs of the approval process become problematic? So I do think that there’s this
interesting question there. Having said all that, I
think always in this area, though we don’t do it enough, we
have to ask: as against what? We have a strong status quo bias
in favor of the way medicine is currently practiced. But everything we know about
the way medicine is currently practiced tells us, no offense
to my physician friends, but much of medical practice
has very little evidence base behind it. We don’t know if it works. We don’t know anything. So the question to
me is whether you feel as though a more lenient
regulatory process, letting more algorithms through the
door, would sufficiently improve the delivery
of health care as against the status quo,
that you’d be willing to accept some of these risks. And that’s just a very difficult
question as a social planner. My own instinct is
that it’ll depend on what the algorithm is doing. For things that are offering
patients more services or targeting people
for more follow up, I think we possibly
can live with a less rigorous regulatory process. For things that are going
to be denying people things or choosing what they get in
terms of life saving areas, perhaps we want more. But it’s not just a
conversation for the experts. You as the patient population
have a right to make decisions and to participate
in this conversation. So one question to
ask yourself, if I could tell you there is
an algorithm I could apply that in 90% of cases would make care
better, improve your outcomes, but in 10% of cases
make it much worse and you don’t know ex ante
which of the two populations you’re in. Would you prefer to go with your
doctor and the current baseline status quo or would you
rather your doctor be directed by such an algorithm? I think that’s an interesting
question to leave hanging. SPEAKER 1: Great. Questions from the audience. Please line up. AUDIENCE: All right. Jared Silverman. First, thank you for a
wonderful presentation. Really eye opening. GLENN COHEN: Would you
speak into the mic? It’s hard to hear you. Yep. AUDIENCE: I’m interested in
whether the mechanics of code writing matter. We understand
how the data going in may lead to good or bad results
such [INAUDIBLE] problems. But do we also need to look
at the mechanics of the code writing? In other words, is something
lost in the translation when you apply it? I was wondering if
you even thought of that in your analysis. ZIAD OBERMEYER:
As anyone who has tried to read someone
else’s code knows, that is not a good way to
catch problems with the code. So I think it’s a really
great question, because you would think, we have the code. We know exactly the steps. But unfortunately
these– even four lines of code you know they stand
in for such complexity and how you’ve created the data
set, where the data are coming from, what the outcomes
are, that it’s just very difficult to anticipate
what the outputs of that code are going to be
without seeing what the outputs are going to be. AUDIENCE: [INAUDIBLE] doing
more to try to understand that. [INAUDIBLE] accountable. If we think that there
is something lost. ZIAD OBERMEYER: Yeah,
it’s a good question. I mean, I think it depends on
what your prior is about whether we have the cognitive
hardware to understand what algorithms are doing. And I think that our own– I think, these algorithms
are generally more complex than we can understand. That’s a good thing because
we want them to do things that we don’t know how to do. That’s a fantastic thing. So we wouldn’t want to
necessarily hamstring the algorithms by making them
play by our cognitive rules. GLENN COHEN: I’ll also add that
a precondition of the work that Ziad does, you’ll
correct me if I’m wrong, is some access to
the verification set and some access–
or the output set and some access to
the training set data. And there’s no guarantee that
we’re going to live in a world where that’s true, especially
about the training set data. So to the extent that this
data is siloed, locked up in databases, that is
part of the business model of the people who are
making these algorithms, it may not always be possible
to engage in this third party scientific review of them. So one of the questions
we have to ask ourselves is what our data policy will
be as to these databases. AUDIENCE: Thanks. Thanks for that. Now, I think this is a follow
up on this last point that was made. But also perhaps mainly
a question for Ziad. It’s a really nice example and
I see how it works quite well for the case of racial bias. Because in the case
of racial bias, you can sort of examine
the output patterns. So you have this
kind of heuristic. So if there’s a
kind of discrepancy between certain groups that we
have already identified, we know something is
wrong [INAUDIBLE] earlier. But there might be other
kind of values that we want to scrutinize
where we don’t have these confusing heuristics. So I’m wondering what you– both panelists
would suggest we– GLENN COHEN: So I’ll
repeat the question. The question raised was: this is very
easy to do the work, not easy, but this is admirable work
you have done on racial biases, but there are lots of
other biases in there that might be less
close to the surface that we may not even
be thinking about. Is that a fair paraphrase? AUDIENCE: That’s fine. ZIAD OBERMEYER: Yeah,
I think absolutely. And I think the
oncology example illustrates that, yes, there
are some things we understand and we know to look for. But how is an algorithm going
to affect downstream cancer diagnoses and misdiagnoses? That’s not something that you
can do in a post-algorithm audit kind of situation. That is a question that
can only be answered with rigorous evaluations
that are of the genre of randomized trials. And so, to Glenn’s point,
I think for every use case, there is going to be a
judgment call about whether we want to let these algorithms
go into use without doing the trial. But I think right
now, if you step back and look at the
landscape in general, you find a lot of people– like, no one’s even
thinking about doing trials. And I think that’s a huge problem. SPEAKER 1: So because we
have a lot of questions and we only have 10 minutes,
what I’m going to do is implement
lightning round rules. So if the next two people
could ask your questions, and then the speakers will
tackle whichever question they feel is appropriate. AUDIENCE: I have a question on
the liability with a vaccine. Professor Cohen, you said
something about the allocation of liability for a vaccine. And [INAUDIBLE] know a bit more. Are there any criteria as to how
the liability is allocated? What kind of internal
processes do they go through, and is there interaction
with external sources? AUDIENCE: Thank you all for a
very stimulating presentation. One question of predictive
analytics back to the speakers. Given the tension,
the growing tension, between the data hunger required
for advanced machine learning, especially unsupervised
deep learning algorithms and the increasing
risk and suspicion about privacy and
loss of privacy, I’d ask you all
to either predict or even go so far as to– what would you advocate in terms
of the evolution of our privacy rules to balance that tension
between data and privacy? GLENN COHEN: I’ll
take liability. You take privacy. Sticking you with the hard one, right? But he’s not a lawyer. If you want to take liability,
you can do that too. So the bad news for
my doctor friends is that they are
currently on the hook and likely to be on the hook
under the current case law about using AI directed care. And the reason is
because the doctor is the captain or, I don’t know
what the female form of captain is. I guess it’s still a captain. The captainess, the
captain of the ship, right? And the idea here is that
the expectation is, whatever is helping you, whoever
is advising you, you are making the
final decisions. That is true in the current
world we’re living in, even truer in the [INAUDIBLE]
world of AI implementation. One could imagine,
though a world where it’s no longer true. If the insurer says the
AI is recommending x. We will only pay for x. And the doctor
really wants to do y, is it fair for
the doctor to have the liability in this case? Then the truth of the matter is
the US most doctors are heavily male medical malpractice
insurer to the point that in fact they
never actually pay the money they pay
ahead of time we have a kind of social
insurance plan. But doctors don’t think
about it that way. Malpractice haunts them in a
way that economists might say it shouldn’t. So I do think that
there is some reason to reexamine the
liability regime if we’re moving to a world where
physician choice is diminished. ZIAD OBERMEYER: And just
to echo Glenn’s point, I think no one is really
seriously thinking about a world where there
is full automation, where the decisions are made
autonomously by algorithms. And so I think for the
foreseeable future, much like in other
areas like air travel, you know flying an
airplane, the algorithms are helping the human. But the human is the captain. On the privacy front,
I don’t think I have a prescriptive statement. But maybe the way
at least I found helpful to think
about it is, what are the implicit tolerances
and tradeoffs that we’re willing to make in other areas? So why am I willing to give
Google access to my location? Well, because it helps
me get around the city that I don’t know and it
helps me forecast traffic. I’m getting
something back and so it’s a bargain that it seems
like we’re very willing to make in a lot of other areas. But I think that’s the
same standard that we need to apply in this area. The greater the
invasion of privacy, the greater the benefit
needs to be to justify it. But I think if you just set
any one person’s opinions aside and thought about
it in the world, how are people
making this tradeoff? I think that’s how
people are making it. It’s, you know, we’re willing
to give up more and more privacy as we get more and
more benefits back from giving up that privacy. SPEAKER 1: All right. Next two questions please. AUDIENCE: Thank you. Tony Weiss from Beth
Israel Deaconess. This is great and I appreciate
you organizing this. I was wondering about
the regulatory question that came up earlier and
whether there is a light touch regulatory approach
that could be borrowed from the
medical publishing group. So when you think about
current decision aids to physicians, in
some ways that’s the journals that we use to
guide the behaviors that we implement within the hospital. And those are really
overseen by peer review. And I was wondering
whether there is an approach to algorithms
that could be more, in a way open sourced and
peer reviewed, to allow us to put some controls over that? AUDIENCE: Hi. Thank you so much and thanks
for opening this to the public. I’m wondering how data
collection from patients in terms of Press Ganey scores
and patient satisfaction surveys, how that’s
influenced health care either for better or worse. ZIAD OBERMEYER: I really
love the peer review idea. And not to answer all of
these questions the same way, but I think if you look
at how the FDA asks pharmaceutical companies
to submit applications, it actually kind of
looks like that process. So the FDA has a very
clear set of goalposts, that these are the data
that you have to submit to get a new drug approved. Here are all of the
boxes you need to check. And then a bunch of
people at the FDA review those materials that are
given in the structured way. And so I think I love the
analogy to journal editing. I think it’s a very good– you could imagine doing
that as a first pass to say, OK, this is how we’ll get
the algorithm out there. Then we want some post-market
surveillance kind of thing. But I think it’s a great idea. GLENN COHEN: I tend to agree. But I’ll just throw again, you
know, clouds, clouds, clouds, you can’t get away from it. But just to throw a little
bit of clouds here is to say, if the question is
the who, then you may be right that peer reviewers
would be just as good as FDA or someone like that. But if the question
is the what, what is it that we’re going to
require to be reviewed, we come back to what
I think are some of the fundamental
difficulties here. The gold standard is the
RCT done for the algorithm. We could require
that and say, here’s the who that’s going to do that. But the question
will be, I think, and this is something
which you have obviously thought deeply about,
for certain algorithms, can you really make an
evaluation without the RCT? And then to the
extent you’re also thinking that your algorithm
is going to be implemented in a real workflow in
a real environment, we tend to think a molecular
entity, a chemical, has its effects in the
body for the most part, pretty homogeneously
however it’s administered. An AI in the way
it’s administered and the way it
interacts with the style of practicing medicine, with
a way a hospital is run, with an insurance
system, there’s likely to be a lot
more variability. So even as to RCTs, you might
think the results of the RCTs are less probative for
other implementations, and especially if you
think function creep tends to happen with these AIs. So the who partially may solve
the problem, but the what may still get us in the end. ZIAD OBERMEYER: Yeah I’d say
that the reason that we insist on randomized trials for
drugs is because there are very real statistical
problems with generalizing from observational data. Nothing about machine learning
solves those problems. But those problems are
just as vexing with machine learning as they are in
any other application. And so, you know, the
idea that we can say, OK, I’ve fit an algorithm to
predict the results of testing among the people who got tests. Now I’m going to apply that
to people who never got tests. Those people are very different. The doctor chose to test
these people for a reason. They’re very
different in both ways that we measure and in ways
that we don’t measure. So until you’ve– you
know you don’t get the benefit of the doubt. The assumption is
that that’s not going to transfer over
to the untested people until you’ve shown otherwise. GLENN COHEN: And on the
patient satisfaction score, so my brother likes this
book, Who Moved My Cheese? It’s like a business book or
something like that, right? When patient satisfaction
is the cheese as was part of the roll
out of part of the ACA, people chased the cheese, right? You would then kind of
orient your thing like that. Now what we saw, I don’t
know what your opinion is, but what we saw with the overimportance
of patient satisfaction scores, were investments
sometimes made to improve patient satisfaction that
didn’t necessarily improve health, or even cut against it. Now it is normatively
possible to have the view that health care is in
part a consumption good. And just like I like driving a
Tesla more than something else, I’ve never owned a car. But someone might enjoy driving
a Tesla more than a Maserati. I don’t know who they are. That’s fine. Let’s just take that value
neutral and give it to them. But I think most of
us view health care as having goals and measurements
that are not directly connected to patient satisfaction. In some ways, the
doctor doesn’t give you an injection so it doesn’t hurt, and instead you
get the lollipop. That’s not a happy outcome. That means you are unvaccinated
and are infecting other people. SPEAKER 1: All right. Last two questions. AUDIENCE: Glenn you brought
into your clouds, the consent piece, the urgency perhaps of
dynamic consent in the world that we’re moving
through with AI. In the absence of
having something like GINA which
protects us from misuse of our genetic
information, how urgent do you think for new
legislation might be for how we think about
the use of our artificial-intelligence-derived predictive
values of our health risk? AUDIENCE: Hi, thank you
for this conference. I’m both thrilled and
terrified at the same time. I’m wondering if
anyone in this arena is thinking about
the interaction between electronic health
record input, data input, and what the algorithms
are being built on. Because we know EHRs
have changed behavior. They’ve changed documentation. We tend to be metric driven now. So everything changes
per the metric. And I can see, if you’re
using one for your data over a couple of years,
it could shift over time. And the EHRs, you know,
they’re big powerful companies. They’re not going to
do whatever you want. Remind me the one
word of the first one. This is the problem
with two questions. AUDIENCE: Dynamic
consent and GINA. GLENN COHEN: And GINA, yeah. So I’m actually working
on a paper with somebody in the audience right now
on looking at long term care insurance and life insurance
where in fact our existing protection, GINA, the Genetic
Information Non-Discrimination Act, does not protect people
against adverse inferences based on their genetic data. But the reality is,
we’re moving to a world where our ability to
make actuarial guesses about your needs is much
higher than it was before. Now there’s a
positive side of this if you believe in
actuarial fairness. Which is to say, why
just penalize you if you’re a smoker versus not? Why not penalize you
for or reward you for many more things
about you that gets the fairness more accurate? But the flip side is that,
for insurance to function, it has always been a
socialized market. If we could perfectly
risk stratify everyone, we’d have no insurance,
because everybody would just buy at the level their risk is. I do think, and this
goes a little bit back to the question about privacy
that we were asked before. There is a division
between people who are much more focused on
upstream protection, people who are more focused on
downstream protection. The upstream people want
to limit the amount of data you collect about people. Downstream people, and you
can be concerned with both, want to say, we’re not
so worried about the data collection, we’re worried about
particularly noxious uses of it and let’s focus on beefing
up the anti-discrimination protections. I’m probably a little bit
more in the second camp than in the first. But there’s other people who
say, that’s a fool’s errand. Because the way in which
lobbying works and Congress works and the fact that when
the Republicans were in control, they wanted to actually repeal
parts of GINA, not extend it, makes you think that
actually relying on Congress or legislation to protect
us against this kind of actuarial discrimination
might be a fool’s errand. ZIAD OBERMEYER: I’ll
just say one thing on the second question. So I think it’s an
uncontroversial statement to say that there’s
a lot of garbage in the electronic health record. I think there’s
also a lot of gold. And I think there is– it’s just like all of medicine
and all of health care. You know, there’s both. And so you know
here’s one crisp way to think about the distinction. If I train an
algorithm to predict a variable that
has a lot of bias built into it, for example cost. Or even, for example, the doctor’s
decision whom to test, we know that there is not just,
you know, bias but error. There are just a lot of
mistakes in that decision. So you can contrast
that with saying, OK, what health outcome did that
patient ultimately have? Now, there are still
biases built into that. But that’s at least a
way for the algorithm to learn, not from doctors
decision-making, which is, you know, error
prone and biased, but to learn from real outcomes. And so I think the more
we can get algorithms to learn from nature, from
real patient outcomes, and not
just from mimicking the decisions doctors
are making today, the better the
algorithms will be. And I think we have both
of those things in the EHR and that’s why we have
so much, you know, so much crap coming
out of the EHR and algorithms built on that. But it’s also why I’m
fundamentally very optimistic that, done the right way,
we can have algorithms that give us the
best things that we collect in the medical system. SPEAKER 1: So I’m
cognizant of time. But we’ve only taken five
more minutes of your life than we initially asked for. Please join me in thanking our
two great speakers for today. [APPLAUSE]
