MIC CAVAZZINI: Welcome to Pomegranate Health, a podcast about the culture of medicine. I’m Mic Cavazzini for the Royal Australasian College of Physicians. This is the third part of our introduction to artificial intelligence in medicine.
In the previous episodes we explained how to build machine learning models that assist in diagnostic and prognostic tasks; how you test them in the lab and then iron out the ergonomic friction points in the clinical workflow. As we mentioned, a lot of clinical AI projects stall at this “Last Mile” of translation.
They say the proof of the pudding is in the eating, and if we’re talking about clinical interventions, that means patient outcomes. But as we’ll hear, there haven’t been a lot of good quality randomised controlled trials for AI-assisted clinical processes, despite updates to the benchmark SPIRIT and CONSORT statements. And there’s a huge variety of tools becoming available in other aspects of healthcare, from consumer wearables to health system administration, that are challenging regulators like Australia’s TGA and the US FDA. To discuss these challenges I had great help from Associate Professor Paul Cooper, who is a member of this podcast’s editorial group.
PAUL COOPER: My name is Paul Cooper. I'm an affiliate Associate Professor and sessional researcher with Deakin University, and also a lecturer in health informatics.
MIC CAVAZZINI: Paul Cooper’s director at Deakin University is Sandeep Reddy, and together with other colleagues they’ve published a proposed governance model for AI in the Journal of the American Medical Informatics Association. Professor Reddy has advised the World Health Organisation and holds Fellowships with the Australasian Institute of Digital Health and the European Academy of Translational Medicine.
SANDEEP REDDY: So by background I'm a medical doctor, but in the past many years I've focused on data science, and much of my research is now devoted to AI applications in healthcare. But I also look at quality, safety, governance and regulation as part of my research.
MIC CAVAZZINI: My third guest was Brent Richards, Professor of Critical Care Research at Gold Coast Hospital and Health Service and founder of the IntelliHQ training hub.
BRENT RICHARDS: So thanks. I'm a physician intensivist and previously director of intensive care and executive director. But I've always been very data friendly, machine friendly, and felt that there was a lot more we could do in that space, in that the intensive care space has got a lot of data, but not a lot of information because we have a small number of patients. And so I've always been an aficionado of scoring systems like the APACHE scoring system.
MIC CAVAZZINI: Tell us about your own clinical practice, and how your competitor to that APACHE algorithm should be used.
BRENT RICHARDS: The APACHE algorithm stands for Acute Physiological And Chronic Health Evaluation. It's a scoring system for the severity of illness of a critically ill patient. It's a predictor of outcome, so whether a patient is more likely to survive or not survive. One of the early trials that showed its worth was there was a—I think it was an anti-TNF alpha for sepsis—and it didn't look like it worked. But when they subcategorised according to APACHE score, they found there were the very sick, where it created too much noise and wasn't going to do too much. There were the not very sick at all, where it seemed to cause harm. And there were the moderately sick, where it definitely seemed to help. And so it therefore became important that you actually worked out what set of patients you were working with, and you were actually comparing like with like.
It came about in 1984, APACHE II. In some ways it's past its use-by date, because it was generated in a day when data was hard to get. And even when they brought out the original APACHE II, they were looking at 30 parameters and pulled it down to 12. Because that's what they thought people would actually use, not because that was the most accurate. And it's been subsequently improved; that was APACHE II; then APACHE III; in Australia we use APACHE III-J; and there's APACHE IV, which is more proprietary. And it's still used in Australia today as a way to make sure that units are performing at a level that's equivalent to their peers around the country. But the original authors of the APACHE scoring system, which was Jack Zimmerman and Bill Knaus, basically said that a scoring system should be used as a drunk uses a lamppost: for support, not illumination. And I think that still is very valid today.
MIC CAVAZZINI: And so what's your machine learning version doing on top of the classic?
BRENT RICHARDS: If you look at how the APACHE score works, it's the worst value in 24 hours for your pulse, your blood pressure, your oxygen gradients, that sort of thing. Now, we know that patients change in the first 24 hours and the worst value may not be as indicative as a whole series of values. So if you look at, for example, the pulse rate, we take the lowest value in 24 hours, but our systems are recording every minute. And we even did some earlier research where we actually showed that simply looking at the standard deviation of the hourly values added value to the prediction algorithm. So what's happening in the AI space is you can then ingest far more data both in terms of breadth and depth to give a much more fine grained and more nuanced answer built into individual diagnoses, individual patients. And then you can start to take it to the next step as you more personalise the medicine that you're going to deliver to that individual.
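To make Richards's point concrete, here's a minimal sketch in Python of the difference between the classic "worst value in 24 hours" feature and the richer summaries he describes. The patient readings, feature names and hourly sampling are all hypothetical, purely for illustration.

```python
from statistics import mean, stdev

def summarise_heart_rate(hourly_values):
    """Summarise a day of heart-rate readings two ways: the classic
    APACHE-style single worst value, plus summary statistics of the
    kind Richards describes adding to the prediction."""
    return {
        "worst_hr": max(hourly_values),          # what the classic score uses
        "mean_hr": mean(hourly_values),
        "hr_variability": stdev(hourly_values),  # the extra signal
    }

# Two hypothetical patients with the same worst value but very
# different stability over the day:
stable   = [78, 80, 79, 81, 80, 82, 110, 80, 79, 81, 80, 79]
unstable = [60, 110, 70, 105, 65, 100, 72, 98, 68, 95, 75, 90]
```

A model fed only `worst_hr` sees these two patients as identical; the variability feature separates them, which is the kind of extra breadth and depth of data Richards is pointing to.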
MIC CAVAZZINI: I spoke with these three experts together, and then followed up with them individually, so what you’ll hear today is a montage of a few different conversations. This was supposed to be the final part in the series, but there’s just too much ground to cover. So keep an eye out for one more episode that will deal with adverse events and liability issues surrounding AI-assisted decision-making, and also just what impact this newfangled ChatGPT language model could have on medical practice. Today we’re going to focus on the questions asked by regulators and clinicians before they will have confidence in these novel tools. And really, this comes down to the explainability problem.
I already described in the previous episodes how there are classic machine learning algorithms that use familiar statistical principles. And then there are neural networks, capable of handling much more complex data problems with many more variables. These models vaguely resemble the neural architecture of the mammalian cortex, and they learn by tweaking the individual weighting of tens of thousands of digital synapses known as parameters.
We call this deep learning because the models are learning and solving the problem in their own way, rather than relying on a long list of explicit rules coded in advance by the programmers. The problem is that in many cases it’s actually impossible to follow the machine’s reasoning, or to know whether it follows the same logic that we humans would use. This ‘black box’ is very unsettling to some clinicians and regulators, who say that for high-stakes clinical decisions we should only be relying on models that are intrinsically explainable. Others argue that, in fact, we don’t really understand the function of many drugs, and yet we’re still able to validate their performance against quantifiable clinical outcomes. Brent Richards is in this camp when it comes to trialling the performance and safety of deep learning AIs.
BRENT RICHARDS: In medicine, we're actually fortunate because we've got a very long history of trying and testing and even working with technologies, and particularly drugs, that we didn't completely understand. I take you on the journey of theophylline, which started in the 1950s. We didn't really know how that worked. Well, we thought we did, but we didn't. And that's been through numerous iterations as to how we thought or how we think that it's worked. But while we've been using it, we've actually been testing the outcomes as we go along and putting the guardrails around it. So in many ways AI will fit somewhat into that space. And if I look, again, back to the theophylline, it's now considered to have some immunological effect. Now, I can assure you that I don't understand the immunological effects of theophylline and I would struggle to explain them or get them explained to me. So I think that we need sufficient experts in the space who can dive deep—like an immunologist who could then dive into theophylline for me.
If you look at our Phase 1, 2, 3, 4 process for bringing a new drug to market, we can pretty much align that in the AI space, where you develop a model, you test a model, you test it in concert with what you're currently doing. But then you can have a very aggressive Phase 4—which is uncommon in the pharmaceutical space. But a very aggressive Phase 4. Why? Because for the algorithm to run, it's got to collect the data. And if you're collecting the data to use the AI tool, then you can continue to collect the data to continue to monitor the outcomes. So we've actually got an opportunity which is way better than what we have in the drug space, where often the data that's collected to bring a drug to market isn't continually collected.
MIC CAVAZZINI: What do the early phases look like? I’ve heard of terms like “shadow implementation”.
BRENT RICHARDS: So the Phase 1, 2 and 3 in many ways can be a lot quicker, because you're using data that's already there and recruitment that's already there.
MIC CAVAZZINI: In that some of those data in the lab where you've been testing on images that have been captured already.
BRENT RICHARDS: Yeah, so you do that original, and then you take it out to a different, shall we say, treatment group. And that’s your Phase 2. And then Phase 3, you broaden that out again and you’re there.
MIC CAVAZZINI: And then what do you mean exactly by aggressive Phase 4 and the guardrails around that? Does aggressive mean on a scale larger than what we’re normally used to in terms of patient recruitment?
BRENT RICHARDS: When I talk about an aggressive Phase 4, I mean far more intricate than current Phase 4 trials. So it's, in some ways, aggressive against the algorithm, not aggressive for the algorithm. If you look at most of our Phase 4 trials currently, they tend to be quite soft, more about marketing. Why? Because they're quite expensive to run and quite challenging. Whereas when we're talking about using algorithms with data, we're already in the market collecting the data and using the data as part of the algorithm. And given the fact that you'll have a continuously changing landscape, both in terms of data drift and algorithm drift, then we need to be monitoring the outcomes of those AIs far more than what we do in our normal Phase 4 implementations.
MIC CAVAZZINI: And you talked about guardrails, what would they be specifically?
BRENT RICHARDS: Again, it's to make sure that we don't continue with some of the mistakes we occasionally make in the pharmaceutical space, where drugs start to get used outside of what they were envisaged for. Some drugs end up being withdrawn early in their Phase 4 life simply because they've been used off licence. Because, again, we'll have a very clear view as to where these algorithms are being used (because we're in the data), we can ensure that ongoing appropriate use is there. And if there's any movement into, I'll call it, different populations, then that's carefully monitored; that becomes part of another trial. So again, these are guardrails we have always wanted to have in the drug registration space, but we haven't had the tools to do it.
MIC CAVAZZINI: And there's maybe a greater degree of transparency, in that everyone can see what you're doing.
BRENT RICHARDS: And it's actually something that research has got a lot to learn from data science. If you look at how we often do research, we get some data and we put it into some Excel spreadsheet. And then there's a bit of magic in terms of a few pivot tables and a few alterations, and how we deal with the missing values. And then we sort of magically stick it into SPSS and push a button and we get a few more results. And all you report in the paper is, "I put it into Excel and SPSS and this is what came out." And it's completely unable to be reproduced.
That's the way we currently do business. But in the data science world, you get the data and you keep the raw data as raw data. And then you build your Jupyter notebooks and Python around that, such that you extract the data in a certain way. But all those steps are very clear. Why? Because they're programmed steps; you can actually see exactly what it is. And there are some people now getting the data, rolling it up as a version and putting it in what's called a Docker container, and then adding in the programs and saying, "Okay, you can now run exactly that same set of tests that we did for the next 10 years". And that's the rigour that's built into a lot of data science, and we could learn a lot from that.
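The contrast Richards draws, between opaque spreadsheet manipulation and scripted analysis, can be sketched in a few lines of Python. The file name, column names and cleaning rule here are hypothetical; the point is that every handling decision, including what happens to missing values, is explicit code that anyone can re-run.

```python
import csv
import statistics
from pathlib import Path

# Raw data stays raw: written once, never edited in place.
RAW = Path("raw_data.csv")
RAW.write_text("id,systolic\n1,140\n2,\n3,120\n")

def load_raw(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def clean(rows):
    # The missing-value policy is a visible, reviewable step:
    # drop rows with no systolic reading rather than silently imputing.
    return [r for r in rows if r["systolic"].strip()]

def analyse(rows):
    values = [float(r["systolic"]) for r in rows]
    return {"n": len(values), "mean_systolic": statistics.mean(values)}

result = analyse(clean(load_raw(RAW)))
```

Freezing a script like this, together with its data, in a versioned container is what lets someone run exactly the same analysis a decade later.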
MIC CAVAZZINI: Paul, in that governance paper, you and Sandeep wrote that the onus is on developers to make decision-making by deep learning machines more transparent. And there are tools like heat maps and decision trees that can help you follow the AI’s logic in a visual or sequential way. Can you explain a bit more how these work and whether they really cut it?
PAUL COOPER: In AI, for example in diagnosing diseases from medical images, heat maps show which regions of the image influenced the AI's diagnosis. That can allow humans to understand the reasoning of the model to some extent, by visually indicating which features the model considers important. Heat maps give you some form of local interpretability of the AI's decisions. Decision trees, on the other hand, are tree-like models that show a path of decisions that the AI made based on the data features. The final decision is made at the leaves of the tree, and in this way every prediction made by the decision tree can be explained by a clear path of decisions.
Now, while those techniques can increase the explainability of an AI model, they're not necessarily suitable or sufficient for all kinds of models or tasks. And more complex models, such as the deep neural networks we've been talking about recently, are much harder to interpret. Heat maps, for example, may only provide a partial view of the decision-making process. So other techniques are likely to be required; this is still a very evolving area. And there are other people with much more expertise in this to talk about that.
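As a toy illustration of what Cooper means by a clear path of decisions, here is a hand-built decision tree in Python. The thresholds and feature names are invented for the example; the point is that the model's reasoning is simply the sequence of branches taken.

```python
def classify_with_path(patient):
    """Return a prediction plus the exact branch taken at each node,
    so every output comes with its own explanation."""
    path = []
    if patient["systolic_bp"] >= 140:
        path.append("systolic_bp >= 140")
        if patient["age"] >= 60:
            path.append("age >= 60")
            return "refer for ambulatory monitoring", path
        path.append("age < 60")
        return "repeat clinic measurement", path
    path.append("systolic_bp < 140")
    return "no further action", path

decision, path = classify_with_path({"systolic_bp": 150, "age": 67})
```

A deep network offers no analogue of `path`: its "reasoning" is spread across thousands of weights, which is why adjunct techniques like heat maps are needed at all.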
MIC CAVAZZINI: Sandeep, in a commentary for Lancet Digital Health last year you didn’t seem ready to give up on explainability as a requirement. You implied that black boxes really rub against the principle of evidence-based medicine that defines this generation of medicine, and that even outcomes from RCTs aren’t bulletproof in critical decision-making. Can you elaborate on what you meant by this?
SANDEEP REDDY: So let me clarify what I meant in that particular article. We don't need explainability for every AI application in healthcare. Sometimes it's unnecessary, and cumbersome, to adopt explainability frameworks. What I was suggesting is that in high-risk situations, high-acuity medical scenarios, it's at that time that we need an explainable framework or interpretable algorithms. The first priority, or the first order, would be looking at classical, traditional machine learning algorithms, if they can yield good results. If not, then you would progress to neural networks. And when you look at neural networks, is it possible to have an adjunctive explainable framework like LIME or Shapley values and so forth, which can help, if not exactly explain how the decision was arrived at, give some sort of indication of what features and what variables were important in the decision-making process? That way, we have some level of confidence in the application. But my concern is that we have come to a situation where we completely ignore the aspect of explainable AI in favour of performance and accuracy. There is now an abundance of explainability tools that you can use to help the user know how the decision was arrived at. So there is no real excuse for people who object to explainability when AI is being applied in healthcare.
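For readers curious what a Shapley-value attribution actually computes, here is a brute-force sketch in Python against a toy risk model. The model, its coefficients and the feature values are all invented for the example; real tools like SHAP approximate the same quantity efficiently for large models.

```python
from itertools import permutations

def shapley_values(model, instance, baseline):
    """Exact Shapley attributions: average each feature's marginal
    contribution over every order in which features are 'revealed',
    with unrevealed features held at their baseline values.
    Brute force, so only feasible for a handful of features."""
    features = list(instance)
    contrib = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)
        prev = model(current)
        for f in order:
            current[f] = instance[f]
            new = model(current)
            contrib[f] += new - prev
            prev = new
    return {f: contrib[f] / len(orderings) for f in features}

# Hypothetical risk score with an interaction between two features.
def risk(x):
    return 0.02 * x["age"] + 0.5 * x["smoker"] + 0.1 * x["smoker"] * x["bp_high"]

phi = shapley_values(
    risk,
    instance={"age": 65, "smoker": 1, "bp_high": 1},
    baseline={"age": 50, "smoker": 0, "bp_high": 0},
)
```

The attributions sum exactly to the difference between the model's output on the instance and on the baseline, which is the property that makes them a principled answer to "which variables mattered for this decision".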
MIC CAVAZZINI: And you did mention a study with the Scottish health system where, in fact, deep neural networks were not always the best option.
SANDEEP REDDY: True. So we conducted a project, rather than a study, to build a decision-making tool which would allow cardiologists to refer patients for ambulatory monitoring of blood pressure. So you know, it's a chronic condition, so you are on long-term medication, so you would want to be very assured that whoever you're confirming or diagnosing with hypertension indeed has hypertension. So we devised a tool which would allow the cardiologist to confirm who indeed had hypertension and who had white-coat hypertension. You may be familiar with that term; those people who do not have hypertension, but when they come to the clinic or the hospital environment, they have an elevated blood pressure.
So we looked at various algorithmic models to help with that prediction. And we found that gradient boosting and decision tree models, which are interpretable algorithms, were much more accurate than neural networks in that particular situation. So there are quite a few studies which actually show that traditional, classical machine learning algorithms sometimes perform better than neural networks. Neural networks are not necessarily the first choice that you should be looking at.
MIC CAVAZZINI: A better use of all resources, yeah.
BRENT RICHARDS: It's maybe not quite true now, but for a long time the chess programs were better than the average human, but not the chess master. But then they worked out that if you put a computer program next to a human and got them to play together against another opponent, the two together were at Grandmaster level.
MIC CAVAZZINI: There's a few good quotes in there, like, you know, it's not that AI will replace humans, but humans with AI will replace humans without AI.
BRENT RICHARDS: Exactly. I was talking probably 20 years ago at a junior doctors’ conference. A junior doctor stood up and said, “But am I going to be replaced by a computer?” and I quipped straight back, “If you think you can be replaced by a computer, you probably should be”. There was a nervous giggle and laugh in the room. But yeah, it is a bit like that, because most of what computers and algorithms do, a lot of it's in a fairly narrow, pedestrian space.
MIC CAVAZZINI: Most listeners would be aware of the reporting standards for randomised controlled trials. The SPIRIT checklist contains 33 items that must be published even before a study is conducted: research protocol, primary outcome measures, proposed sample size and so on. Then the CONSORT statement dictates a minimum level of transparency with regards to data publication, and that researchers must account for every participant that started a trial, regardless of whether they reached the endpoint or not. In September 2020, extensions to these statements were released relating to the reporting of RCTs of machine learning devices. CONSORT-AI contains 14 new requirements, including an explanation of how AI outputs are to be used during decision-making.
That was an important step towards a universal research standard, but two years after this update a systematic review was published in JAMA Open that wasn’t especially favourable about the quality of evidence produced so far. Of 41 RCTs of machine learning algorithms for clinical decision-making, none met all the standards of CONSORT-AI. Nonadherence usually revolved around failure to assess data quality or to analyse performance errors. Less than a third of trials reported the ethnic makeup of participants in the training data. The authors observed that there were 343 US medical AI devices approved by the US Food and Drug Administration at the time, and that these must have passed a very low burden of evidence compared to drugs of the past.
So where are we at in Australia? Traditionally the Therapeutic Goods Administration hasn't been too worried about software that was bundled up with nuts-and-bolts machines. The suppliers were well known, the provenance of any code was easily traced, and the package as a whole had to meet quality and safety benchmarks. Indeed, you’ve certainly already used AI assistance for the differentiation of traces on an electrocardiogram or for image enhancement of scans and pathology slides.
But in 2021 the TGA updated its definitions, recognising that software was developing at a much faster pace than before, and could be downloaded from so many different sources onto so many different devices. Now any software involved in diagnosis, prevention, monitoring, prognosis or treatment is considered a medical device that needs registration. Since that shift two years ago there have been only four SaMDs, or “software as a medical device” products, registered by the TGA, compared to the hundreds seen abroad. I asked Sandeep Reddy if we were being Luddites or whether this caution was warranted.
SANDEEP REDDY: So I was at San Diego, as you're aware, Mic, and part of that event was a presentation by the FDA as to their work towards regulating AI. So we know that machine learning, as the name indicates, is learning about the data, learning about the context, learning about the environment. And when you deploy the algorithm in a different environment, again, it's a different process, so they don't want the vendor to go back to the FDA each time they update the algorithm. So they have something called an adaptive process, to make it easier for those vendors and developers. So they're evolving with the algorithmic development.
The TGA is rightly focusing on the end outcomes of the particular application, that is, looking at whether it's going to cause harm, or what's the risk around it. So that would apply whether it's an AI algorithm or a medical device. So from that point of view, I don't necessarily think we are behind; we are indeed looking at the end outcomes, making sure that the application doesn't result in patient harms. But where I think we are falling behind is in the lack of a specialised pathway for AI algorithms. That way, if I were a developer or a vendor approaching the TGA with an AI application, I'd have a clear understanding of what is required, but also the TGA gets what it wants from the vendor or the developer.
So from my own experience (outside my academic role, I'm also an entrepreneur), I'm going through the TGA regulatory process, and that's pretty straightforward. Where the challenge lies for small to medium enterprises is the time it takes to get approval and also the costs involved. So it's no wonder you don't see a lot of AI applications in the ARTG, which is the listing of all medical devices, whether it's medical software or a medical device. The other aspect of why there are fewer AI applications here in Australia is the market, in a purely economic sense. Is it worthwhile for the vendor or the sponsor to go through that process, when they could spend that same time in Europe or in the US and get access to a larger market?
MIC CAVAZZINI: I found it interesting that in the small pool of devices approved in Australia, only one is for use by clinicians. It’s called iPredict, not a very original name, and it’s for use with retinal imaging devices to help with early detection of diabetic retinopathy, age-related macular degeneration and possible glaucoma. But the first registered device was actually KardiaMobile, which collects ECGs in the ambulatory setting through a fingertip sensor and uses software on the consumer’s smartphone to screen for various forms of arrhythmia. And it’s been followed by software on the Fitbit and Apple Watch which does a similar thing as an “over-the-counter” arrhythmia detector. All four of these devices are Class IIa, despite their different audiences of use. So what does this classification actually tell us about their usefulness or their risk?
SANDEEP REDDY: The TGA has this classification, Class I, Class IIa, Class IIb, Class III and so forth, based on the risk and the level to which the particular application replaces clinical decision-making. So, to give an example, the highest class would cover implantable devices like IUD contraceptives, pacemakers and so forth. So you can start to see that anything that gets implanted or is penetrative has a higher risk. The other aspect to that classification is that if it's patient-facing, the risk is higher, because the patient doesn't have the knowledge to interpret it, and anything that is consumer-facing would be accorded a higher risk category. I'm not familiar with the exact specifications that were provided to the TGA by those vendors and why they have been given that particular classification. I can just guess that, based on the risk level, they would not be replacing clinical judgment altogether; there would still be some sort of clinical supervision or oversight. And that mitigates harms or risks that can arise from those devices. Though, you know, you can get a variety of applications within the same class category, because the TGA is not so much concerned about the algorithm used or the data used; rather, it's more concerned about the end outcomes.
MIC CAVAZZINI: So far we’ve talked about the difficulties for physicians around understanding, but we’ve not mentioned the patient, the consumer. Sandeep, you made some quite strong comments about this in that Lancet article; “Reliance on the logic of black-box models violates medical ethics… When clinicians cannot decipher how the results were arrived at, it is unlikely that they will be able to communicate and disclose with the patient appropriately, thus affecting the patient's autonomy and ability to engage in informed consent.” That’s an important point, but is it so different from the public’s understanding or lack of understanding of how a biomarker assay works? That it can produce false positives and false negatives some fraction of the time.
SANDEEP REDDY: So I don't want to delve too far into the future, but right now, when we look at AI applications that are already deployed, most of them are doing triaging as opposed to actual final diagnosis. So again, if you go back to the machine learning outputs, they're solving a classification problem, whether the person has a disease or not; they're actually not monitoring the patient on an ongoing basis, it's not a regression issue. So I think it will be different when we get to the point where, say, for example, an AI system is completely monitoring all the ICU patients, and alerts the ICU physicians only when something goes wrong with the vitals or with other relevant clinical signs.
But, as with any software or any machine, things can go wrong. It's at that time, when you actually want to audit it or assess it, that you don't know how it was giving you that result. And that's the concern I raised in that particular statement. But also, as part of that post-review when things go wrong, the patient also requires some level of explanation, so I think it all kind of makes sense. If it was just ordinary, low-level, low-acuity screening and examination with human oversight, that statement probably sounds harsh. But if you put it into those kinds of high-risk situations, it makes sense.
MIC CAVAZZINI: I mean, in the governance model for AI that you and Paul described transparency and trust are two of the main pillars you build in there. So would it be enough to just tell patients or their families, “I’m going to look at your scans, there will be an AI helping me interpret them?”
SANDEEP REDDY: Yeah.
PAUL COOPER: I’d just like to build very much on what Sandeep just discussed. It’s just a perspective of mine that the public view of AI at the moment is still on the fence, probably a little bit on the positive side. But I think this is likely to become even more positive when agents begin to help them with tasks such as household administration, financial planning, that sort of thing. So people will then tend to discount the risks unless they're personally affected. In my view, this means people will become increasingly frustrated with the medical profession if it's not seen to be using the latest approaches that they have personally found can save time, lower costs and improve service. So to some extent, I expect consumer pull towards medical use of AI is probably going to take the medical profession by surprise.
MIC CAVAZZINI: In your governance paper, Paul, there’s also the question of trustworthiness. Do researchers and commercial operators seek informed consent to use consumer data to train these models? There was already an episode in 2019 where Google and the University of Chicago health service were sued for sharing patient data with AI developers. Don’t those kinds of standards already exist with regards to patient records and digital health, or is there a part of the framework that needs strengthening?
PAUL COOPER: So generally, researchers do seek informed consent for use of consumer data. And there are, for example, very strong protections for access to patient medical record information. However, the history of informed consent outside of direct medical record access has not always been a happy one. For example, a couple of years ago, consumer devices such as CPAP sleep apnoea devices in the US were found to be sending consumer data through to the health insurers. Now, in some cases the usage was declared, but it was in very fine print, so to speak. And this is especially troubling with mobile health apps, which exist, as we've already discussed, in this grey regulation zone outside of direct TGA control. So some of my university research has been about developing frameworks for assessing and rating such apps. My summary would be that the frameworks and privacy protections are in place for the TGA-regulated apps and devices. But the transparency and clarity of adherence to relevant privacy protections for other health applications needs continuous scrutiny, for example by consumer groups. And that's where I think the emphasis needs to be.
MIC CAVAZZINI: Many thanks to Sandeep Reddy and Brent Richards for sharing their expertise. And a special thankyou to Paul Cooper for his ongoing support of Pomegranate Health over the last few years. He’s one of many Fellows, Trainees and other kind souls who provide editorial feedback in the development of these podcasts. You can find all of the reviewers credited on the unique web-page for each episode, along with a complete transcript and loads of linked citations and interesting webinars.
As I mentioned at the top, there’s one more episode in this series, which will tackle the medicolegal questions around AI in medicine, and also what all the fuss about ChatGPT is, and what it could mean for your practice. This field is a constantly evolving one, but rest assured that your colleagues who make up the RACP’s Digital Health Advisory Group are on the case. They’ve already developed a primer that highlights how different digital health activities fit with the three CPD categories.
One of these Fellows is Rahul Barmanray, you might have heard in a recent episode of IMJ-On Air. If you log into the RACP Online Community app you’ll also find that Dr Barmanray has started some threads about artificial intelligence and these podcasts. Feel free to carry on the discussion there, and value add to this CPD activity.
And if you’re really enthusiastic about the show, you can leave a review for Pomegranate Health through your listening app, whether that’s Spotify, Castbox, Apple Podcasts, or even the College website, racp.edu.au/podcast. That’s where you’ll find our subscription list and email address too. This podcast was produced on the lands of the Gadigal people. I pay respect to their storytellers who came long before me. I’m Mic Cavazzini. See you next time.