Ep95: Machine Learning 101

13 June 2023

Fellows of the College can record CPD credits for time spent listening to the podcast and reading supporting resources. Log in to MyCPD, review the prefilled activity details and click ‘save’.

AI-assisted healthcare is reaching maturity in many applications and could alleviate some of the capacity gap increasingly faced by health systems. Over the next three podcasts we focus on artificial intelligence tools designed to assist directly with clinical practice.

Most commonly reported on are the algorithms capable of pattern recognition on medical images, which in some settings perform as well as or better than expert diagnosticians at classifying disease. AI models have also been developed to perform regression analyses more complex than classical risk stratification aids.

The standard statistical algorithms used to solve these problems struggle when many variables are introduced, in which case deep learning models that mimic brain networks are sometimes a powerful alternative. In this episode we explain how machine learning algorithms are trained on particular tasks and where there are risks of error and bias being introduced. 

In part 2, we identify the ergonomic issues that affect practical implementation of AI tools in the clinic and in the decision cascade. And in the final episode of the series we discuss the questions that regulators and lawyers should be asking of this new technology and what role natural language processors might have in medicine. 


Dr Ian Scott FRACP MHA MEd (Director of Internal Medicine and Clinical Epidemiology, Princess Alexandra Hospital; Professor of Medicine, University of Queensland)

Produced by Mic Cavazzini DPhil. Music licensed from Epidemic Sound includes ‘Thyone’ by Ben Elson, ‘Broke No More’ by Cushy, ‘Desert Hideout’ by Christopher Moe Ditlevesen and ‘Alienated’ by ELFL. Music courtesy of Free Music Archive includes ‘Capgras’ by Ben Carey. Image by Olemedia licensed from Getty Images.

Editorial feedback kindly provided by physicians Rhiannon Mellor, David Arroyo, Aidan Tan, Joseph Lee, Rachel Murdoch, Michelle Chong and Phillipa Wormald, and by digital health academics Paul Cooper and Natasa Lazarevic.

Further resources

Demystifying machine learning: a primer for physicians [Scott, IMJ. 2021]
Clinician checklist for assessing suitability of machine learning applications in healthcare [Scott, BMJ Health Care Inform. 2021]
Cardiac imaging: working towards fully-automated machine analysis & interpretation [Expert Rev Med Devices. 2017]
Deployment of machine learning algorithms to predict sepsis: systematic review and application of the SALIENT clinical AI implementation framework [J Am Med Inform Assoc. 2023]

What is needed to mainstream artificial intelligence in health care? [Scott, Aust Health Rev. 2021]
The Last Mile: Where Artificial Intelligence Meets Reality [Coiera, J Med Internet Res. 2019]
We need to chat about artificial intelligence [Coiera, MJA. 2023]
Health Care in 2030: Will Artificial Intelligence Replace Physicians? [Ann Intern Med. 2019]
Integrating a Machine Learning System Into Clinical Workflows: Qualitative Study [J Med Internet Res. 2020]
Exploring stakeholder attitudes towards AI in clinical practice [Scott, Coiera, BMJ Health Care Inform. 2021]

A governance model for the application of AI in health care [Reddy, J Am Med Inform Assoc. 2020]
Machine learning in clinical practice: prospects and pitfalls [Med J Aust. 2019]
Evidence-based medicine and machine learning: a partnership with a common purpose [BMJ Evid Based Med. 2021]
Explainability for artificial intelligence in healthcare: a multidisciplinary perspective [BMC Med Inform Decis Mak. 2020]
Artificial Intelligence for Business: A Roadmap for Getting Started with AI [2020 book, Wiley]

Presentations and Webinars
Machine learning and low value care - promises and pitfalls [Ian Scott and Jonathan Chen]
Training residents when machines read imaging [Jaron Chong]
Artificial Intelligence and Systemic challenges [Andrew Connelly]
The impact of AI on medicine and medical education [Enrico Coiera]
AI for Healthcare: principles, challenges and opportunities [Olivier Salvado]
AI for healthcare: your questions answered [Clair Sullivan, Enrico Coiera, Olivier Salvado]
A Friendly Introduction to Machine Learning [Serrano Academy]


MIC CAVAZZINI:               Welcome to Pomegranate Health, a podcast about the culture of medicine. I’m Mic Cavazzini for the Royal Australasian College of Physicians. There are two themes that pop up in the media on a weekly basis now. The first is the inability of our health systems to keep up with the demands of a growing population that has ever more complex care needs. The second is the rising sophistication of artificial intelligence. As well as the chatbots and graphical generators making sensational headlines, AI-assisted healthcare is also reaching maturity in many applications. If implemented right, this could allow health systems to do more with less and, importantly, protect personnel from becoming overstretched and burnt out.  Machines have been developed that can predict delays in the waiting room and help triage patients. Others are being used to help rationalise design of drug molecules and vaccines. And there are chatbots that in trial settings have provided highly-rated communication to patients and mental health consumers.

But for the purposes of today’s podcast, we’re going to focus on examples of artificial intelligence that brush up against your responsibilities as a clinician; machines that can recognise pathology from diagnostic images and ECGs, and others that can assist in prognosis. In the next episodes we’ll look at how to integrate AI smoothly into the flow of clinical decision-making, and what role natural language processors might have in practice. Also, critically, what are the questions that regulators and lawyers should be asking of this new technology.

Of course, when we use the term artificial intelligence in this conversation, we don’t mean self-aware AIs that will take your job before taking over the world. We’re really talking about a sub-category known as machine learning; statistical models that recognise patterns in data and help solve specific problems. These models aren’t programmed with a long list of explicit rules, but instead they learn the requirements of the task through exposure to a training environment. Depending on the quality of that training, AI tools can develop more nuance than medical practitioners or otherwise end up amplifying human biases.

We’ll come back to that later, but let’s start with the fundamentals. There are three flavours of machine learning. The first are classifier models that answer diagnostic yes or no questions like, is the lesion on this scan a cancer or not? Secondly there are regression models, suited to perform risk estimation such as 30-day outcomes of patients admitted with PE. Both regression and classifier models undergo what’s known as supervised learning, in which they’re told explicitly what a case looks like during the training phase. That’s in contrast to the third category of machine learning, which doesn’t need human supervision to sort the data. To guide me through all this I spoke with Professor Ian Scott in Brisbane. He has a great review in the Internal Medicine Journal titled “Demystifying machine learning: a primer for physicians”. Please excuse the poor recording we got over the online chat.

IAN SCOTT:         Professor Ian Scott, Director of Internal Medicine and Clinical Epidemiology at Princess Alexandra Hospital. And I also chair the Metro South clinical AI Working Group and the Queensland Health sepsis AI working group. And I'm Professor of Medicine at the University of Queensland.

MIC CAVAZZINI:               Thanks for coming back on the podcast a third time, I think that’s a record. First, I want to skim over some of the background concepts. So at the risk of oversimplifying, data problems come in different shapes and it takes different statistical models to address them. Unsupervised models, by definition, don’t require training but they’re good at sorting data sets into clusters. So just to give one example, and I think I pinched this from one of your lectures, from a 2019 paper in JAMA authored by researchers in Pittsburgh. They had electronic medical records from over 60,000 patients who had developed sepsis in hospital. They filtered the data down to 29 patient variables including demographics, vital signs, markers of inflammation. And the learning machine then churned away to find that the patient data clustered around four broad sub-phenotypes of sepsis. Patients with renal dysfunction were distinct from those with inflammation and pulmonary dysfunction and separate again from those with liver dysfunction. And the biggest group was made up of those with lowest administration of a vasopressor. Very briefly, what is this k-means clustering model doing that more classical multivariate statistics might not be able to handle? 

IAN SCOTT:         Essentially, I mean, what they're trying to do is they’re taking all these variables, and they're trying to then group patients according to some commonalities, common characteristics. So there's minimal variation between the patients within the cluster, so the actual mean distance between individuals in the cluster is very small. But the distance between the clusters is maximal; it's made to be as large as possible. So you’re really trying to separate them out as newly identified groups. A multivariate analysis doesn't really allow you to do that, okay. Multivariate analysis, in many respects, is just trying to average or predict a risk in a whole group of patients, taking into account various other covariates or other influences on that particular outcome. So it doesn't actually allow you to cluster. It just simply allows you to predict a risk amongst a group of individuals.
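To make the clustering idea above concrete, here is a minimal sketch of k-means in plain Python. It is an illustration only and not the method or data of the JAMA sepsis study: the six "patients" are made-up pairs of standardised measurements, the function names are hypothetical, and the starting centroids are simply the first k points (real implementations usually start from random or more carefully chosen points).

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    centroids = list(points[:k])  # simplification: first k points as starting centroids
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster happens to be empty.
        centroids = [mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters


# Made-up data: two standardised measurements per hypothetical patient.
patients = [(1.0, 1.0), (8.0, 8.0), (1.0, 2.0), (8.0, 9.0), (2.0, 1.0), (9.0, 8.0)]
centroids, clusters = kmeans(patients, k=2)
print(clusters)
```

Patients within each resulting group sit close to their own centroid, while the two centroids end up far apart, which is the "small distance within, large distance between" property described above.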

MIC CAVAZZINI:               Identifying sub-phenotypes as we’ve described could help to personalise management strategies for the most benefit. But the models themselves are more likely to be used by clinical researchers, rather than practitioners making decisions in real time. We’ll describe how diagnostic AIs are trained and tested in a moment. But it needs to be said that like any statistical algorithm, machine learning models have trouble finding patterns in noisy data and this problem is amplified the more variables you expect them to deal with. To address this, you can apply reinforcement learning. This is a bit like training a dog, in that you reward correct responses and penalise mistakes. As you can imagine, this speeds up training significantly and is best suited to very dynamic problems. Or you can get different models to work together to smooth over errors made by any individual algorithm. This is called ensemble learning, and there are different permutations where models are variously stacked in parallel or in series.
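As a rough sketch of the "parallel" form of ensemble learning just mentioned, the snippet below lets three deliberately simple rules vote on each case and takes the majority answer. The rules, their thresholds and the two-number cases are all hypothetical and carry no clinical meaning; real ensembles combine trained models rather than hand-written rules.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common prediction in the list."""
    return Counter(predictions).most_common(1)[0][0]


def ensemble_predict(models, case):
    """Ask every model for a prediction and return the majority answer."""
    return majority_vote([model(case) for model in models])


# Three hypothetical weak rules, each looking at the case differently.
def rule_a(case):
    return case[0] > 0.5


def rule_b(case):
    return case[1] > 0.5


def rule_c(case):
    return case[0] + case[1] > 1.0


models = [rule_a, rule_b, rule_c]

first = ensemble_predict(models, (0.9, 0.3))   # rule_b disagrees but is outvoted
second = ensemble_predict(models, (0.2, 0.6))  # rule_b says yes but is outvoted
print(first, second)
```

With an odd number of yes/no voters there is never a tie, and an error made by any single rule is smoothed over as long as the other two get the case right.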

But the big one you always hear about is deep learning. These models aren’t built from classic statistical algorithms like the ones mentioned so far but are instead loosely modelled on neurobiological architecture. There are neural networks suited to solving different types of problems, but one of the most advanced fields is computer vision.  Remember those lectures back at uni on the layers of the human visual system. Photoreceptors in the retina are activated by light and in turn they activate bipolar and ganglion neurons. Even within those few processing layers, information from adjacent visual fields is convoluted, such that downstream neurons in the thalamus are able to recognise steps in contrast. As you progress through the cortical association areas that information is convoluted further, leading to neurons that can recognise edges, then primitive shapes, and finally, objects, hands and unique faces.
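The way a convolution picks out a step in contrast can be shown with a few lines of Python. This is a minimal sketch on a made-up 3 by 6 greyscale "image"; the function name and kernel are illustrative, and as in most deep learning libraries the operation is strictly a cross-correlation (the kernel is not flipped).

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image and sum the element-wise products
    at each position (no padding, so the output is slightly smaller)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Made-up image: dark on the left (0), bright on the right (9).
image = [
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
]

# Responds only where a pixel is brighter than its left-hand neighbour.
vertical_edge_kernel = [[-1, 1]]

edges = convolve2d(image, vertical_edge_kernel)
for row in edges:
    print(row)
```

The output is zero wherever brightness is uniform and non-zero only at the boundary between dark and bright. A convolutional neural network stacks many such filters, and learns the kernel values rather than having them written by hand.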

The brain’s visual system is made up of a few hundred million neurons connected by hundreds of billions of synapses. By contrast, Google’s leading computer vision model only has around 25 million connections known as parameters, organised into 42 layers. Just like the brain’s synapses, each of these parameters gets strengthened or weakened depending on its contribution to meaningful neural output. This particular model, known as Inception version 3, is not the biggest example of a convolutional neural network, but it’s a good compromise when it comes to accuracy and processing speed. Just as human vision gets tuned up in the six-month critical period at the start of an infant’s life, the Inception model is pre-trained on over a million images. This gives it a basic visual grammar before it then undergoes more focused training by researchers.

Convolutional neural networks, or CNNs, have been variously trained to spot cancers from thyroid ultrasounds, chest CTs and mammograms. One of the first tools registered with the TGA helps with early diagnosis of various retinal diseases. Echocardiograms, ECGs and endoscopic images are also amenable to computer-aided classification. But for the next walk through I’ve picked a Nature paper from 2017. The team from Stanford University trained the Inception model to classify skin lesions by exposing it to almost 130,000 images. They say this was a training set two orders of magnitude bigger than had ever been used for this purpose. I asked Ian Scott why this was significant, but also what the steps of filtering, cleaning and labelling the training data are all about.

IAN SCOTT:         Well, I think when it comes to deep learning, and that's what that application is, then the more complex the model, then the more images that you need for training, okay? I mean, that's the general rule, there's some changes made in more recent times, but basically, the more complex the model then the more data you need. And any model, whether it be unsupervised or supervised, really is dependent on the data that is fed into it. Okay, so garbage in equals garbage out. So if you're putting in wrong data, incomplete data, faulty data and noisy data that’s not clean and properly entered accurately, that's going to interfere with the model. As a result, in terms of filtering and cleaning, what that's about is taking out stuff that's going to make the model more inaccurate. In other words, if you’ve got badly-photographed images, or smoky images, or there's a lot of sheen on the image, or it's sort of orientated wrongly, it's got the wrong angle to it, etcetera, okay. Then the programmers, the data scientists will try to remove those images because they're just going to cause more…

MIC CAVAZZINI:               It slows the learning.

IAN SCOTT:         It slows the learning. Okay, so what you want is good quality images. Now, the next step, though, which is different, is labelling. So, in other words, you want to train the algorithm, knowing that in this particular image, this patient had a BCC, or keratinocyte or melanoma or whatever it might be. Now, who actually decides what's on that image? Well, it's the dermatologists. How do they label the image? Well, preferably, what you want is a panel of pretty expert dermatologists who look at the image independently of each other and decide what they think this is, what’s most likely and perhaps least likely. And then you get them to come together as a group. And then you say, right, here's what we sort of saw as individuals, now we need to come to some consensus as to what are we going to label this as, definitely yes, definitely no, as to whether it's melanoma or keratinocyte etcetera. And that's the effortful and time-consuming part because you need to then get a panel of clinicians together to actually do that labelling.

MIC CAVAZZINI:               And you’ve sort of answered my next question. As you say, the aim was to differentiate keratinocyte carcinomas from benign seborrheic keratoses and malignant melanomas from benign nevi. But are they only choosing black and white cases or are borderline cases included too? Is there a degree of clinical subjectivity in the labelling, or do you want them to be as black and white as possible?

IAN SCOTT:         Well, we try to get them as black and white as possible. There's always going to be a certain degree of subjectivity here. But at the same time, I think we need to realize, well, we can't just rely on the algorithm given perfect images, where there's 100% consensus of all dermatologists that this is definitely melanoma or something else. Having said that, though, I mean, you don't want a lot of disagreement, either. That's not going to help either. So you want some sort of in between. And I think this really is again, the challenge of a lot of machine learning, is that, okay, where do you draw the line at what you put in as a good quality and an agreed diagnosis versus what you don't put in? And that's why I think, again, we need to always consider that the model in the way it's trained and developed can then perform differently once in a noisy real world situation where then you do have images that maybe have somewhat poor quality, so it's not going to perform as well.

MIC CAVAZZINI:               There was noise, or at least variability, in terms of the skin conditions. The authors say there were more than 2,000 labelled conditions in the image set and the model built a taxonomy around three branches: benign, malignant and non-neoplastic. So I guess that allowed them to see that the model was using a reasoning that they could share. And then to see how well the model is learning you need to test it on data it hasn’t seen before. The authors say that a set of 50 to 100 images was enough in the testing phase. If the model isn’t quite performing as accurately as you’d hope, do you just keep exposing it to more data, or do you intervene to give the model more guidance in its learning?

IAN SCOTT:         Well, a bit of both. I mean, what you're doing there is you're saying, righto, where's the model disagreeing? Or where's the model prone to error? So are there particular image types, or are there particular types of lesions where it's getting it more consistently wrong. In which case then you feed it more examples of those, so it then can try to learn its errors, so to speak, and then improve over time.

MIC CAVAZZINI:               Now during this sort of testing you can reveal some awkward findings. The same researchers published what you might call a bloopers paper, showing one of their models got very accurate at picking malignant lesions, but for the wrong reasons. Can you describe this paper and tell us what the “Clever Hans” phenomenon is, based on a German horse that was apparently good at arithmetic?

IAN SCOTT:         Apparently so. Well, the Clever Hans phenomenon is where, okay, the model makes the right prediction but it's using the wrong features of the input data. It’s using what we call metadata, so there's other things that are put into the image that the algorithm is using to make a prediction, but it's not really intrinsic to the image itself. So yes, if someone's put a pen or a marker to say that, you know, this is a melanoma and they’ve outlined it because it might be a serial photograph taken at one point in time and then at a subsequent time. And if those markings, for example, are more commonly done in people where the dermatologists are worried this could be malignant rather than benign, then it'll learn, okay, well, anything that's got a circle around it is probably going to be more likely to be a melanoma. So you need to make sure that those artifacts then are totally removed. And it’s just looking at the raw image.

MIC CAVAZZINI:               Yeah, pen marks, and I think rulers are another one. And there was a similar paper where an AI that had learnt to diagnose pneumothorax from chest x-rays was latching onto the presence of chest drains as a predictor of disease. Those authors call this “hidden stratification” and it’s a well-recognised problem. One of the go-to examples, the classic examples, was when Google’s Inception was in its early stages it was trained to distinguish pictures of dogs and wolves. But when it misclassified a husky dog, the researchers realised that it was thrown by the presence of snow, which was found in a lot of images of wolves. These are true associations, but they’re obviously not meaningful. How do you prevent the machine making such mistakes? Do you just try and clean the images up, or clean the training set up more?

IAN SCOTT:         That’s right, yeah, you need to diversify the training set so it doesn’t include all those wolves with snow in the background. So again, it'll learn and associate according to what it sees most frequently. It's just a simple correlation with the frequency of a certain background. So you just need to make sure that, okay, the training data set contains a diversity of images, or in this case, image backgrounds.

MIC CAVAZZINI:               One of the promises of machine-driven classification is that it would cut out the biases of the human diagnostician. But ironically, human biases can actually be crystallised and amplified if the models aren’t trained on a diverse dataset. There have been some cringeworthy fails outside the medical sphere. In 2020 Google’s Vision Cloud identified a white hand holding a digital thermometer as an “electronic device”, while a black hand with the same device was said to be holding a gun. In the same year a black man was wrongfully arrested in Detroit on the back of a match made by facial recognition software used by the police. Such systems are better at recognising white faces than black faces. In the clinical context, how does it bode for the generalisability of this dermatology app that all the images shown in that research paper were of lesions on white skin?

IAN SCOTT:         Well, that's right and I think that’s been borne out. In fact, no matter what type of model you’re using, if your training set isn't representative of the total target population that you're looking at, in other words, you're leaving out certain racial groups or older patients, then again, you're biasing the model because then it's going to perform well on the populations you've entered, but it's not going to perform well on the populations that aren't. And we know that a number of AI applications in dermatology didn't perform well once they were applied to people of different race and colour. In fact, I think the American College of Dermatology has put out a paper a little while ago looking at what are we going to say are the standards of images that are going to be used for AI applications. And there was a number of things technically around, as I said, quality of images, getting rid of smudging and so forth, but it was also to make sure that the representation of racial groups and different ethnic minorities is there as well.

MIC CAVAZZINI:               Eventually you get to a point where you want to validate your model’s performance in comparison to human diagnosticians. In this example, it was conducted using over 100 images from each disease classification—diagnoses that had been corroborated through biopsy. Twenty-one board certified dermatologists were presented with the same image set and asked whether each lesion could be safely left or should be investigated further.

The machine performed as well as the best dermatologists. Sensitivity and specificity were both over 90 percent and the area under the ROC curve came to 0.94. You can go back to episode 70 to learn how a receiver operating characteristic curve is plotted. But all you need to know now is that the area under the curve is a simple metric for the discriminating power of a test. An area of 0.94 is pretty impressive, but results from ‘in silico’ validation don’t necessarily translate to real world settings. For a start, you want to make sure that the computer vision model can handle images from different sources. For example, there were differences in performance when dermoscopic images alone were used. Now imagine you’re talking about AIs to analyse MRI scans, which might come from machines with different magnets and different detectors that greatly affect the image characteristics. And there’s a big difference between performance in an experimental setting with no consequences, and clinical utility in real time. As we’ll hear in the next episode, that really depends on where the AI fits into the decision sequence and how its output is presented to the clinician.
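One way to read the area under the ROC curve is as the chance that a randomly chosen diseased case receives a higher score from the model than a randomly chosen healthy case. The sketch below computes it that way by comparing every such pair. The scores are made up for illustration and are not taken from the Stanford paper; the function name is hypothetical.

```python
def area_under_roc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs in which the positive case
    gets the higher score; ties count as half a win."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Made-up model outputs (probability of malignancy) for eight lesions.
malignant = [0.9, 0.8, 0.6, 0.4]
benign = [0.7, 0.3, 0.2, 0.1]

auc = area_under_roc(malignant, benign)
print(auc)  # 14 of the 16 pairs are ranked correctly
```

An area of 1.0 means every malignant lesion outranks every benign one, while 0.5 means the scores are no better than a coin toss.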

Now, only a few of you listening will be directly responsible for diagnostic imaging involving machine learning guidance. But pretty much every speciality has a handful of go-to clinical decision aids used for risk assessment and prognosis. So what advantages can AI bring to these statistical problems? Classical decision aids are usually very simple algorithms. For example, the CHADS2 for estimating stroke risk in patients with atrial fibrillation only has five entries and these are all yes or no answers that can be scored by hand. Previous history of diabetes or hypertension or age over 75 each add a point to the risk calculator. But we know already that hypertension doesn’t just kick in at 130/80 mm Hg. Population research tells us that risk of cardiovascular events increases on a pretty linear gradient over the whole range of blood pressure.

But that’s just looking at one variable. If you focus on those patients with diabetes, that line would likely be steeper. And in some study cohorts, it’s been reported that the risk is actually on a non-linear J-shaped curve. There are more nuanced algorithms than the CHADS2, like the seven-item Framingham score, which weights the risk incurred by high blood pressure even more if that’s the value achieved after treatment. Similarly, cholesterol levels are bracketed into five tiers and then weighted differently across five age groups. More modern derivations like the Reynolds Score, the QRISK3 and the American College of Cardiology risk calculator estimate in much the same way the ten-year likelihood of a major cardiovascular event.

But they all have the same limitation. According to the authors of a 2017 paper in PLoS One; “There remain a large number of individuals at risk of CVD who fail to be identified by these tools, while some individuals not at risk are given preventive treatment unnecessarily. For instance, approximately half of myocardial infarctions and strokes will occur in people who are not predicted to be at risk of cardiovascular disease. All standard CVD risk assessment models make an implicit assumption that each risk factor is related in a linear fashion to CVD outcomes. Such models may thus oversimplify complex relationships which include large numbers of risk factors with non-linear interactions.”
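The additive assumption criticised in that quote is easy to see when a hand-scored aid such as the CHADS2, mentioned earlier, is written out as code. This sketch follows the standard published scoring as I understand it (one point each for congestive heart failure, hypertension, age of 75 or over and diabetes, and two points for prior stroke or TIA); the function and argument names are illustrative only.

```python
def chads2_score(heart_failure, hypertension, age, diabetes, prior_stroke_or_tia):
    """Sum fixed points for each risk factor that is present.
    Every factor adds the same amount regardless of the others."""
    score = 0
    score += 1 if heart_failure else 0
    score += 1 if hypertension else 0
    score += 1 if age >= 75 else 0        # age is reduced to a single yes/no cut-off
    score += 1 if diabetes else 0
    score += 2 if prior_stroke_or_tia else 0  # the "S2" in CHADS2
    return score


# Hypothetical patient: 80 years old with hypertension and diabetes.
example = chads2_score(
    heart_failure=False,
    hypertension=True,
    age=80,
    diabetes=True,
    prior_stroke_or_tia=False,
)
print(example)
```

Because each condition contributes a fixed amount, the score cannot express a blood pressure gradient, a J-shaped curve, or a risk factor that matters more in the presence of another. Those are the relationships the machine learning models are meant to capture.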

The researchers from the University of Nottingham wanted to see if machine learning could do better at handling these messy relationships than the American College of Cardiology algorithm. On top of the eight clinical entries into the ACC decision aid, they extracted another 22 variables from the general practice records of 378,000 patients. All patients were free of cardiovascular disease at the baseline date, and after ten years of follow up almost seven percent of them had experienced cardiovascular events. The majority of the patient records were used to train four different machine learning models. Remember how in the earlier example we were labelling every individual image as a case or not a case? For these regression analyses we’re instead relying on accurate case coding in the electronic medical record.

Around 83,000 patient records were held back for validation purposes. From that cohort, the ACC algorithm accurately predicted 63 percent of incident cardiovascular events. Specificity was 70 percent, so this gave an area under the ROC curve of 0.73 for the classical decision aid. All four of the machine-learning models did significantly better than this, with some close competition between the logistic regression model and a gradient-boosting ensemble. But the winner was a neural network that had an area under the curve of 0.76. I asked Professor Scott how significant a 0.03 improvement in predictive performance would be in the context of ten-year cardiovascular risk.

IAN SCOTT:         Yeah, it's not a huge increase, but it's certainly significant. I think once you're getting differences in the area under the curve of about 0.05 or more that starts to become clinically useful. I mean it certainly reclassified up to about 7.6 percent of patients into a different group. So that is clinically meaningful. Particularly when you're applying it to a very large population. So certainly, then the number of people involved starts to become significant.
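The figures quoted for the ACC algorithm, 63 percent of events predicted and 70 percent specificity, are both simple ratios from a two-by-two table. The sketch below shows the arithmetic. The counts are round, made-up numbers chosen only to reproduce those two percentages; they are not the study's actual counts.

```python
def sensitivity(true_positives, false_negatives):
    """Share of people who had an event that the tool flagged as high risk."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Share of people without an event that the tool left unflagged."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical cohort: 100 people with an event, 1,000 without.
sens = sensitivity(true_positives=63, false_negatives=37)
spec = specificity(true_negatives=700, false_positives=300)
print(f"sensitivity {sens:.0%}, specificity {spec:.0%}")
```

In this toy cohort 37 events are missed and 300 people are flagged unnecessarily, which is why even a modest reclassification can affect many patients when a tool is applied across a large population.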

MIC CAVAZZINI:               Surprisingly, the top ten risk factors driving the prediction made by the neural network were quite different from those that feed the American College of Cardiology algorithm. Only gender, age and smoking status made it onto both lists. The AI considered atrial fibrillation, kidney disease and corticosteroid use to be important drivers but downplayed classic variables like cholesterol and treated blood pressure and diabetes status. Is this telling us something we didn’t know before about cardiovascular risk?

IAN SCOTT:         I think some of those things, some of those factors that the model picked out, we were sort of aware of but they hadn't actually been really quantified and fully integrated into the Framingham formula. I mean, we need to remember who actually invented the Framingham formula? Well it was back in Framingham in the 1960s.

MIC CAVAZZINI:               What information did they have available?

IAN SCOTT:         The factors they thought were significant, and also, that they could measure pretty reliably in every individual. And they had good measures of that, they were objective data in terms of blood pressure, age, sex, etcetera. Okay, the benefit of machine learning is that it can take a large number of variables. You can develop regression models up to a certain number of input variables and after that it starts to become very murky. Machine learning, I think the benefit of that is that it can take a whole lot of features, a whole lot of different variables and try to work out, okay, what effect is each of these having on the outcome of interest? So, I wasn't necessarily that surprised when it picked up these things. Because, yes, we know about, for example, chronic kidney disease. Well, we know that chronic kidney disease is a risk factor for cardiovascular disease…

MIC CAVAZZINI:               We just didn't know how much.

IAN SCOTT:         That's right. What kills people with CKD? It's cardiovascular disease. Similarly, socioeconomic status. Well, that's a proxy for lots of other things; i.e. less exercise, poor dieting, obesity, and just poor preventive health. So it's not necessarily that surprising, but okay, yes. It just emphasizes, and I think this is the usefulness, that that is an important factor that we as a society need to think about. So we can talk about statins and other things and cholesterol, but if you're from a poor socioeconomic group, or for example, you're from a First Nations group—because that's another group that has a very poor cardiovascular history, and moreso than you would predict just based on the Framingham risk factors. So there's obviously a genetic and racial component there in terms of your vulnerability to cardiovascular disease. The model just simply is bringing this out.

MIC CAVAZZINI:               Well, the neural network was the only one of the models that brought up ethnicity, socioeconomic status and severe mental illness as predictors of CVD. In fact these three factors made the top five drivers. The authors didn’t know what to make of that but acknowledged that one danger of feeding too many variables into the model is it can lead to “over-fitting” and “implausible results.” What does this mean and how would you tell if it was implausible?

IAN SCOTT:         I think that what happens with overfitting is that once you then apply it to another data set, particularly an external validation set, now you’re applying it to a population of patients that are distant in time or geography, and that the algorithm hasn't been exposed to before, and that's when you find, okay, this model is overfitting, because it's now giving us quite ill-matched results. I think the other thing is there's always going to be a bit of a clinical reality check. In other words, if any algorithm is starting to identify factors that we feel are very counterintuitive, you know, that doesn't make any sense whatsoever, there's no biological mechanism that should account for it, then again, we would have to just say, well, let's have another look at this and just see whether it's overfitting things that really have absolutely no clinical value at all.
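
As a rough illustration of the overfitting described above, the sketch below uses only simulated data, not data from any study discussed in this episode. The cohort generator, the 30 per cent label noise and the "memoriser" model are all assumptions made for the example. The memoriser copies the outcome of the most similar patient it saw during development, so it looks perfect on the development cohort and then loses accuracy on a fresh cohort standing in for an external validation set.

```python
import random


def make_cohort(n, rng, noise=0.3):
    """Simulate n patients with one risk score x in [0, 1].

    The underlying outcome is 1 when x > 0.5, and a fraction `noise`
    of outcomes is flipped at random (hypothetical numbers).
    """
    cohort = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y
        cohort.append((x, y))
    return cohort


def memoriser_predict(train, x):
    """Overfitted model: copy the outcome of the nearest training patient."""
    nearest = min(train, key=lambda patient: abs(patient[0] - x))
    return nearest[1]


def threshold_predict(x):
    """Simple model: a single cut-off with no memorised detail."""
    return int(x > 0.5)


def accuracy(predict, cohort):
    """Fraction of patients whose outcome is predicted correctly."""
    return sum(predict(x) == y for x, y in cohort) / len(cohort)


rng = random.Random(0)
development = make_cohort(200, rng)  # data the model is built on
external = make_cohort(200, rng)     # stands in for an external validation set


def memoriser(x):
    return memoriser_predict(development, x)


print("memoriser, development:", accuracy(memoriser, development))
print("memoriser, external:   ", accuracy(memoriser, external))
print("threshold, development:", accuracy(threshold_predict, development))
print("threshold, external:   ", accuracy(threshold_predict, external))
```

The gap between the memoriser's development and external accuracy is the signature of overfitting. The simple threshold model should show a much smaller gap, because it has not learned the noise in the development cohort.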

MIC CAVAZZINI:               The authors were quite pleased with their results. They say that, by contrast, with a more rational approach in the past, people have tried to incorporate scores for C-reactive protein into their algorithms, which yielded a negligible improvement in predictive power. So would it be fair to say that machine learning is so powerful because it’s agnostic to a priori hypotheses?

IAN SCOTT:         Yes, that's right, it is. And you're right, and it's not driven by any hypothesis. And that's why on that basis, machine learning algorithms can't necessarily take spurious results either. It has no reasoning power, it just simply learns patterns.

MIC CAVAZZINI:               In a webinar that you and Stanford Professor Jonathan Chen gave, he said that an AI trained on unfiltered medical data might well tell you “patients who had a visit from a palliative care physician were more likely to die.” It’s a true association of course but not a useful one. Models don’t always perform as well as you hoped for. One study published in JAMA Cardiology was trying to improve on prediction of 30-day readmissions in patients admitted with heart failure. But none of the six models they trained up did any better than the classical decision aids. The authors suggested that the standard regression models they were using may have been stumped by non-linear relationships in the data. So they were all using standard statistical models, but if they had taken the neural network used in the previous paper we were describing, could that have potentially handled the nonlinearity in this data set?

IAN SCOTT:         Good chance, it may have, I guess. It deals with nonlinear high dimensional data. So I think that that is the benefit of it. But it’s just that, again, it needs a little more data to then be able to give you a correct model. So it comes down to well, to what extent were the authors limited by how much data they had available to them?

MIC CAVAZZINI:               Interesting. So yeah, because it starts out to be less specific than the classic statistical models, it takes a bit longer, a bit more training to get good at what you want it to do.

IAN SCOTT:                         Yeah and with a standard regression model, you can say, right, well, I've got so many variables I'm going to put into the model, and I have so many outcome events, and I have a certain population. Now, we can work out, to build a model that is going to be statistically robust from an accuracy point of view, these are the numbers of cases that I'm going to have to put in, the number of patients I need to involve. No one really has worked out exactly what a sample size is for deep learning. Well, people have tried and there’s some provisional guidance. But, you know, is it 10,000 or 20,000, or 30,000 or... I mean, how big a set do you need? People just go by a rule of thumb, “Well, the more complex the model, the more data I probably should have.” But exactly how much data I need, well, that's a bit of an open question.
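
To give a feel for the kind of sample size arithmetic that is possible for a standard regression model, here is a minimal sketch. It assumes the commonly quoted, and much debated, heuristic of roughly 10 outcome events per candidate variable. That heuristic is not named in the conversation above, and the function name and example numbers are hypothetical. No equivalent simple formula is offered here for deep learning, in line with the open question just described.

```python
def min_patients(n_variables, events_per_1000, events_per_variable=10):
    """Rough minimum cohort size for a regression model.

    n_variables:         candidate predictors going into the model
    events_per_1000:     expected outcome events per 1000 patients
    events_per_variable: heuristic target (10 is an assumption, not a rule)
    """
    if n_variables < 1 or events_per_1000 < 1 or events_per_variable < 1:
        raise ValueError("all arguments must be positive integers")
    events_needed = n_variables * events_per_variable
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-events_needed * 1000 // events_per_1000)


# Hypothetical example: 8 predictors, outcome seen in 50 per 1000 patients.
print(min_patients(8, 50))  # 1600
```

With eight predictors and an outcome that occurs in 5 per cent of patients, the heuristic asks for 80 events, and so about 1600 patients.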

MIC CAVAZZINI:               To give one final example from a 2016 paper in the journal Academic Emergency Medicine.  Risk stratification for patients admitted with sepsis is important for triage and allocation of resources. And decision aids like the MEDS, the REMS and the CURB-65 were shown to perform with area under the curve measures between 0.71 and 0.73. But a simple machine learning model trained on over 500 clinical variables had an AUC of 0.86. The authors reckoned that would have meant 370 additional patients correctly identified in a year by the Yale University health system. I notice, compared to the classifier models we were talking about before, the benchmark comparison now is no longer “Does the AI perform as well as a human diagnostician” but “Does the AI perform as well as the classical decision aid.”
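
For readers unfamiliar with the area under the curve measure quoted above, the following sketch shows one way to compute it, using made-up scores rather than data from the sepsis study. It treats the AUC as the probability that a randomly chosen patient who had the outcome was given a higher risk score than a randomly chosen patient who did not, with ties counted as half. The function name and toy values are assumptions for illustration.

```python
def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison.

    scores: predicted risk for each patient (higher means higher risk)
    labels: 1 if the patient had the outcome, 0 otherwise
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive case correctly ranked higher
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(positives) * len(negatives))


# Toy data: four patients, two of whom had the outcome.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
print(auc(toy_scores, toy_labels))  # 0.75
```

On this scale 0.5 is no better than a coin toss and 1.0 is perfect ranking, which is why a move from around 0.72 to 0.86 is a meaningful improvement in discrimination.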

IAN SCOTT:                         You could also argue, first of all, even before we then use a deep learning method, does the decision aid actually do better than a gestalt judgment by a clinician? And having said that, there was a recent review that I've looked at that shows, in fact, that a lot of these decision rules aren't much better than clinical judgment. So I think we just need to keep that in mind before we get too far in saying we can always do better than the clinician. Well, perhaps, in some cases, you can't. But also, a lot of clinicians don't use decision aids. You referred to MEDS and REMS and CURB-65. Well, we did a study in this hospital of people coming in with pneumonia, and CURB-65 was not consistently used, okay? So people were just making their own judgment as to whether their patient needed to be admitted, and what antibiotics etcetera needed to be given. So we need to just keep that in mind.

I think the second thing is that, okay, what we should be doing in most of these studies is comparing the model against best clinical judgment. But that comparison has to be, again, in a real world setting. And a lot of the studies that have compared clinicians with the model really haven't done that in a rigorous sense, knowing how people actually gather and process data.

MIC CAVAZZINI:               Interesting what you say about the decision aids. Could it be said they’re so oversimplified that they're actually considering fewer variables than even the clinician can with just their intuition?

IAN SCOTT:         That's right, exactly. So if that model was then integrated into an electronic medical record, and then could give an output on the screen to the clinician, saying, “Well, this is the risk of mortality in this patient, and we have this degree of confidence in it.” Okay, then clinicians wouldn't ignore it, I think they’d totally take that into their judgment. They wouldn't necessarily rely on it exclusively, but it would certainly inform their decision making. I think that's where we ultimately want to get to, but we've got to actually do the proper clinical studies to show that clinician plus tool is better than clinician alone.

MIC CAVAZZINI:               Last point. An interesting observation from these researchers was that models like this can always be retrained on the EMR data from specific populations they are serving in different places, or they can even adapt to demographic changes over time. They say that by contrast, traditional guidelines often take years to be updated by the slow-turning cogs of professional consensus. Do you agree with that?

IAN SCOTT:         Yes, I think we know that guidelines in the past sometimes have been too generic. So in the guideline group that I'm working with now, to revise the acute coronary syndrome guidelines, we're very much aware that we need to qualify these recommendations according to perhaps geography, according to particular racial groups, and in particular, First Nations people, for example. Women also, in the past, really were not recognized as having acute myocardial infarction because they present in a different way. So yes, as we go through we find more and more that there's a bit that we didn't know about or account for, and we have to fine tune everything.

Perhaps if we train these models on local populations and make sure that they're well and truly calibrated, then that may be more useful than a more generic guideline, which is trying to cover a much larger and diverse population. Over the lifecycle, how do we make sure this model continues to keep up with changes in demographics, changes in clinical practice, changes in the way we actually diagnose and treat patients, so that it still remains accurate in the population that we're treating? So no matter whether you're talking about guidelines or machine learning models, we've got to continue to fine tune and continue to update them.

MIC CAVAZZINI:               Many thanks to Ian Scott for guiding me through this tricky material. You can read up on it in more detail in his article “Demystifying machine learning: a primer for physicians”  or another useful explainer in the BMJ titled “Clinician checklist for assessing suitability of machine learning applications in healthcare.” I’ll be speaking with Professor Scott’s co-author on that in the next episode. As Professor Enrico Coiera says, very few AI decision aids make it out of the lab and fit in with the ergonomics of the clinical workflow. And as we’ll hear in the third episode, regulators and courts are still getting their heads around this technology.

You can find plenty more articles and webinars at our website racp.edu.au/podcast. Click on the episode link and you’ll also find a complete transcript with every research academic citation embedded into it. I’ve also linked to some other recent additions to the RACP Online Learning platform, such as the Decoding CPD elearning course and a Medflix video called CPD Simplified. These should answer most of your questions about the myCPD framework, built on Performance Review, Outcome Measurement and Educational Activities like listening to podcasts.

I also want to give a plug to the RACP Online Community. Every member of the College already has a user profile and you can login on your computer or smartphone. Just search for “RACP-the ROC” in your app store. It’s basically a networking platform for physicians, where you can search for members by location or by specialty and find mentors or mentees. There are threads open to all members, and others defined by speciality groups. I think it would be a great place to start a journal club space to carry on the themes of discussion from Pomegranate Health. But the ROC is entirely member driven, so it’s up to you to get the ball rolling.

You can send feedback and ideas for future stories directly to me via the address podcast@racp.edu.au. I don’t do this alone, there’s a group of members and fellow travellers who generously vet these podcasts before you hear them. I’ve credited them all by name at the podcast website, as well as the musical artists who composed the great tunes you’ve heard, all licenced from Epidemic Sound. This podcast was produced by the waters of the Gadigal people of the Eora nation. I pay respect to their storytellers who came long before me. I’m Mic Cavazzini, thanks so much for listening.


