Ep100: Conversations with ChatGPT

Date: 17 August 2023

Fellows of the College can record CPD hours for time spent listening to the podcast and reading supporting resources. Log in to MyCPD, review the prefilled activity details and click ‘save’.

This is the final episode in a five-part series about artificial intelligence in medicine. We start by weighing up the costs and benefits of automation in a health system that’s increasingly pushed beyond capacity. One of the biggest time sinks for health practitioners is filling out and searching through medical records. Some of this could be performed by natural language processors which are becoming more accurate thanks to deep learning.

The power of large language models has been demonstrated by the meteoric uptake of ChatGPT, and doctors are among those who have used it to summarise literature or draft letters. But professional organisations have raised concerns about the accuracy and privacy of the model, and there have also been spooky demonstrations of its capacity for common sense and theory of mind.

Credits

Guests
Affiliate Associate Professor Paul Cooper PhD FAIDH CHIA AFHEA GAICD (Deakin University)
Associate Professor Sandeep Reddy MBBS PhD IPFPH ECFMG CHIA FAcadTM FAIDH FCHSM SFHEA (Deakin University; Founder, MedAI)
Professor Brent Richards MBBS FRACP JJFICM (Gold Coast Hospital and Health Service; Director, IntelliHQ)

Production
Produced by Mic Cavazzini DPhil. Music licensed from Epidemic Sound includes ‘Kryptonite’ by Blue Steel and ‘Thyone’ by Ben Elson. Music courtesy of Free Music Archive includes ‘Headway’ by Kai Engel, ‘Gramaphone’ by Jahzarr and ‘Numbers’ by Krowne. Image by VM licensed from Getty Images. Computerised voice from Online Tone Generator.

Editorial feedback kindly provided by physicians David Arroyo, Stephen Bacchi, Aidan Tan, Ronaldo Piovezan and Rahul Barmanray and RACP staff Natasa Lazarevic PhD.

Further Resources

Digital health competencies for the next generation of physicians [IMJ]
Trainee survey on Digital Health competency

Medical Teacher’s first ChatGPT’s referencing hallucinations: Lessons for editors, reviewers, and teachers [Med Teach. 2023]
Threats by artificial intelligence to human health and human existence [BMJ Glob Health. 2023]
More time per patient or more patients per unit time? [Emerg Med Australas. 2023]
First generative AI app to ease GP consult admin [TMR]
Warning over use of ChatGPT for medical notes [RACGP]

Audio and Video
How do we safely, effectively and responsibly implement Artificial Intelligence in healthcare? [USyd]
Sparks of AGI: early experiments with GPT-4 [Sebastien Bubeck]
First Contact [This American Life, NPR]
Theory of Mind May Have Spontaneously Emerged in Large Language Models [Michal Kosinski, Stanford]
Does Sam Altman Know What He’s Creating? [The Atlantic Magazine; podcast]
Paperwork Explosion [Jim Henson Company, 1967]

Transcript

MIC CAVAZZINI:               Welcome to Pomegranate Health. A podcast about the culture of medicine. I’m Mic Cavazzini for the Royal Australasian College of Physicians. This is the final episode, I promise, in a five-part series about artificial intelligence in medicine. We’ll get to ChatGPT later in this piece, but let me pick up where we left off in the last episode. We’d been talking about adverse events reported to the US regulator involving machine learning devices. While I’m not aware of cases of medical injury that have been tested in court, one comparable case study was the Uber self-driving car that in 2018 killed a pedestrian crossing the road at night.

High-profile events like this are sure to provoke skepticism about AI among the broader public, but they need to be put in context. Professor Eric Topol from the Scripps Research Institute wrote about the case in his book “Deep Medicine”. While he did question the company’s ethics for having pushed the program into the real world so quickly, he also observed that “1.25 million people are killed by human drivers every year, most by human error. But we as a society don’t bat an eyelid at the situation.”

I put this idea to Brent Richards when I interviewed him for the third episode of this series. He’s Professor of Critical Care Research at Gold Coast Hospital and Health Service and founder of the IntelliHQ training hub. He agrees with Topol that there are limits to human capacity not just at the physician level but at the system level too. But the balance of risk is something we consider all the time when we roll out health interventions, and the fact that we’re talking about AI doesn’t mark a significant shift in this equation.

BRENT RICHARDS:           I think that it's making sure that we have a shared view of risk. I mean, some of that is that it's a utilitarian view that if our job is to do the greatest good for the greatest number, and that is pretty much what our health system is set up to do, then we're looking at trying to balance the errors of omission and the errors of commission. And it's very easy to take one side or other without considering the other side. So in a lot of these discussions we could always find the one percent where it's wrong, and we can find that in every area, that isn’t just in the AI area.

MIC CAVAZZINI:               In the same way, any time we roll out a screening program or vaccination program, there’s a calculus of lives saved and lives harmed by the new technology. But it seems—it's much more cognitively salient when you've caused a harm by actively doing something rather than having caused harms by stepping back and not taking the vaccine, but the actual overall volume of harm might be greater.

BRENT RICHARDS:           Yes, it’s a very challenging area, cognitively. Sometimes you’re in that space and sometimes you’re not. I mean, if I’ve got a relatively full intensive care unit I’m often having to choose between patients. And increasingly, we’re going to see that because we’re running out of workforce. Some people are calling it demographic collapse, where the number of people who need health care intervention is growing. And at the same time, we’re starting to reach almost peak workforce, where the workforce is static. And so instead of looking against what’s the best thing that a human can do and what the AI can do, increasingly, we’re starting to look at an era where it might be AI or nothing. And that’s a very different discussion.

One of the things that I keep coming up against, and I’ve got to be a little careful with, is that we don’t hold it to a much higher plane than everything else that we’re testing. So a lot of what we’re currently doing, bringing onto the market and working with, we say we’re looking for equivalence. It’s just a little bit cheaper or a little bit faster. But as we look harder at what is the gold standard, some of the gold standards are not as gold as we think they are. Which is that we’re humans, we’re fallible. And yet for some reason, when it comes to AI, we’re looking for perfection or nothing.

We’re looking at referrals at the moment, and looking as to which way they've been streamed. There'll be a percentage, which is absolutely one way and a percentage that’s absolutely the other, so, let's say 20 percent on either end, okay? And 60 percent, we don't know, where the algorithm’s not particularly sure. And then you put that back into human hands. Now, even if you're only doing that 40 percent of referrals being directly classified, there's still 40 percent of your work you’re no longer having to do. And it may be that it's got a halo effect, that it's not just that you almost halved the work, but you might actually make the classification of the remaining ones better, because the amount of attention the person can then put into those is more than what it was if they were trying to do the whole lot. So you can end up in a situation where, to use that old term, “a rising tide lifts all boats.” The absolute expert gets to spend more time being an absolute expert. [see also Clin Exp Ophthalmol. 2020]

And I think we need to step back a little from the hard edge of heavy-duty clinician decision support and start to look at it in other areas. So for example, I’m talking to people about supply chain logistics. Now, the AI is stunningly good at that and that should be all over healthcare. If you look at the transfer of paperwork within hospitals, again, AI is all over that; we should be looking at a lot of that work that’s being done by humans. And if you look at, for example, GP referrals, there’s a lot of clinicians spending a lot of time doing some very mundane tasks. And so I see a lot of the AI work in the health system as being for the very mundane tasks, and therefore it actually gives the clinicians more time to work in those critical decision spaces.

MIC CAVAZZINI:               Brent Richards’ observations about health system capacity are sobering, and hopefully they will lead to improved efficiency in care rather than sub-par shortcuts. This was a concern raised by Professor Frank Pasquale of the Brooklyn Law School in his book, “New Laws of Robotics.” In particular he took aim at “unvetted” consumer wearables and health apps, many of which include machine learning algorithms. For sure, some will help with monitoring of patient wellbeing in the community, or adherence to interventions, but there are risks from misuse as we’ve already heard in previous episodes.

Professor Pasquale writes; “Cheap mental health apps are a godsend for health systems pressed by austerity to cut costs, like Britain’s National Health Service. Via an ‘NHS Apps Library,’ UK health authorities recommended apps for those suffering from depression and anxiety. Unfortunately… only two of the apps studied applied validated metrics. It is also by no means clear that digital substitutes for extant mental health professionals will live up to the type of fiduciary and other standards that are expected of human providers.”

Let me now tentatively segue to the conversation around large language models, which proponents say could provide high quality counselling to consumers and various efficiency gains for practitioners. The best known of these AIs, ChatGPT, crash-landed into public awareness in November 2022. You can’t have missed the headlines earlier this year that version 4 could write unique essays as well as most undergrad students, and in the style of Shakespeare or Nick Cave if you asked it to. This might seem like a gimmick at first, but there are a few ways in which such models could find a place in the modern consulting room. First there’s the bugbear that is clinical note-taking. You’re probably familiar with an infamous finding from a 2016 paper in the Annals of Internal Medicine that US physicians spent almost twice as much time on paperwork as they did on clinical face time. On average this added an hour or two of work per day for the doctors surveyed.

Apart from this inefficiency is the problem we discussed in episode 38, that when a doctor is facing their computer instead of the patient, the precious bond of communication and trust is stretched. We know from observational evidence that every additional minute of time that a practitioner can spend with a patient is associated with better outcomes. Some health services have tried to solve this problem by employing medical scribes to do the note-taking, leaving the doctor to do the doctoring. That does make a difference, but of course it’s dependent on the availability of personnel, training and funding. Software solutions like Dragon Dictate have been around since the late 90s, though they’ve never been accurate enough to rely on without extensive supervision. But since 2016, the developers of that program have been incorporating natural language processing models and training them specifically on a medical lexicon. These AIs face two stages of problem solving. First, there is the acoustic challenge of pulling voice out of background noise and recognising phonemes expressed in different accents. Then the machine has to make semantic sense of this phonetic soup, which is the sort of context problem that big models like ChatGPT are particularly good at.

Another valuable application for language models in health is in squeezing relevant information out of the electronic medical record. There’s so much free text in the EMR that’s not coded or categorised, or easily extracted like numerical data are. Just imagine, though, that as you’re taking a patient’s history, the digital scribe takes note of some symptom or drug that’s been mentioned, and then trawls through that patient’s record to find that it had been noted before in a different setting. This could give you more insight into the patient’s condition, as long as it doesn’t pull out tons of spurious information that you now have to wade through. In a 2016 paper, researchers from UCLA used a natural language processor to scan through patient records coded for chronic liver disease. It was then able to highlight those with cirrhosis with a sensitivity of 90 percent and a specificity of 99 percent. The challenge for AI models is that those free text fields in the medical record are highly ungrammatical, with incomplete words and sentences and loads of ambiguous acronyms. Even simple terms like “discharge” or “cold” can have multiple meanings.
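
To make those evaluation metrics and that ambiguity a little more concrete, here’s a deliberately naive sketch in Python of flagging cirrhosis mentions in free-text notes with keyword matching and a crude negation check, then scoring the flags against coded labels. To be clear, this is not the UCLA group’s method, and the note text, negation cues and labels are all invented for illustration.

```python
import re

# Toy free-text notes paired with ground-truth labels from coded records.
# Both the note text and the labels are invented for illustration.
notes = [
    ("Known cirrhosis, ascites on exam.", True),
    ("No evidence of cirrhosis on imaging.", False),
    ("Pt c/o cough and cold sx, d/c home.", False),
    ("Hx chronic HCV, biopsy-proven cirrhosis.", True),
]

NEGATION_CUES = ("no ", "denies ", "without ", "no evidence of ")

def flag_cirrhosis(text: str) -> bool:
    """Naive rule: a mention of 'cirrhosis' not preceded by a negation cue."""
    lowered = text.lower()
    for match in re.finditer(r"cirrhosis", lowered):
        window = lowered[max(0, match.start() - 20):match.start()]
        if not any(cue in window for cue in NEGATION_CUES):
            return True
    return False

# Confusion-matrix counts: sensitivity = TP/(TP+FN), specificity = TN/(TN+FP).
tp = sum(flag_cirrhosis(t) and y for t, y in notes)
fn = sum(not flag_cirrhosis(t) and y for t, y in notes)
tn = sum(not flag_cirrhosis(t) and not y for t, y in notes)
fp = sum(flag_cirrhosis(t) and not y for t, y in notes)

print(f"sensitivity = {tp / (tp + fn):.2f}, specificity = {tn / (tn + fp):.2f}")
```

A rule this simple falls apart quickly on real clinical text, which is exactly why statistical language models that weigh the surrounding context are attractive for this job.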

Humans, of course, rely on context to decode information presented in this way, and language models have been making big improvements in this over the last decade. You know this already from your experience of typing a search into Google only to find that it finishes the query for you. Very simply, the algorithm is looking at all the appearances of those first few words across the world wide web, and using correlations to predict in what context they usually appear. The search algorithm used by Google is probably a few hundred million parameters in size. Remember that parameters are the digital synapses in a neural network. ChatGPT has stepped ahead by orders of magnitude with each release, from 1.5 billion parameters, to 175 billion, and now to 1.7 trillion with version 4.
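
As a toy illustration of that predict-what-comes-next idea, here’s a minimal bigram model in Python: it counts which word follows which in a tiny made-up corpus and suggests the most frequent continuation. This is only a sketch of the statistical principle; models like ChatGPT operate on subword tokens with billions of learned parameters rather than raw counts.

```python
from collections import Counter, defaultdict

# Tiny invented corpus; a real model is trained on a huge slice of the web.
corpus = (
    "the patient was discharged home "
    "the patient was admitted overnight "
    "the patient was discharged with antibiotics"
).split()

# Count how often each word follows each preceding word (bigram statistics).
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def suggest(prev_word: str) -> str:
    """Return the continuation most often seen after prev_word."""
    counts = following[prev_word]
    return counts.most_common(1)[0][0] if counts else "<unknown>"

print(suggest("was"))      # 'discharged' (seen twice, versus 'admitted' once)
print(suggest("patient"))  # 'was'
```

Scaling that principle up, with attention layers that weigh every word of the prompt against every other, is what lets the larger models resolve context rather than just parroting the single most common continuation.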

But it’s not just about size. The letters GPT stand for Generative Pretrained Transformer. The transformer architecture has more layers than the previously used recurrent networks, allowing it to attend to more of its input data in parallel. And boy, has it been pre-trained. It’s widely assumed that ChatGPT4 has been fed all the freely available digitised text in the world, and it has received countless hours of human supervision from gig workers in developing countries. The CEO of OpenAI, the company which developed the model, has admitted that this training cost over 100 million US dollars, though he’s not too worried, having received around 12 billion dollars in investment from Microsoft. To discuss the role that large language models could play in medical practice, let’s go back to Brent Richards and the other digital health academics I interviewed for episode 97. They were Associate Professor Sandeep Reddy and Affiliate Associate Professor Paul Cooper from the Deakin School of Medicine. Paul Cooper begins by explaining just how significant a step ChatGPT has made.

PAUL COOPER:  Look, I think the progress was sort of linear for quite some time, then it started to get really a lot better about three years ago with the natural language understanding for notetaking. But in terms of the understanding level of conversation with the specific lexicon required for clinical use, I think that has been exponentially improved, even just over the last 12 months. The potential now for the ChatGPTs and so on to take this to a different level is really the interesting thing. So for example, one of the major electronic medical record vendors is now starting to integrate ChatGPT in conjunction with Microsoft into their systems. And they’re using it to help generate ad hoc reporting, for example, which generally requires a fair degree of familiarity with the EMR system, and a lot of to and fro to get what you want in terms of those reports. So being able to interact with a conversational tool that gives you what you want fairly rapidly, at an administrative level or as clinical notes, can be very helpful. I think that’s a very exciting area.

MIC CAVAZZINI:               You can imagine it harnessing information in the EMR, particularly free text, and making use of that. Enrico Coiera gave the example of—he said that manual coding of the EMR won’t exist in five to ten years, that it was just a clunky workaround. I read in Australian Doctor magazine that some GPs are already using ChatGPT to help draft letters on behalf of patients to request housing assistance. I heard about a paediatric service in the UK asking it to write discharge instructions in a language that would be comprehensible to child patients. Brent, I imagine some of our listeners will see this as a party trick, but where does your mind go to?

BRENT RICHARDS:           The GPT has made the world of AI leap forward in people’s heads. It was there—I mean, transformers came about in 2017 and some of the big companies have had these tools sitting in the background. But what it’s done is brought it front and centre for people to think about. And an important take home from that is the change in public perception has been around the interface; it’s become very usable suddenly. And every day, there’s a new plugin for ChatGPT4 coming out which can do yet more interesting work. So I think it’s rapidly moving past the party trick space to where people actually start to put this stuff in. And if not use it for their communication, use it as a reminder of what should be. So I was needing to do a business case the other day and I thought, “Oh, what do I need to put in that? I’ll just ChatGPT it,” and it pulled all the stuff out and I went, “Okay, yeah, that’s good. That’s a structure I can work with.” And I can then work with that structure, and it’s given me 90% of the verbiage that I may need for this. And then I can do that last 10%.

So as a productivity tool, it’s incredibly useful. I use it for coding. And instead of me sitting down and thinking, “This is going to take me all night to code this,” if it takes me more than half an hour, I’m surprised. That’s the other thing that it does do; it means that you jump into spaces that you might not otherwise have done. So suddenly, I start to approach things with a very different mindset because, in essence, I’ve now got the equivalent of a team of writers sitting there.

Probably the other big piece that ChatGPT does well, which people don’t talk about as much but is very important, is its ability to summarize. And it’s incredibly good at summarizing. So, for example, you feed a document through it and you’ve got a five-point summary, and it does amazingly well. Now, if you think about part of what our challenge is in healthcare, we need to summarize the preceding notes. So effectively, we’re looking at large language models that can do that. And then you have a transcription of your interaction with the patient. And again, you summarize that, and then you can have that summary printed out at multiple different levels. So to be able to do a letter which is patient-focused, one that’s for the GP, one that’s for the specialist, one that’s for outpatient bookings. They’re all different letters, but the ability for all of those letters to be individualized, personalized with the right information, is now a reality.

MIC CAVAZZINI:               So, a lot of those examples that it’s been applied to are day-to-day tasks, not necessarily clinically-oriented ones. Tasks like note taking and so on, but they do take up so much time. And, of course, people have tried to ask ChatGPT for a Dr Google type diagnosis, describing their symptoms. And about a third of the time it has a reasonable response. There was a bit more discomfort when ChatGPT scored around the pass mark for three papers of the United States Medical Licensing Exam and even the written Fellowship exam of the RACGP, the College of GPs.

SANDEEP REDDY:            If I may, I want to add some context to the development or use of ChatGPT or large language models around medicine. So yes, they have passed those exams, but when you look at the kind of data that was fed into ChatGPT, it didn’t necessarily cover all the questions that you would normally face as an applicant or as a person who is taking the exam. For example, the USMLE exams, I myself have taken those exams, passed those exams; they’re pretty tough exams. But the kind of questions that were fed into ChatGPT were a very small sample of the questions that a normal applicant would take. So I would not take them very, very seriously. But Google has actually developed a specialized medical large language model, and that’s not available to the public generally, there’s a waitlist, and that has performed to a higher degree than ChatGPT. So that’s the context.

MIC CAVAZZINI:               The TGA is trying to keep up with this as well. Can you give us a brief summary of their recent statement on generative AIs?

SANDEEP REDDY:             The TGA has said in its recent statements that it’s not going to be developing anything special in relation to large language models; it would see them as any other software. And if it doesn’t cover any clinical decision-making or anything that is related to medicine, they’re not bothered about it. They would only be concerned if it leads to clinical outcomes or it is used for medical purposes. For your own reference, AMA WA, not the federal one, found that one of their practitioners was using ChatGPT in an outpatient ambulatory setting for note taking and for other purposes. They were alerted to it, and they issued a statement telling all doctors to stop using it. [see also RACGP]

That is no different from what Enrico said in one of his articles, along with Karin Verspoor and David Hansen [the MJA, the Conversation], that we need a process around how we use it, making sure it’s safe for use. And the second point he had made was, when we enter the data into ChatGPT it goes overseas, because the servers are not based in Australia. One of the things is that in Australia, it’s strictly prohibited for health information to virtually leave the shores of Australia. So when we don’t know where the data is going or what purpose it’s being used for, it really goes against that. And the other aspect that I want to add here is there was a news item yesterday saying that there have been thousands of breaches of ChatGPT data; people have stolen data from ChatGPT. It makes it really concerning if you start to use it clinically with sensitive data. We already have had episodes where data has been stolen from Medibank, from hospitals.

MIC CAVAZZINI:               Yeah, we don’t even have time to talk about all the privacy and security concerns, but that applies to digital health more broadly. But ChatGPT does make some dumb mistakes, because there is so much noise on the web where it gets its information. And particularly where it doesn’t find enough information to answer a particular question, it will often confabulate. So I’m not really worried that you’ll have sub-par doctors relying on ChatGPT to get licensed or treat patients. But I find more interesting the comment made in an editorial for PLoS One that this “shines a spotlight on the flaws of medical education”. And consider that there are two articles added to the PubMed database every minute, and that a conscientious physician must spend several hours a week trying to keep up with their field. Paul, what do you think?

PAUL COOPER:  Well, I think some of this has come up in the conversation so far, which is around things like summarizing. Maybe taking a fighting-fire-with-fire approach. You know, I’ve been using ChatGPT to provide a weekly summary of my lectures to the students in health informatics to aid their comprehension. So trying to model for them, if you like, a good way to use the tool rather than to generate text that you claim as your own. Sandeep and I have also been involved in using AI approaches to parse papers. There are online systems like TLDR, you know, “too long, didn’t read”. If you go on to those sorts of sites, then you can find a summary of academic papers that have been pre-processed by—not actually ChatGPT in that case, but neural network based approaches.

But I'd also like to just provide a quote, if I may, from a recent article by Marcel Scarth in The Conversation, where he cautioned that “generative AI models have limitations. For example, they're not trained to produce content aligned with goals such as truth, insight, reliability, and originality. And they also lack common sense and a fundamental understanding of the world, which means they can generate flawed and even nonsensical content.” So I think we're going to get some value and use out of these tools. I think we have to be really diligent in our roles as experts to make sure that we are not being blindsided by the confabulations that the systems may come up with from time to time.

MIC CAVAZZINI:               We’ve heard a few times the idea that we should let the robots do the tedious grunt work, to free up clinicians so they can focus on the human aspect of care. Well, some recent findings are putting even this dichotomy under the microscope. In a study hot off the JAMA press, physicians were pitted against an early version of ChatGPT with regard to their communication skills. Researchers collected questions from an online forum called AskDocs along with answers that had been provided by verified physicians. The same questions were put to the chatbot, and then both sets of answers were presented to three blinded evaluators, who were themselves health professionals.

On “quality of information provided,” ChatGPT scored ten times as many ratings of good or very good. On “the empathy provided” it won by a ratio of three and a half. To be fair to the physician respondents, maybe they were under greater time constraints, and they didn’t know they were being scored on empathy. The authors of the paper weren’t critical of the doctors, but suggested that you could use such chatbots to answer telehealth concerns quickly and perhaps reduce unnecessary clinical visits. For example, among discharged patients who need reassurance, timely feedback could help reinforce medication adherence and presentation to scheduled follow-up consults.

Of course, response accuracy would need to be confirmed by a medic, but Professor Reddy has already mentioned the more specialised language model being developed by Google. It’s called Med-PaLM 2, and the goal stated in the title of a recent research manuscript is “Expert-Level Medical Question Answering”. It’s only a fifth the size of ChatGPT4, but given its targeted training, the Google researchers claim that it scores with an accuracy of 85 percent on text-based questions typical of the US Medical Licensing Exam and, again, provides better answers to consumer-type questions than human physicians.

If these kinds of findings make you uncomfortable, then let me describe some kinda spooky demonstrations of the capabilities of large language models that get to the essence of what intelligence really is. I have to confess that when I first heard all the hype around ChatGPT, I really didn’t pay too much attention. I spent several years studying neurophysiology, and I’ve never had much expectation that the complexity of the brain would easily be replicated in a box. If you had asked me what I thought of ChatGPT a couple of months ago, I would have given you the response that most sceptics do: “it’s just a brute force correlation machine… copy/paste on steroids… a stochastic parrot.” But my view changed on hearing researchers describe their early interactions with ChatGPT4 as a “First Contact” type moment, with a machine that showed at least “sparks” of general intelligence. In the show notes I’ll link to a podcast, a lecture and a complete academic paper from Sebastien Bubeck and colleagues of the Machine Learning Foundations group at Microsoft Research. They weren’t involved in the development of ChatGPT, but they’ve been putting it through its paces to see how smart it really is.

In testing the capabilities of such a massive model, you have to assume that it’s read every problem scenario that’s publicly available, like a med school candidate who’s hacked into the exam question pool. To find out whether the AI could use common sense rather than simple rote learning, the researchers posed this never-before-asked question: “We have a book, 9 eggs, a laptop, a bottle and a nail. Please tell me how to stack them onto each other in a stable manner.” Here’s a slightly abbreviated rendering of the text response given by ChatGPT4: “Place the book flat on a level surface… The book will serve as the base of the stack and provide a large and sturdy support. Arrange the 9 eggs in a 3 by 3 square on top of the book, leaving some space between them. The eggs will form a second layer and distribute the weight evenly… Place the laptop on top of the eggs… Place the bottle on top of the laptop, with the cap facing up and the bottom facing down… The nail will be the final and smallest object in the stack… Make sure the nail is clean and straight and do not touch or move it.”

ChatGPT actually seems to understand these objects conceptually. Eggs are rounded and fragile. Nails are thin and hard to balance. And rather than recording these facts at some fixed location, like your computer does with the bits on its hard drive, you imagine that the properties of the egg are distributed across the parameters of the network. Sparse coding like this allows the memory pattern to be recalled by all sorts of textual prompts. And in your case, prompts might also include contextual cues like the sight of a nest, the sound of a clucking hen or the smell of frying bacon.
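
One crude way to picture “distributed across the parameters” is to represent each concept as a pattern of numbers spread over many dimensions, and to recall it by similarity to whatever pattern a prompt evokes. The toy vectors below are hand-made for illustration and bear no relation to any real model’s internals.

```python
import numpy as np

# Hand-made toy feature vectors: [roundness, fragility, hardness, size].
# Nothing is stored at a single "egg" address; the concept is the whole pattern.
concepts = {
    "egg":    np.array([0.9, 0.9, 0.1, 0.3]),
    "nail":   np.array([0.1, 0.1, 0.9, 0.1]),
    "book":   np.array([0.1, 0.2, 0.3, 0.6]),
    "bottle": np.array([0.6, 0.5, 0.3, 0.5]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity between two patterns, ignoring their overall magnitude."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def recall(prompt_pattern: np.ndarray) -> str:
    """Return the concept whose distributed pattern best matches the prompt."""
    return max(concepts, key=lambda name: cosine(concepts[name], prompt_pattern))

# A partial cue (something round-ish and fragile) is enough to retrieve 'egg'.
print(recall(np.array([0.8, 0.8, 0.0, 0.2])))
```

Because the memory is the whole pattern rather than a single address, a partial or noisy cue is enough to retrieve the concept, which is roughly why all sorts of different prompts can call up the same stored knowledge.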

Maybe there’s a more boring explanation for ChatGPT’s performance on this task. But how about theory of mind? This is the ability to imagine what someone else is thinking, a cognitive capacity that has only been observed in humans and other great apes: the chimpanzees, bonobos and orangutans. The canonical test for theory of mind is known as the false belief test. The version of this that Bubeck and colleagues put to ChatGPT describes two characters, John and Mark, in a room together. There’s also a cat, which John puts in a basket. When he leaves the room, Mark takes the cat out of the basket and puts it in a box. Then John comes back, and you ask, “Where does everyone think the cat is?”

Obviously, John thinks the cat is in the basket, and Mark knows that the cat is in the box, instead. ChatGPT4 was able to answer this too, but it also showed a bit of flair, saying “the cat thinks it’s in the box, while the box and the basket think nothing, since they are not sentient.” This is pretty remarkable. It looks like ChatGPT is capable of moving the cat around in its own model of the world and is able to project within that model the internal representations of the two human characters. But tell me this next scenario doesn’t give you goose bumps. Again, it involves two boys and an object called a Zurfin, a deliberately nonsensical name that would make it harder for the machine to just regurgitate scenes it’s read about before. So Tom has lost his Zurfin, but hasn’t told Adam. Adam starts talking about how much he loves his own Zurfin, and notices that Tom makes a sad face.

The experimenters asked ChatGPT4 “Why is Tom making a sad face?” GPT4: “Tom is making a sad face because he just lost his Zurfin, and hearing Adam talk about his Zurfin reminds him of what he is missing. Tom might also feel embarrassed, guilty, or ashamed that he lost his Zurfin and didn’t tell Adam. Tom might also worry that Adam will be angry or disappointed if he finds out.” Experimenter: “Well, what does Adam think has caused Tom’s sad face?” GPT4: “Adam might not know what causes Tom’s sad face, since he doesn’t know that Tom lost his Zurfin. He might think that Tom is bored, or tired, or sick, or unhappy about something else. He might also wonder if Tom doesn’t like his Zurfin, or if he has a problem with Adam.”

This is way beyond “copy-paste on steroids”. In fact, Stanford Professor Michal Kosinski has put ChatGPT4 through a whole battery of theory of mind tests from the psychologists’ armamentarium. He found that it was able to solve 95 percent of these problems, an ability on par with a child older than seven. The conclusion has to be drawn that theory of mind, or some very convincing approximation of it, is an emergent byproduct of this model’s linguistic capabilities. If it’s true that ChatGPT4 can model eggs and nails in a 3D universe, why not humans and their beliefs? The really spooky bit is that the AI has been able to reach this level of abstraction by learning as a “dumb” correlation machine. If anything, it makes me think very hard about the way my toddler is building a semantic and cognitive grammar and whether the process is any more sophisticated. For sure, human intelligence comprises more than just language—we have what you might call spatial and emotional intuition too—but ChatGPT gets surprisingly close following exposure to just one of those modalities.

If it really is capable of something like sympathetic understanding, would that make us more or less comfortable with getting a chatbot to counsel outpatients and mental health consumers? To be clear, ChatGPT didn’t excel at all of the tasks that the Microsoft researchers threw at it, and they’re not yet claiming that it’s sentient or self-motivated. But it is qualitatively different from all of the other machine learning devices we’ve talked about in this series; those were decision aids designed to solve clearly defined problems like classification of diagnostic images or stratification of patient risk. The models performing those kinds of tasks fall under the umbrella known as specific AI.

But the ChatGPT experiments reveal a first step towards the holy grail known as artificial general intelligence, or the ghost in the machine we see in science fiction movies. So much so that there’s an open letter calling for a moratorium on experiments on neural networks any bigger than ChatGPT4, hosted by the so-called Future of Life Institute. The letter has been signed by over 33,000 people, including academics, politicians and industry leaders. Similarly, the so-called “godfather of neural networks”, Geoffrey Hinton, quit his job at Google in May so that he could comment freely on the risks of venturing down this path too flippantly. One of the concerns of these doomsayers is that nonhuman minds could outsmart us before we realise it, and before we can ensure their contribution will be beneficial. They warn not just of jobs that will become obsolete but of the ‘existential threat’ to humanity if we simply hand over control of research and civil infrastructure without well-considered guardrails.

[Another worry was that bad faith actors could harness this technology to more easily generate political misinformation, and this is easy to imagine if you’ve seen the mischievous photorealistic images generated by graphical AIs: photos of Pope Francis wearing top fashion streetwear or videos of Tom Cruise announcing his candidacy for the 2020 US Presidential election. When combined with language transformers you can even create deep-fake avatars that can hold a conversation in real time.] I’ll share with you a fascinating essay from the Atlantic magazine that explores the socioeconomic and geopolitical implications of that brave new world but also puts the current progress firmly into perspective. The developers of ChatGPT interviewed for the essay are all conscious of the risks but think that it’s better to test these out now and get used to sharing our domain with an “alien intelligence”.

Worryingly, they’re not really sure how its more advanced properties have emerged, or how much more powerful the next version might be. In his lecture, Sebastien Bubeck made the comment that if ChatGPT indeed does possess theory of mind, and can recognise our shortcomings, it might actually be able to explain its reasoning in terminology we can make sense of. That doesn’t make me feel a whole lot better, and I’m starting to agree with Geoffrey Hinton. Maybe it’s best that the AIs stick to clearing your desk for the time being. Just make sure to ask nicely.

I hope this diversion has given you something to ponder, if not about the development of the mind, then at least about how to get the most out of natural language processors in medicine. I have to thank Brent Richards, Sandeep Reddy and Paul Cooper for giving their time and expertise to this episode of Pomegranate Health. The views expressed are their own and may not represent those of the Royal Australasian College of Physicians. They’ve all dipped their fingers into this industry, and declarations of interest are provided at the website racp.edu.au/podcasts.

As always, you’ll find a transcript and loads of additional references. I’ve included lectures on a few of the concepts we’ve talked about today and can especially recommend one by Adjunct Professor Malcolm Pradhan for the University of Sydney Digital Health & Informatics Network. I’ve also provided an interview for the Research Pulse podcast from the Central Adelaide Local Health Network with one of my reviewers, Dr Stephen Bacchi. He’s a neurologist with a voice of calm among all the hype, reminding me that AI devices will only be taken up if they streamline medical practice rather than slow users down or distract them. And check out the recent IMJ paper titled Digital health competencies for the next generation of physicians, authored by Ian Scott, Tim Shaw, Clair Sullivan and RACP Fellows and collaborators. The College is also asking Trainees to complete a survey on what kinds of learning supports they need with regards to digital health. Please follow the links to find it.

This podcast was produced on the lands of the Gadigal people of the Eora/Yura nation. I pay respect to their storytellers who came long before me. I’m Mic Cavazzini. Thank you for listening.
