The patient was a 39-year-old woman who presented to the emergency department at Beth Israel Deaconess Medical Center in Boston. Her left knee had been hurting for several days. The day before, she had had a fever of 102 degrees. It was gone, but she still had chills. And her knee was red and swollen.
What was the diagnosis?
On a recent hot Friday, medical resident Dr. Megan Landon brought this real case to a room full of medical students and residents. They came together to learn a skill that can be devilishly difficult to teach: how to think like a doctor.
“Doctors are terrible at teaching other doctors how we think,” said Dr. Adam Rodman, an internist and medical historian at Beth Israel Deaconess who organized the event.
But this time, they could ask for help from an expert to come up with a diagnosis: GPT-4, the latest version of the chatbot released by OpenAI.
Artificial intelligence is transforming many aspects of the practice of medicine, and some medical professionals are using these tools to help with diagnosis. Doctors at Beth Israel Deaconess, a teaching hospital affiliated with Harvard Medical School, decided to explore how chatbots could be used—and misused—in training future doctors.
Instructors like Dr. Rodman hope that medical students can turn to GPT-4 and other chatbots for something similar to what doctors call a curbside consultation: when they call a colleague aside and ask for an opinion on a difficult case. The idea is to use a chatbot in the same way that doctors consult each other for suggestions and insights.
For more than a century, doctors have been portrayed as detectives collecting clues and using them to find the culprit. But experienced doctors actually use a different method, pattern recognition, to figure out what’s wrong. In medicine, it’s called an illness script: signs, symptoms, and test results that doctors piece together to tell a coherent story based on similar cases they know of or have seen themselves.
If the illness script doesn’t help, Dr. Rodman said, doctors turn to other strategies, such as assigning probabilities to various diagnoses that might fit.
Researchers have tried for more than half a century to design computer programs to make medical diagnoses, but none has really succeeded.
Doctors say that GPT-4 is different. “It will create something that is remarkably similar to an illness script,” Dr. Rodman said. In that way, he added, “it’s fundamentally different from a search engine.”
Dr. Rodman and other doctors at Beth Israel Deaconess have asked GPT-4 for possible diagnoses in difficult cases. In a study published last month in the medical journal JAMA, they found that it did better than most doctors on weekly diagnostic challenges published in The New England Journal of Medicine.
But, they learned, there is an art to using the program, and there are pitfalls.
Dr. Christopher Smith, director of the medical center’s internal medicine residency program, said medical students and residents are “definitely using it.” But, he added, “whether they are learning anything is an open question.”
The concern is that they could trust AI to make diagnoses in the same way they would trust a calculator on their phones to solve a math problem. That, said Dr. Smith, is dangerous.
Learning, he said, involves trying to figure things out: “This is how we retain things. Part of learning is the struggle. If you outsource the learning to GPT, that struggle is gone.”
At the meeting, students and residents divided into groups and tried to find out what was wrong with the patient with the swollen knee. Then they turned to GPT-4.
The groups tried different approaches.
One used GPT-4 to do an internet search, similar to the way one would use Google. The chatbot spat out a list of possible diagnoses, including trauma. But when the group members asked it to explain its reasoning, the bot was disappointing, justifying its choice by saying, “Trauma is a common cause of knee injury.”
Another group thought of possible hypotheses and asked GPT-4 to review them. The chatbot’s list matched the group’s: infections, including Lyme disease; arthritis, including gout, a type of arthritis that involves crystals in the joints; and trauma.
GPT-4 added rheumatoid arthritis to the top possibilities, although it was not high on the group’s list. The instructors later told the group that gout was unlikely for this patient because she was young and female. And rheumatoid arthritis could probably be ruled out because only one joint was swollen and only for a couple of days.
As a curbside consult, GPT-4 seemed to pass the test, or at least agree with the students and residents. But in this exercise, it offered no insights and no illness script.
One reason could be that the students and residents used the bot more like a search engine than like a curbside consult.
To use the bot correctly, the instructors said, they would have to start by telling GPT-4 something like: “You are a doctor seeing a 39-year-old woman with knee pain.” They would then need to list her symptoms before requesting a diagnosis, and follow up with questions about the bot’s reasoning, just as they would with a medical colleague.
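For readers curious what such a prompt might look like in practice, here is a minimal sketch using OpenAI’s Python SDK. The role-setting line, the symptom summary and the follow-up question are illustrative assumptions based on the instructors’ description, not the exact prompts used in the session.

```python
# A minimal sketch of the "curbside consult" style of prompting the
# instructors described: set a role, lay out the symptoms, ask for a
# differential diagnosis, then question the reasoning.
# Assumes the official OpenAI Python SDK and an API key in the
# OPENAI_API_KEY environment variable; the model name and prompt
# wording are illustrative.
from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "system",
     "content": "You are a doctor seeing a 39-year-old woman with knee pain."},
    {"role": "user",
     "content": ("Her left knee has been red, swollen and painful for several "
                 "days. Yesterday she had a fever of 102 degrees, which has "
                 "resolved, but she still has chills. What is your "
                 "differential diagnosis?")},
]

reply = client.chat.completions.create(model="gpt-4", messages=messages)
answer = reply.choices[0].message.content
print(answer)

# Follow up on the bot's reasoning, as one would with a medical colleague.
messages.append({"role": "assistant", "content": answer})
messages.append({"role": "user",
                 "content": "Walk me through the reasoning behind your top possibilities."})
follow_up = client.chat.completions.create(model="gpt-4", messages=messages)
print(follow_up.choices[0].message.content)
```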
That, the instructors said, is one way to harness the power of GPT-4. But it’s also crucial to recognize that chatbots can make mistakes and “hallucinate” – provide answers with no basis in fact. Using one requires knowing when it is wrong.
“It’s not bad to use these tools,” said Dr. Byron Crowe, an internal medicine physician at the hospital. “You just have to use them the right way.”
He gave the group an analogy.
“Pilots use GPS,” Dr. Crowe said. But, he added, airlines “have a very high standard of reliability.” In medicine, he said, using chatbots “is very tempting,” but the same high standards must be applied.
“It’s a great thought partner, but it’s not a substitute for deep mental expertise,” he said.
When the session ended, the instructors revealed the real reason for the patient’s knee swelling.
It turned out to be a possibility that all groups had considered and that GPT-4 had proposed.
She had Lyme disease.
Olivia Allison contributed reporting.