A few months ago, my doctor showed off an AI transcription tool he used to record and summarize patient meetings. In my case, the summary was fine, but researchers cited in this report by The Associated Press have found that’s not always the case for transcriptions created by OpenAI’s Whisper, which powers a tool many hospitals use — sometimes it just makes things up entirely.
Whisper is used by a company called Nabla for a tool that it estimates has transcribed 7 million medical conversations, according to AP. More than 30,000 clinicians and 40 health systems use it, the outlet writes. The report says that Nabla officials “are aware that Whisper can hallucinate and are addressing the problem.” In a blog post published Monday, execs wrote that their model includes improvements to account for the “well-documented limitations of Whisper.”
According to the researchers, “While many of Whisper’s transcriptions were highly accurate, we find that roughly one percent of audio transcriptions contained entire hallucinated phrases or sentences which did not exist in any form in the underlying audio… 38 percent of hallucinations include explicit harms such as perpetuating violence, making up inaccurate associations, or implying false authority.”
The researchers noted that “hallucinations disproportionately occur for individuals who speak with longer shares of non-vocal durations,” which they said is more common for those with a language disorder called aphasia. Many of the recordings they used were gathered from TalkBank’s AphasiaBank.
One of the researchers, Allison Koenecke of Cornell University, posted a thread about the study with several examples of hallucinated transcriptions.
The researchers found that the AI-added words could include invented medical conditions or phrases you might expect from a YouTube video, such as “Thank you for watching!” (OpenAI reportedly used Whisper to transcribe more than a million hours of YouTube videos to train GPT-4.)
OpenAI spokesperson Taya Christianson emailed a statement to The Verge:
We take this issue seriously and are continually working to improve, including reducing hallucinations. For Whisper use on our API platform, our usage policies prohibit use in certain high-stakes decision-making contexts, and our model card for open-source use includes recommendations against use in high-risk domains. We thank researchers for sharing their findings.
On Monday, Nabla CTO Martin Raison and machine learning engineer Sam Humeau published a blog post titled “How Nabla uses Whisper.” Raison and Humeau say Nabla’s transcriptions are “not directly included in the patient record”; instead, a second layer of checking is performed by querying a large language model (LLM) against the transcript and the patient’s context, and “Only facts for which we find definitive proof are considered valid.”
They also say the tool has processed “9 million medical encounters” and that “while some transcription errors were sometimes reported, hallucination has never been reported as a significant issue.”
Update, October 28th: Added blog post from Nabla.
Update, October 29th: Clarified that the study from researchers at Cornell University and other institutions was peer-reviewed.
Correction, October 29th: A previous version of this story cited ABC News. The story cited was published by The Associated Press, not ABC News.