The Trump administration sees an AI-driven federal workforce as more efficient. Instead, with chatbots unable to carry out critical tasks, it would be a diabolical mess

Imagine calling the Social Security Administration and asking, “Where is my April payment?” only to have a chatbot respond, “Canceling all future payments.” Your check has just fallen victim to “hallucination,” a phenomenon in which an automatic speech recognition system outputs text that bears little or no relation to the input.
Hallucinations are one of the many issues that plague so-called generative artificial intelligence systems like OpenAI’s ChatGPT, xAI’s Grok, Anthropic’s Claude or Meta’s Llama. These are design flaws, problems built into the architecture of these systems, that make them fundamentally unreliable. Yet these are the same types of generative AI tools that DOGE and the Trump administration want to use to replace, in one official’s words, “the human workforce with machines.”
This is terrifying. There is no “one weird trick” that removes experts and creates miracle machines that can do everything that humans can do, but better. The prospect of replacing federal workers who handle critical tasks—ones that could result in life-and-death scenarios for hundreds of millions of people—with automated systems that can’t even perform basic speech-to-text transcription without making up large swaths of text is catastrophic. If these automated systems can’t even reliably parrot back the exact information that is given to them, then their outputs will be riddled with errors, leading to inappropriate and even dangerous actions. Automated systems cannot be trusted to make decisions the way that federal workers—actual people—can.
Historically, “hallucination” hasn’t been a major issue in speech recognition. That is, although earlier systems could make transcription errors on specific phrases or misspell words, they didn’t produce large chunks of fluent, grammatically correct text that was never uttered in the corresponding audio input. But researchers have shown that recent speech recognition systems like OpenAI’s Whisper can produce entirely fabricated transcriptions. Whisper is a model that has been integrated into some versions of ChatGPT, OpenAI’s famous chatbot.
For example, researchers from four universities analyzed short snippets of audio transcribed by Whisper, and found completely fabricated sentences, with some transcripts inventing the races of the people being spoken about, and others even attributing murder to them. In one case a recording that said, “He, the boy, was going to, I’m not sure exactly, take the umbrella” was transcribed with additions including: “He took a big piece of a cross, a teeny, small piece…. I’m sure he didn’t have a terror knife so he killed a number of people.” In another example, “two other girls and one lady” was transcribed as “two other girls and one lady, um, which were Black.”
In the age of unbridled AI hype, with the likes of Elon Musk claiming to build a “maximally truth-seeking AI,” how did we come to have less reliable speech recognition systems than we did before? The answer is that while researchers working to improve speech recognition systems used their contextual knowledge to create models uniquely appropriate for that specific task, companies like OpenAI and xAI claim to be building something akin to “one model for everything” that can perform many tasks, including, according to OpenAI, “tackling complex problems in science, coding, math, and similar fields.” To do this, these companies use model architectures that they believe can serve many different tasks and train these models on vast amounts of noisy, uncurated data, instead of using system architectures and training and evaluation datasets that best fit the specific task at hand. A tool that supposedly does everything won’t be able to do any of it well.
The current dominant method of building tools like ChatGPT or Grok, which are advertised along the lines of “one model for everything,” uses some variation of large language models (LLMs), which are trained to predict the most likely sequence of words. Whisper simultaneously maps the input speech to text and predicts the “token” that comes next in its output. A token is a basic unit of text, such as a word, number, punctuation mark or word segment, used to analyze textual data. Giving the system two disparate jobs, speech transcription and next-token prediction, in combination with the large, messy datasets used to train it, makes hallucinations more likely.
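To make that token-by-token process concrete, here is a minimal sketch, assuming the openly released “openai/whisper-tiny” checkpoint and the Hugging Face transformers library (neither of which this article names), of how such a model turns audio into text by generating one token at a time; the silent dummy input stands in for real speech and is purely illustrative.

```python
# Minimal sketch (illustrative, not from the article): a Whisper-style model
# encodes the audio once, then writes the transcript token by token, the same
# next-token prediction that powers chatbots.
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Five seconds of silence at 16 kHz stands in for real speech in this toy example.
audio = torch.zeros(16_000 * 5)
inputs = processor(audio.numpy(), sampling_rate=16_000, return_tensors="pt")

# generate() builds the transcript autoregressively: each new token is chosen
# from a distribution conditioned on the audio and on the tokens emitted so far.
predicted_ids = model.generate(inputs.input_features, max_new_tokens=50)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```

Because every token is conditioned partly on the tokens already generated, a decoder rewarded for fluency can drift into confident-sounding text that has no basis in the audio, which is the hallucination failure the researchers describe.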
Like many of OpenAI’s projects, Whisper’s development was influenced by an outlook that the company’s former chief scientist summarized as the belief that “if you have a big dataset and you train a very big neural network,” it will work better. But arguably, Whisper doesn’t work better. Because its decoder is tasked with both transcription and token prediction, without precise alignment between audio and text during training, the model can prioritize generating fluent text over accurately transcribing the input. And unlike misspellings or other mistakes, large swaths of coherent text give the reader no clue that the transcription could be inaccurate, potentially leading people to rely on it in high-stakes scenarios without ever finding its failures. Until it’s too late.
OpenAI researchers have claimed that Whisper approaches human “accuracy and robustness,” a statement that is demonstrably false. Most humans don’t transcribe speech by making up large swaths of text that never existed in the speech they heard. In the past, those working on automatic speech recognition trained their systems using carefully curated data consisting of speech-text pairs in which the text accurately represented the speech. By contrast, OpenAI’s attempt to use a “general” model architecture rather than one tailored for speech transcription—sidestepping the time and resources it takes to curate data and adequately compensate data workers and creators—results in a dangerously unreliable speech recognition system.
If the current one-model-for-everything paradigm has failed at English-language speech transcription, a task most English speakers can perform perfectly well without further education, how will we fare if the U.S. DOGE Service succeeds in replacing expert federal workers with generative AI systems? Unlike the generative AI systems that federal workers have been told to use for tasks ranging from creating talking points to writing code, automatic speech recognition tools are at least constrained to the far better defined setting of transcribing speech; if models fail even there, the open-ended work of government gives us no reason to expect better.
We cannot afford to replace the critical tasks of federal workers with models that completely make stuff up. There is no substitute for the expertise of federal workers handling sensitive information and working in life-critical sectors ranging from health care to immigration. Thus, we need to promptly challenge, including in the courts if appropriate, DOGE’s push to replace “the human workforce with machines,” before this action brings immense harm to Americans.
This is an opinion and analysis article, and the views expressed by the author or authors are not necessarily those of Scientific American.