Dear Aventine Readers,
Remember when we wrote about AI large language models and their hallucination problem two years ago? Well, hallucinations are still here, and in some cases are getting worse, putting any company using AI tools at risk. But there are ways to mitigate the dangers. In this issue we look at the workarounds and safety measures companies are testing out in order to make large language models safe enough to use, and at the regulatory and legal frameworks that could be deployed to manage any potential harm done by these models in the future.
Also in this issue:
Thanks for reading,
Danielle Mattoon,
Executive Director, Aventine
We Can’t Escape AI’s Hallucinations. What Does That Mean for Its Adoption?
In 2022, British Columbia resident Jake Moffatt found some solace in the words of a chatbot after his grandmother died. He was told by Air Canada’s customer service software that he could book a regular flight to attend his grandmother's funeral and apply for a reduced-rate bereavement fare later. But when he requested that discount, the airline informed him that the chatbot had been wrong: It had described a company policy that did not exist. Air Canada refused to pay a refund, arguing that the chatbot was a separate entity responsible for its own actions.
A 2024 tribunal ruled in Moffatt’s favor, finding that Air Canada was responsible for the bot, its output and the repercussions; the company was ordered to pay damages and tribunal fees.
The prospect of automated systems communicating false information to customers is a chilling one for companies looking to deploy modern artificial intelligence. While large language models (LLMs) underpinning tools like OpenAI’s ChatGPT and Anthropic’s Claude are increasingly able to write useful research reports, answer science questions and compose computer code, they still invent facts, a behavior commonly referred to as hallucinating. In fact, some of the most advanced AI models publicly available, which are capable of reasoning through certain complex problems, are hallucinating more than their predecessors, making errors on some types of factual tasks almost half of the time.
Air Canada’s experience does not stand alone. There have been other high-profile examples of AI systems hallucinating and causing reputational damage. Last year, the European logistics company DPD shut down part of its customer service bot after it was prodded into swearing at a customer and describing the company as the “worst delivery service company in the world.” Earlier this year, the UK-based company Virgin Money had to apologize after a chatbot admonished a user for using the word “virgin.” And just last month, the US software startup Cursor had to run damage control after its chatbot told customers about a radical change in its usage policy that turned out to be entirely fictional. There are also numerous cases of individual employees using LLMs without thoroughly checking their output, resulting in, for instance, a spate of court filings that contain fictional citations.
This is an obvious red flag for organizations exploring how to make best use of LLMs without landing themselves in trouble. “If you can get one of these language models to be about 80 percent accurate, [that means] 20 percent of the output is just not going to be accurate,” said Albert Sanchez Graells, a law professor at the University of Bristol in the UK who works on digital regulation. “If I had an employee that screws up every fifth time, how long before I fire them?”
Aventine spoke with experts in LLMs and AI safety, as well as legal and business experts, to understand the persistence of hallucinations and how it might shape the adoption of artificial intelligence going forward. While there are many workarounds that mitigate the errors AI makes, there is no obvious solution to the underlying problem: LLMs hallucinate. With little clear guidance from regulators about how to proceed, companies must seek to understand the limitations of these AI models, properly assess the tasks they’re asking them to perform, and determine their appetite for risk. Businesses must also recognize that in some cases they might be better off not using these tools at all, at least for the time being.
Not all LLMs hallucinate equally
For as long as LLMs have existed, so have hallucinations. The phenomenon was the topic of our inaugural newsletter in June 2023. At the time, David Ferrucci, the computer scientist who led the team that built IBM’s AI system Watson, explained that hallucinations were an inherent byproduct of the LLM architecture. “They're designed to generate sequences of words that don't necessarily appear in any particular place,” he said. “In some sense, they're always hallucinating.” We just tend to call the issue hallucination when the AI gets something wrong; the rest of the time, we’re perfectly happy.
Hallucinations persist, though the rate at which they occur varies dramatically between different types of model. Over the past two years, the overall trend has been for hallucination rates in many AI models to fall. According to data collected by the AI company Hugging Face and analyzed by the technologist Jakob Nielsen, the hallucination rate of LLMs has so far decreased by around 3 percentage points each year. Some of the latest LLMs now hallucinate at rates between 1 and 3 percent, according to analysis by Vectara, a startup that helps companies deploy these AI models. This may be partly due to the size of the models: As they have grown in complexity, measured by their number of parameters, many have seen their hallucination rates fall. Improvement may also be a result of tools layered on top of the models, notes Jiacheng Liu, an AI researcher working on a PhD at the University of Washington and at the nonprofit AI research center Ai2. Such systems may, for example, filter out clearly incorrect content using hard-coded rules.
Yet the most advanced reasoning models, such as OpenAI’s o3 and DeepSeek’s R1, seem to buck this trend of decreasing hallucination rates. These models are built on LLMs but use so-called “chain of thought” approaches to break complex problems into smaller chunks that they solve sequentially, and they have been able to solve more advanced problems than regular LLMs, performing well on mathematical and scientific problems, for instance. Yet they also appear to hallucinate more frequently than regular LLMs, according to Vectara’s analysis. A report published by OpenAI about its latest reasoning models showed the same pattern: When asked to summarize publicly available information about people, for instance, OpenAI’s o1 reasoning model, released in 2024, hallucinated 16 percent of the time, while its newer models, o3 and o4-mini, hallucinated 33 percent and 48 percent of the time respectively.
Why are notionally smarter models making more mistakes? Researchers aren’t totally sure. Liu told Aventine that there may be a trade-off between reasoning and hallucinations. “There may be a competition between factual knowledge and reasoning capabilities in the space of their parameters,” he said. Reasoning requires an element of creativity: Rather than provide the most likely answer, reasoning models attempt to justify each logical step they take to reach a solution, pursuing multiple lines of thinking before finally presenting a user with the best answer. Errors that go undetected as part of that process could compound while a model works through a problem. At the time of writing, OpenAI had not responded to a request for comment. The company’s report explained that more research is needed to understand the cause of the increased hallucinations in its models.
A problem that can only be mitigated, not solved
Mitigating the impact of hallucinations is an ongoing research problem, but there are already a variety of strategies and tools to fend them off. Iain Mackay, director of AI safety and government at Faculty AI, a London-based company that advises organizations on deploying AI, pointed out that one of the first considerations is simply choosing the right model. It might be, for example, that a model built by Anthropic hallucinates less for a given task than one built by OpenAI. Large organizations, such as the law firm Linklaters, he said, have started to audit AI models to find the best ones for specific sorts of tasks.
It is also possible to layer extra systems on top of LLMs to decrease hallucinations. Retrieval-augmented generation (RAG), a technology first developed in 2020 that has gotten traction in the last two to three years, was mentioned by every expert Aventine spoke with. It allows generative AI systems to look up facts in a cache of documents — effectively a database of information created and owned by the company deploying the LLM — to help ensure that the facts are correct. Models can also be fine-tuned to perform better on highly domain-specific tasks or be forced to provide citations for claims. Additionally, filters can ensure that a model’s output doesn’t include, say, offensive language or personally identifying information, explained Mike Miller, a director of AI product management at Amazon Web Services. Or a second AI model can be used to study the factual statements in the output of a first, explained Mackay, an approach that takes advantage of the fact that different models have different strengths.
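To make the RAG idea concrete, here is a minimal sketch, in Python, of what such a pipeline can look like. It is illustrative only: the call_llm helper is a hypothetical stand-in for whichever model API a company actually uses, and the keyword-overlap retrieval step is a deliberately crude substitute for the vector search that production systems typically rely on.

```python
# Minimal RAG sketch. Assumptions: `call_llm` is a hypothetical placeholder for
# a real model API, and retrieval is naive keyword overlap rather than a proper
# vector search, purely to keep the example self-contained.

def retrieve(question: str, documents: list[str], k: int = 3) -> list[str]:
    """Return the k documents that share the most words with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to an LLM provider's API."""
    raise NotImplementedError("wire this up to the model of your choice")


def answer_with_rag(question: str, documents: list[str]) -> str:
    """Ground the model's answer in the company's own documents."""
    context = "\n\n".join(retrieve(question, documents))
    prompt = (
        "Answer the question using ONLY the policy excerpts below. "
        "If they do not contain the answer, say you don't know.\n\n"
        f"Policy excerpts:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)
```

The same pattern extends to the other safeguards mentioned above: the model’s draft answer can be passed through output filters, or handed to a second model along with the retrieved documents and asked whether every factual claim is supported.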
Amazon has developed a system using an approach called automated reasoning to convert policies written in natural language — rules about an airline’s ticketing practices, for example — into formal logic statements that an AI model can then use to ensure it doesn’t go off-message. “Say a user asks an [airline chatbot], ‘Hey. I bought my ticket thirty-one days ago, is it eligible for a refund?’ Well, the ticket refund policy may depend on whether you booked that ticket through the carrier [or a partner],” said Miller. The technology can codify that and make sure a customer gets the correct information, he added. Liu, meanwhile, is part of a team that has built a tool to help identify the source of some hallucinations in training data, which could be used to scour datasets for misinformation that causes models to provide inaccurate responses.
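Amazon has not published the internals of its automated reasoning system, but the general pattern of encoding a policy as explicit rules that a chatbot’s answer is checked against can be sketched simply. The refund policy, the Ticket fields and the helper names below are all hypothetical, chosen only to mirror Miller’s airline example.

```python
# Hypothetical sketch: a natural-language policy ("refundable within 30 days,
# direct bookings only") encoded as hard rules that the chatbot's claim is
# checked against before it reaches the customer. Not Amazon's implementation.

from dataclasses import dataclass


@dataclass
class Ticket:
    days_since_purchase: int
    booked_directly: bool  # booked with the carrier rather than a partner


def refund_eligible(ticket: Ticket) -> bool:
    # Assumed policy, for illustration only.
    return ticket.days_since_purchase <= 30 and ticket.booked_directly


def check_bot_claim(ticket: Ticket, bot_says_refundable: bool) -> str:
    """Correct the chatbot if its answer contradicts the encoded policy."""
    actual = refund_eligible(ticket)
    if bot_says_refundable != actual:
        verdict = "eligible" if actual else "not eligible"
        return f"Correction: this ticket is {verdict} for a refund under current policy."
    return "Bot answer is consistent with policy."


# Miller's example: a ticket bought 31 days ago, booked directly with the carrier.
print(check_bot_claim(Ticket(days_since_purchase=31, booked_directly=True),
                      bot_says_refundable=True))
```

The point is that the policy lives in deterministic code rather than in the model’s weights, so the chatbot cannot invent a bereavement fare or a refund window that does not exist.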
But these approaches don’t solve the problem entirely. Even when some of these tools are used in combination, some errors still get through. Hallucinations are “acknowledged to be just one of the things that we're going to have to work around,” said Kurt Muehmel, head of AI strategy at Dataiku, a company that helps large companies to adopt AI tools. “It seems like they're not going away.”
A mixed bag of adoption
If hallucinations can’t be eliminated, organizations must think carefully about how and where they deploy LLMs. Given the high potential for things to go wrong, adoption might be expected to run along what are now familiar lines: Industries that are more heavily regulated, such as finance and health care, are likely to be more cautious about adopting the technology than others such as tech or retail.
Adoption rates to date seem to back that up. A McKinsey & Co. report published in March showed that 71 percent of companies now use generative AI in at least one business function, up from 65 percent in early 2024. Adoption in the health care and financial services sectors is lower, at 63 percent and 65 percent respectively. Adoption in the technology sector, meanwhile, is much higher, at 88 percent. And within technology there are some pockets of intense adoption: One recent survey shows that over 97 percent of software developers have used AI coding tools at work, and representatives from both Google and Microsoft have said that as much as 30 percent of new code written inside the companies is now AI-generated.
Interestingly, though, across all industries there appears to be greater variation in adoption across different work functions than there is between sectors. There is, for instance, far higher adoption in areas such as marketing and sales, at 42 percent, compared to functions such as inventory management or legal, where the adoption rates are 7 percent and 11 percent respectively. Muehmel said that Dataiku observed this trend among its clients. “In a pharmaceutical company, sure, you have, super strict processes in certain parts of the business where you absolutely cannot have error,” he said, while other business processes might be “super low stakes, maybe high cost, maybe high effort, maybe high labor, which could really benefit from automation.”
Even so, determining which applications LLMs can be used in and assessing what guardrails might be necessary is onerous and costly, particularly for smaller, less well-capitalized companies.
A lack of clear guidance
Companies clearly have an interest in reducing the rate of hallucinations to avoid potential legal action or reputational damage. Yet there is also the possibility that the phenomenon could one day be policed by regulation, which would give companies a clear mandate on how exactly they should handle hallucination risk. Right now, however, that prospect seems far off.
The EU’s AI Act — widely seen as the most forward-looking regulatory framework globally — still offers little direct guidance on LLMs, said Sanchez Graells. A code of practice for how organizations use general purpose AI, which would include LLMs, has been repeatedly delayed and potentially watered down by lobbyists. It is currently in its third draft and is expected to be finalized at any moment. Rules around the use of AI in the US remain in limbo as a result of a review instigated by the Trump administration that is designed to create policy that will “retain global leadership” in the technology.
So what might regulation look like? Perhaps not rules for AI itself, but for the professionals using it. “What we may have is professional body regulation,” said Sanchez Graells, where professionals such as doctors, lawyers and accountants are given guidance about how to use LLMs in their work. But even then, it is unclear exactly what such guidance would mandate. Just as it is not illegal for doctors to use Google to inform their work, it seems unlikely that personal use of AI among professionals could become illegal; instead, professionals are simply expected to discharge their duties to the best of their abilities, and are held liable for the mistakes they make.
“Anytime that there is a claim about expertise made, then the liability has to follow,” said Sanchez Graells. “I don't think any company can get out of [a legal challenge by] saying, ‘This contract was written by the trainee. They were not very good.’ I wouldn't treat the LLM differently, because in the end, it's an organizational choice.”
So every organization will instead need to feel its own way through the deployment of LLMs, guided by its own risk tolerance. One idea that came up frequently in the conversations Aventine had was that organizations must get used to dealing with AI error rates. Rather than thinking about AI models in a binary way — deciding that AI can be used only if it doesn’t make mistakes — experts suggested that companies will have to determine what rate of error they can tolerate, and design, test, deploy and monitor their systems accordingly. While this can feel uncomfortable, it’s worth remembering that “human employees make errors too,” noted Muehmel.
As for concern around potential legal issues, different companies are likely to take different approaches. Mackay said that for Faculty’s clients, applications that present problems around legal liability are typically not pursued. “Use cases where firms perceive legal liability don't make it through the triage process at the moment, because there's a lack of clarity sometimes on what is required and what is permitted,” he added. For others, legal wrangling might just be the cost of doing business. “There may be a lawsuit or two, which does cost them money,” said Bhaskar Chakravorti, the dean of global business at The Fletcher School at Tufts University. “But over the long haul, it’s quite possible that saving on the wages of real human beings is just preferable.”
The risk of paying out damages may become less important over time: An emerging field of insurance promises to pay out for costs incurred if AI tools underperform. Insurers at Lloyd’s of London have launched products to cover such misfires, according to the Financial Times. One set of policies, developed by a startup called Armilla, would be based on benchmarking an AI tool to determine how well it works. If it later underperformed compared to that standard and that underperformance resulted in costs such as damages and legal fees, the policy would pay out. But it’s early days for these policies, and the insurer told the Financial Times that it would “be selective” about what systems it extended cover to.
For now, said Muehmel, adoption is still most likely to be found in “low stakes, high gain opportunities.”
Listen To Our Podcast
Learn about the past, present and future of artificial intelligence on our latest podcast, Humans vs Machines with Gary Marcus.
Advances That Matter
Google’s algorithm-designing AI can crack real-world problems. Making more efficient use of data centers, designing AI chips and dreaming up faster ways to manipulate algebra: A new AI system developed by Google DeepMind, called AlphaEvolve, has achieved all that and more. AlphaEvolve, which is a collection of AI tools that work together, can be prompted like a regular large language model to find solutions to complex problems. It uses Google’s Gemini large language models (LLMs) to write algorithms for a given challenge, runs them, and assesses the results and efficiency of the solutions. It then takes the best candidates and keeps trying to improve them until it can do no better. Some of the results are striking. In one test, AlphaEvolve developed a new way to multiply grids of numbers, an algebraic task called matrix multiplication, and found a solution for multiplying 4x4 grids faster than the previous record, which relied on a method developed by humans 56 years ago. In another, it worked out a way for Google to free up 0.7 percent of its computing resources across its data centers. It also found ways to improve the power consumption of Google’s own AI chips, developed faster methods to train AI algorithms, and came up with new solutions to mathematical puzzles that beat existing approaches in 20 percent of cases. The tool is resource-intensive to run, according to Nature, and it’s unclear how well it will perform on tasks that aren’t described by Google. But it’s yet another example of how AI can increasingly tackle problems and find solutions of the kind that were previously accessible only to humans.
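DeepMind has not released AlphaEvolve’s code, but the loop described above, in which an LLM proposes candidate programs that are automatically run and assessed, with the best ones refined further, can be sketched in a few lines. Both helper functions below are hypothetical placeholders: llm_propose_variant stands in for the Gemini call that rewrites candidates, and score for the automated evaluator that runs them.

```python
# Schematic propose-evaluate-select loop, loosely in the spirit of AlphaEvolve.
# A simplification of the generate-run-assess cycle described above: the real
# system works with multiple best candidates and far richer evaluation.

def llm_propose_variant(program: str) -> str:
    """Hypothetical: ask a code-writing LLM to rewrite `program` into a new candidate."""
    raise NotImplementedError("replace with a call to a code-generation model")


def score(program: str) -> float:
    """Hypothetical: run the candidate and measure its correctness and efficiency."""
    raise NotImplementedError("replace with an automated evaluator")


def evolve(seed_program: str, generations: int = 100, children_per_gen: int = 8) -> str:
    """Repeatedly mutate the best program found so far, keeping any improvement."""
    best, best_score = seed_program, score(seed_program)
    for _ in range(generations):
        for _ in range(children_per_gen):
            candidate = llm_propose_variant(best)
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
    return best
```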
The world’s first personalized CRISPR treatment. A baby with a life-threatening genetic disorder has become the first-ever patient to receive a custom-made CRISPR gene-editing treatment. The boy, named KJ, was born with CPS1 deficiency, a condition that affects one in 1.3 million babies. It means his body doesn’t produce a liver enzyme required to process the ammonia that enters the bloodstream as a byproduct of breaking down proteins. Untreated, the ammonia can build up and damage the brain. The best current treatment is a liver transplant, but about half of babies with the condition die before receiving a new organ. Doctors at the Children’s Hospital of Philadelphia, Pennsylvania, had been developing CRISPR-based approaches for the condition, and raced to develop a personalized treatment for KJ soon after his birth. Working with academic and industrial partners, the team produced a therapy and, when KJ was 6 months old, formally submitted it to the FDA, where it was approved within a week. The treatment uses the CRISPR editing tool to target the patient's specific mutation and edit his DNA so that it contains the correct instructions to produce the required enzyme. KJ was given one small dose of the treatment in February 2025, then larger doses in March and April. While it is too soon to say that he is cured, he is now able to consume more protein than he could before the treatment, while also taking lower doses of the drugs that were being used to control his condition. The work, described in the New England Journal of Medicine, is a milestone for personalized medicine, demonstrating that CRISPR could be used to treat a unique genetic mutation. Hanging over that, though, is a question of cost: Much of the work undertaken for KJ’s treatment was done for free, according to New Scientist. But the reality is that bespoke treatments such as this will be prohibitively expensive for most of us for many years, and potentially even decades, to come. Casgevy, for example, a treatment for sickle cell disease and the first CRISPR therapy approved by the FDA, costs $2.2 million per patient.
Chinese chips are getting better, fast. Unable to buy the world’s most advanced semiconductor chips due to export controls, yet determined to build out cutting-edge AI, China has been forced to invest heavily in its domestic semiconductor industry. The Economist reports that the effort is going well. The Chinese company Huawei, for instance, has built what’s known as a cluster — a device comprising multiple chips to perform large AI tasks — made up of 384 of its Ascend chips, which is said to perform better than one of Nvidia’s most powerful comparable systems, the NVL72 cluster. There are also other impressive homegrown AI chips emerging in China: Two startups — Cambricon and Hygon — are reportedly building chips that can be substituted for Nvidia’s A100, the company’s high-performance AI chip. Another, CXMT, is developing high-bandwidth memory chips typically used alongside AI chips that could soon compete with similar chips made by South Korea’s SK Hynix, the market leader. China is also developing some of the equipment required for chip manufacturing, such as tools for etching or depositing materials on the chips. But despite progress, there is more work to be done. The nation has yet to develop its own versions of the tools used in the world’s most advanced chip-making facilities. And for now its homegrown AI chips must be used in conjunction with equally homegrown control software — software that is reported to be far inferior to the industry-standard equivalent used with Nvidia’s devices. Still, this is just the start of China’s domestic chip sector. And the emergence of DeepSeek, the Chinese AI model whose training required less computing power than its competitors to provide industry-standard results, showed that the country is able to make significant strides without access to cutting-edge hardware.
Magazine and Journal Articles Worthy of Your Time
The Price of Remission, from ProPublica
7,000 words, or about 28 minutes
Drugs are expensive, particularly those created to treat rare conditions. Usually that is justified in simple economic terms: The pharmaceutical companies producing the drugs must spend tens or hundreds of millions of dollars to develop and test new therapies, many of which will never make it to market. If the company is to recoup its costs it must charge high prices, especially if the market for the medication is small. Yet this investigation by ProPublica digs into that justification, exploring how the drugmaker Celgene raised the price of one treatment — a cancer medication called Revlimid that is based on the controversial drug thalidomide — a total of 26 times, from $280 to $892 per daily pill, in less than two decades. The story describes how the company manipulated patent law and the US drug safety system to prevent other companies from making generic versions of the drug, which is used to treat multiple myeloma (of which there are about 36,000 new cases per year in the US), while simultaneously increasing the drug’s price. Those high prices — for a pill that costs 25 cents to manufacture — have put the drug out of reach for some patients and contributed enormously to insurance costs; in 2017, Revlimid was America’s most expensive drug. This story is a fascinating glimpse into the mechanics of drug pricing and its impact on patients.
North Korea Stole Your Job, from Wired (And later in the Wall Street Journal)
4,000 words, or about 16 minutes
The rise of remote work has created an opportunity for exploitation: For decades, the North Korean intelligence services have been steadily installing their own workers — posing as Westerners — in remote jobs within US businesses to make money that gets funneled back to the North Korean government. The scheme takes years of preparation. First, young North Koreans are trained to become highly proficient IT experts, then they are shipped abroad, most often to China or Russia. Once there, they apply for — and often secure — high-paying remote-work jobs in Western companies. This happens with the help of fixers inside the US who have been paid and cultivated — often unwittingly, at first — by North Korea to help perpetuate the fraud that the participants are American citizens working in the US. The fixers forge IRS paperwork, receive and transfer electronic deposits (taking a cut) and keep laptops through which North Korean employees can tunnel to their jobs, making them appear to be connecting from a US IP address. The Wired story explains how one hiring manager, Simon Wijckmans, founder of the web security startup C.Side, became suspicious of applicants with poor English who appeared to be cheating their way through interviews. But many companies haven’t been so vigilant and there are now several investigations into hiring scams. Prosecutors say that some of the companies that fell victim to the schemes include “a top-five national television network and media company, a premier Silicon Valley technology company, an aerospace and defense manufacturer, an iconic American car manufacturer, a high-end retail store, and one of the most recognizable media and entertainment companies in the world.”
The wild idea that we all get nutrients from the air that we breathe, from New Scientist
2,400 words, or about 10 minutes
Far-fetched as it may sound, the idea that we can absorb nutrients from the air we breathe is starting to gain some traction. A pair of researchers from the University of Newcastle and the Royal Melbourne Institute of Technology, both in Australia, tell New Scientist that they set out to disprove the idea, but then came across a significant amount of evidence that served to reinforce it. A series of historical experiments dating back as far as the 1960s, for instance, shows that iodine, an important micronutrient for thyroid function, appears to be readily absorbed by humans from the air. Other studies have shown that the same may be true for manganese, zinc, iron and all-trans retinoic acid, which are all important micronutrients for human health. While so-called aero nutrients may be found only in tiny concentrations in the air, there’s also some evidence to suggest that the area in the brain related to smell may facilitate a direct nose-to-brain absorption path, which seems to make the delivery of some airborne molecules surprisingly potent. There are still many open questions to consider — around which sorts of nutrients are delivered in this way and how efficient the absorption really is — but the research presents a potential new way to boost nutrient levels across populations if necessary.