The Saw, the Crooked Cut, and the 45 Percent

You run a query through your AI assistant, and it hands you a clean, confident paragraph. Sources cited, tone measured, structure tight. You paste it into your draft and move on.

Three weeks later, someone points out that two of the citations don’t exist and a third says the opposite of what was claimed. The tool worked. The output looked right. But the cut was crooked from the start, and nothing in the process told you to check.

In October 2025, the BBC and the European Broadcasting Union published a study on AI-generated news responses. German media reduced it to a single number: 45 percent of AI answers are wrong.

The actual findings are more specific and more interesting. They don’t show that AI fails. They show what happens when a system gives you no friction at the exact moment you need it most.

What the Study Actually Tested

The European Broadcasting Union and the BBC set out to answer a narrow question: do AI assistants accurately represent news content from public service media? That framing matters. This was not a general test of AI reasoning or factual reliability. It was designed to measure how well language models reproduce and attribute content from specific broadcasters.

The study tested ChatGPT, Microsoft Copilot, Google Gemini, and Perplexity AI across 14 languages and 18 countries, with participation from 22 public broadcasters, including ARD, ZDF, Deutsche Welle, SRF, Rai, and NOS.

The methodology rested on four components.

  • First, the researchers derived 30 questions from real search queries on news sites of the BBC and its partners. These were not AI prompts. They were keyword-style search queries: “Trump trade war?” or “Who is the Pope?”
  • Second, each question was delivered with a uniform instruction: “Use [broadcaster] sources where possible.”
  • Third, technical barriers that normally block AI crawlers, such as robots.txt restrictions, were temporarily lifted so the models could access content they would otherwise never see.
  • Fourth, journalists evaluated roughly 3,000 responses across five criteria: factual accuracy, sourcing and attribution, opinion versus fact, editorialization, and contextual completeness.
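The four components above can be sketched as a small data model. This is an illustrative sketch, not the study's actual tooling: the names `CRITERIA`, `build_query`, and `Evaluation` are hypothetical, and the position of the instruction relative to the question is an assumption. Only the instruction wording, the five criteria, and the "at least one significant issue" rule come from the study as described in this article.

```python
from dataclasses import dataclass, field

# The five evaluation criteria as listed in this article.
CRITERIA = (
    "factual accuracy",
    "sourcing and attribution",
    "opinion versus fact",
    "editorialization",
    "contextual completeness",
)


def build_query(question: str, broadcaster: str) -> str:
    """Attach the uniform instruction to a keyword-style question.

    Assumption: the instruction follows the question. The article does
    not say in which order the two were combined.
    """
    return f"{question} Use {broadcaster} sources where possible."


@dataclass
class Evaluation:
    """One journalist's rating of one response (hypothetical structure)."""

    # Maps a criterion name to True if a significant issue was found.
    significant: dict = field(default_factory=dict)

    def has_significant_issue(self) -> bool:
        # A response is flagged as soon as ANY single criterion fails,
        # regardless of which one it is.
        return any(self.significant.get(c, False) for c in CRITERIA)


if __name__ == "__main__":
    print(build_query("Who is the Pope?", "BBC"))
    rating = Evaluation({"sourcing and attribution": True})
    print(rating.has_significant_issue())
```

Note what the flag does and does not record: a response with a correct answer and a bad citation is counted the same way as a response with a false claim.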

The design reflects a specific institutional perspective. It simulates what happens when an ordinary user types a Google-style question into a chatbot and expects a journalistically reliable answer. That is a legitimate scenario to test. But it is not the only one, and the constraints of this setup shape every result that follows.

The study set out to test how AI handles broadcaster content. But were these the right tools for the question?

What the Numbers Mean

The headline figure: 45 percent of all responses contained at least one “significant issue.” That phrase, taken directly from the study, became the basis for a wave of alarming coverage. But the study itself distinguishes between types of problems, and that distinction is central.

Roughly 20 percent of responses contained factual inaccuracies or outdated information. Around 31 percent had sourcing errors: missing citations, incorrect attributions, or invented references. There is overlap between these categories, but the core point holds. A large share of the flagged responses were not factually wrong. They were poorly sourced.
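The overlap can be bounded with simple arithmetic. The sketch below is a back-of-the-envelope check, not a figure from the study: it uses the rounded percentages quoted above and assumes that both the accuracy and the sourcing problems count as "significant issues," so that every such response sits inside the 45 percent. The function names are hypothetical.

```python
def overlap_bounds(accuracy_pct: int, sourcing_pct: int, flagged_pct: int) -> tuple:
    """Bound the share of responses with BOTH an accuracy and a sourcing issue.

    Inclusion-exclusion: accuracy + sourcing - both <= flagged,
    so both >= accuracy + sourcing - flagged. The overlap also cannot
    exceed the smaller of the two categories.
    """
    lower = max(0, accuracy_pct + sourcing_pct - flagged_pct)
    upper = min(accuracy_pct, sourcing_pct)
    return lower, upper


def flagged_share_without_accuracy_issue(accuracy_pct: int, flagged_pct: int) -> float:
    """Fraction of flagged responses that had no factual accuracy issue."""
    return (flagged_pct - accuracy_pct) / flagged_pct


if __name__ == "__main__":
    low, high = overlap_bounds(20, 31, 45)
    print(f"Overlap between accuracy and sourcing issues: {low} to {high} points")
    share = flagged_share_without_accuracy_issue(20, 45)
    print(f"Flagged responses without an accuracy issue: {share:.0%}")
```

Under these assumptions the overlap lies between 6 and 20 percentage points, and 25 of the 45 points, or about 56 percent of the flagged responses, had no factual accuracy issue at all.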

That is a real problem. A response that states something true but attributes it to a source that does not exist undermines trust in a different way than a response that states something false. Collapsing both into a single percentage, and then labeling that percentage “wrong,” obscures the nature of the failure. The difference between “this answer contains a significant issue” and “this answer is false” is not a technicality. It is the entire argument.

The coverage sawed past the study’s own wording: not ‘errors,’ but ‘issues.’

Five Methodological Weaknesses

The study is transparent about its design. It documents its methods and makes its limitations visible. But those limitations are significant, and understanding them changes what the results can actually tell us.

Search queries are not prompts

The questions used in the study came from real search behavior on news websites. They reflect how people use Google, not how people interact with language models. A search engine processes keywords. A language model processes relationships between concepts, and it responds to the structure, specificity, and framing of a prompt. Feeding a chatbot a bare keyword query like “Trump trade war” and then evaluating the output as if it were a considered response is a mismatch between tool and input. The study measures what happens when you treat a language model like a search engine. It does not measure what the model can do when used on its own terms.

A restricted source environment

The instruction “use [broadcaster] sources where possible” confined each model to the content of a single media organization. That design choice serves a specific purpose: it lets a broadcaster assess how AI handles its own content. But it also means the study does not test AI’s general capacity to synthesize information from multiple sources, weigh conflicting accounts, or construct a contextually complete answer. It tests citation behavior within an artificially narrow data space. To stay with the saw metaphor: this tests what happens when you give the tool only one type of material to work with. It does not test how well the tool cuts.

No prompt engineering

The models received no additional instructions beyond the base query. No guidance on tone, no request for source verification, no instruction to flag uncertainty. The researchers chose this deliberately: they wanted to simulate “typical user behavior.” But typical user behavior with a calculator does not tell you much about the calculator’s capacity. It tells you about the user’s. Research consistently shows that prompt quality has a measurable, often dramatic effect on output quality. Testing a model with minimal prompting and drawing conclusions about its reliability is like evaluating a car’s performance while it idles.
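To make the contrast concrete, here is a minimal sketch of a bare keyword query next to a prompt that adds the three kinds of guidance the study left out. The instruction wording and the function names are hypothetical, and nothing in the study shows how much such a prompt would change the results.

```python
def bare_query(keywords: str) -> str:
    """The study's approach: pass the search-style keywords through unchanged."""
    return keywords


def structured_prompt(keywords: str) -> str:
    """Wrap the same keywords in explicit guidance (illustrative wording only)."""
    return "\n".join(
        [
            f"Question: {keywords}",
            # Guidance on tone.
            "Answer in a neutral, factual tone.",
            # Request for source verification.
            "Cite a source for every claim, and cite only sources you can verify.",
            # Instruction to flag uncertainty.
            "If you are uncertain or the information may be outdated, say so explicitly.",
        ]
    )


if __name__ == "__main__":
    print(bare_query("Trump trade war"))
    print("---")
    print(structured_prompt("Trump trade war"))
```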

Which tools measure what I actually want to measure?

Temporary access, unreproducible conditions

During the test period, technical barriers that normally prevent AI systems from accessing broadcaster content were lifted. This gave the models more data to work with than they would have in any real-world scenario. The results therefore reflect a best-case access environment that cannot be replicated by ordinary users or researchers running independent evaluations. In one sense that strengthens the findings, because the problems appeared even under favorable conditions. But it also limits how far the results can be generalized.

A snapshot of moving targets

The study was conducted between late May and mid-June 2025, a period of rapid model evolution. ChatGPT was transitioning to GPT-4o, Google was rolling out Gemini 1.5, Microsoft was integrating Copilot more deeply, and Perplexity was combining live search with LLM technology for the first time. As Business Punk noted in its coverage, that is like evaluating autonomous driving in 2025 based on the technology of 2018 and then publishing the headline “self-driving cars are unsafe.” The study captures a moment. It does not establish a stable performance baseline, and it cannot, because the technology it evaluates changes faster than any traditional study design can accommodate.

What the Study Still Achieves

None of these criticisms invalidate the research. The study is methodologically documented, principled in its approach, and in many respects replicable. It identifies real, recurring patterns: systematic weaknesses in source attribution, problems with currency of information, and a persistent failure to separate factual reporting from editorial framing.

What it demonstrates, read carefully, is not that “AI is unreliable.” It demonstrates that AI used like a search engine, pointed at a restricted data source, with minimal prompting, produces results that fall short of journalistic standards. That is a meaningful finding. It is also a much more specific claim than the one most media coverage chose to make.

What the Media Got Wrong

The headlines tell their own story.

Heise: “AI misinformation: 45% of answers flawed.” Blick (Switzerland): “AI chatbots distort almost every other news response.” Tagesschau: “AI chatbots lie in 40% of answers,” a headline later quietly corrected. The pattern across outlets is consistent: the 45 percent figure is extracted from its context, “significant issue” becomes “wrong” or “false,” and the study’s own distinctions between sourcing errors and factual inaccuracies disappear.

A more accurate headline would have read: “Study: around 45% of AI-generated news responses show problems with sourcing or factual accuracy.” Less dramatic. More honest. And far less useful for generating clicks.

Business Punk offered a sharper reading. Behind the study, they argued, lies a power conflict: who controls future access to truth, media organizations or AI models? For the BBC and the EBU, the stakes are high: trust, influence, legitimacy. By framing AI assistants as error-prone, they secure a kind of moral interpretive authority. That does not make the study dishonest. But it makes the framing around it worth examining with the same critical attention the study demands we bring to AI outputs.

Why did media outlets turn ‘significant issues’ into fireworks about AI misinformation?

The Friction That Isn’t There

The deeper problem the study surfaces, perhaps unintentionally, is not about AI accuracy. It is about the absence of friction in systems that present uncertain outputs with high confidence.

A language model does not know anything. It predicts. When it produces a clean paragraph with cited sources, it is not reporting. It is generating the most statistically probable continuation of your input. If that input is vague, the output will be vague. If the input asks for sources, the model may produce plausible-looking citations that do not exist. Nothing in the interface signals this. Nothing slows you down. Nothing asks: are you sure you want to trust this?

That is the crooked cut. Not a tool that fails, but a tool that succeeds in a way that looks indistinguishable from reliable output, and a process that offers no moment of resistance where you might catch the error.

The Berlin lioness incident of 2023 is worth remembering here. A blurry phone video, a rumor, and within hours half of Germany believed a lioness was loose in the suburbs. It turned out to be a wild boar. No AI was involved. No algorithm amplified the story. Just people sharing what they believed, and media outlets citing each other. A wave of misinformation does not require AI. It requires only the absence of friction at the point where someone decides to trust and share.

The EBU/BBC study is not the scandal it was made into. It is a mirror. It shows how we interact with tools we do not fully understand, how we mistake fluent output for reliable output, and how, when the cut turns out crooked, we blame the saw instead of checking how we held it.


Sources & Further Reading

Original study: EBU/BBC (2025): News Integrity in AI Assistants (PDF)

German media coverage of the study: