Before Feeding Bad Data to AI, Do Not Just Label It “False”

Larry

An obsolete refund policy is still in the knowledge base. Its filename says “old version—do not use,” so a teammate browsing the folder is unlikely to mistake it for current guidance. A RAG assistant has a different view: it may retrieve the paragraph with the refund amount while leaving the warning behind, then present the old rule to a customer as if it were valid.

Counterexamples in fine-tuning and evaluation data can fail in a related way. The 2026 paper “Negation Neglect” found that models fine-tuned on documents that explicitly negated false claims could still learn those claims as true. Ars Technica highlighted a reported 88.6% belief rate for negated documents. The study also found that local negation placed close to a claim was more reliable than a separate warning sentence—but “more reliable” is not the same as safe enough for production.

The useful question is therefore not whether models should ever see bad examples. It is whether the material can cross the data, retrieval, or output boundary and become an input to a real decision.

Classify the consequence before keeping the data

A warning that is adequate in a worksheet read by people may be dangerously weak in an indexed corpus. Start with where the information can travel and what happens if the warning is missed.

Data situation	Is a prose warning enough?	Additional guardrail
Low-risk counterexample read only by people	Sometimes, if its boundaries are unmistakable	Clear heading, example category, human review
RAG corpus contains expired or withdrawn material	No	Version and validity fields, retrieval filter, source citation
Fine-tuning or evaluation set contains false examples	No	Structured status, negation-focused tests, refusal or downranking rules
Content can affect customer promises, medical or legal guidance, payments, or permissions	Definitely not	Source verification, human approval, output checks, audit records

Human readers normally see layout, surrounding explanation, and document history together. Retrieval can separate those signals. A chunk may retain the claim while losing the sentence that rejects it. Fine-tuning introduces another uncertainty: the team may not be able to tell whether the model learned the logical relation or merely became more familiar with the false statement. Once an answer can authorize a payment, alter access, or create a customer commitment, “the warning looked clear to us” is not a meaningful control.
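The separation of claim and warning can be reproduced in a few lines. The following is a minimal sketch, not a real retrieval pipeline: the document text, the 90-day figure, the fixed two-sentence chunk size, the keyword scorer, and the `TaggedChunk` type are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical document text: the warning comes first, the claim later.
DOCUMENT = (
    "OLD VERSION, DO NOT USE. "
    "This policy was withdrawn and replaced. "
    "Customers may request a full refund within 90 days of purchase. "
    "Refunds are issued to the original payment method."
)

def split_sentences(text: str) -> list[str]:
    """Naive splitter on '. ', adequate only for this illustration."""
    return [s.strip().rstrip(".") + "." for s in text.split(". ") if s.strip()]

def make_chunks(sentences: list[str], size: int) -> list[str]:
    """Group consecutive sentences into fixed-size chunks."""
    return [" ".join(sentences[i:i + size]) for i in range(0, len(sentences), size)]

def top_chunk(query_terms: set[str], chunks: list[str]) -> str:
    """Toy keyword retriever: return the chunk sharing the most words with the query."""
    def score(chunk: str) -> int:
        words = set(chunk.lower().replace(".", "").replace(",", "").split())
        return len(words & query_terms)
    return max(chunks, key=score)

chunks = make_chunks(split_sentences(DOCUMENT), size=2)
hit = top_chunk({"refund", "days"}, chunks)

# The retrieved chunk carries the claim but not the warning.
print(hit)
print("warning present:", "DO NOT USE" in hit)

@dataclass
class TaggedChunk:
    text: str
    status: str  # copied from the parent document, so it survives chunking

# Attaching the document-level status to every chunk keeps the signal with the claim.
tagged = [TaggedChunk(text=c, status="withdrawn") for c in chunks]
print("status on retrieved chunk:", tagged[chunks.index(hit)].status)
```

The point of the second half is only that a status stored as chunk metadata travels with the claim, whereas a status stored as a sentence may not.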

Put guardrails at data, retrieval, and output boundaries

At the data boundary, make status machine-readable. A note beside a paragraph should not be the only indication that it is false or expired. Use fields such as claim, status:false, source, and valid_until. Withdrawn records should be isolated or excluded. Counterexamples retained for evaluation should live apart from records that are eligible to support an answer.
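One way to make status machine-readable is sketched here. The field names `claim`, `status`, `source`, and `valid_until` come from the text above; the `Record` class, the set of status values, and the `is_answer_eligible` and `partition` helpers are hypothetical names chosen for this example, and a real schema would likely need more states.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class Record:
    claim: str
    status: str                         # assumed values: "current", "expired", "withdrawn", "false"
    source: str
    valid_until: Optional[date] = None  # None means no stated expiry

# Assumption: only records explicitly marked current may support an answer.
ANSWERABLE_STATUSES = {"current"}

def is_answer_eligible(record: Record, today: date) -> bool:
    """A record supports answers only if it is current and not past its validity date."""
    if record.status not in ANSWERABLE_STATUSES:
        return False
    if record.valid_until is not None and record.valid_until < today:
        return False
    return True

def partition(records: list[Record], today: date) -> tuple[list[Record], list[Record]]:
    """Split records into an answer-eligible set and a quarantined set."""
    eligible = [r for r in records if is_answer_eligible(r, today)]
    quarantined = [r for r in records if not is_answer_eligible(r, today)]
    return eligible, quarantined
```

Keeping the quarantined list as a separate return value mirrors the advice to store counterexamples apart from answer-eligible records, so an evaluation job can still read them without the indexer ever seeing them.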

At the retrieval and test boundary, check what actually resurfaces. Filter expired, withdrawn, and low-trust records both when building the index and when retrieving evidence. Test paraphrases rather than repeating only the original wording: change the customer, date, and framing to see whether the model reconstructs the same bad claim. If a model can also act on what it retrieves, the related article “Before Letting an AI Agent Write Code, Put Checkpoints into the Task” offers a practical way to stop high-impact actions until a person has reviewed the evidence.
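The two filters and the paraphrase test might look roughly like this. It is a sketch under several assumptions: records are plain dictionaries with hypothetical `text`, `status`, `trust`, and `valid_until` keys, retrieval is a toy word-overlap match, and `answer_fn` stands in for whatever model or pipeline is under test. The substring check in `find_leaks` only catches a bad claim that resurfaces literally, so it is a floor for testing, not a ceiling.

```python
from datetime import date
from itertools import product

def allowed(record: dict, today: date) -> bool:
    """True only for current, trusted, unexpired records."""
    expiry = record.get("valid_until")
    return (
        record.get("status") == "current"
        and record.get("trust") != "low"
        and (expiry is None or expiry >= today)
    )

def build_index(records: list[dict], today: date) -> list[dict]:
    """First filter: keep disallowed records out of the index entirely."""
    return [r for r in records if allowed(r, today)]

def retrieve(index: list[dict], query: str, today: date) -> list[dict]:
    """Second filter: re-check at query time, since records can expire after indexing."""
    terms = set(query.lower().split())
    return [
        r for r in index
        if allowed(r, today) and terms & set(r["text"].lower().split())
    ]

def paraphrase_queries(customers, dates, framings) -> list[str]:
    """Vary customer, date, and framing instead of repeating one wording."""
    return [
        framing.format(customer=c, date=d)
        for c, d, framing in product(customers, dates, framings)
    ]

def find_leaks(answer_fn, queries, banned_markers) -> list[tuple[str, str]]:
    """Return (query, marker) pairs where a banned claim marker appears in an answer."""
    leaks = []
    for q in queries:
        answer = answer_fn(q).lower()
        leaks.extend((q, m) for m in banned_markers if m.lower() in answer)
    return leaks
```

A test run would pass the system's real answer function as `answer_fn` and treat any non-empty result from `find_leaks` as a failure to investigate.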

At the output boundary, contain the consequence. Answers about customer promises, healthcare, law, payments, and permissions should carry citations, validity checks, and an explicit human approval point. A stop control is not enough if an action can fail after making partial changes. The related article “When an Automation Fails Halfway, Who Cleans It Up?” explains why the owner and compensating action should be known before automation begins.
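An output gate for the high-impact topics named above could be sketched as follows. The topic labels, the `Citation` and `DraftAnswer` types, and the three outcomes are hypothetical; how a draft gets its topic label and who counts as an approver are left open, and the sketch does not address partial-failure cleanup.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

# Assumed labels for the high-impact areas listed in the text.
HIGH_IMPACT_TOPICS = {"customer_promise", "healthcare", "law", "payment", "permission"}

@dataclass
class Citation:
    source: str
    valid_until: Optional[date] = None  # None means no stated expiry

@dataclass
class DraftAnswer:
    text: str
    topic: str
    citations: list[Citation] = field(default_factory=list)
    approved_by: Optional[str] = None   # name of the human approver, if any

def release_decision(draft: DraftAnswer, today: date) -> tuple[str, str]:
    """Return (decision, reason); decision is 'release', 'block', or 'hold_for_approval'."""
    if draft.topic not in HIGH_IMPACT_TOPICS:
        return "release", "topic is not high-impact"
    if not draft.citations:
        return "block", "high-impact answer has no citation"
    if any(c.valid_until is not None and c.valid_until < today for c in draft.citations):
        return "block", "at least one cited source is past its validity date"
    if draft.approved_by is None:
        return "hold_for_approval", "waiting for a named human approver"
    return "release", "cited, valid, and approved"
```

Returning a reason alongside the decision gives the audit record something concrete to store for each blocked or held answer.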

These controls can be proportional. A low-risk internal teaching set may need clear separation and sampling. A system that can affect money, health, rights, access, or external commitments needs isolation, targeted evaluation, output restrictions, and accountable human review. The mistake is not keeping a bad example; it is treating a sentence addressed to a human reader as a complete machine safety mechanism.

AI handoff card

Give the following instruction to an agent that can inspect the current workspace, repository, project, or operational records. The initial pass is read-only and must not alter data or configuration.

Begin with the workspace, repository, project files, or operational records you can currently access. In read-only mode, locate the first concrete instance of content that is expired, withdrawn, intentionally false, or retained as a counterexample but could still be treated as usable fact by a model, fine-tuning process, or RAG system. Cite direct evidence such as a path, field, record, or configuration. Keep observed facts separate from inference, and mark unavailable facts as “needs confirmation.”

For that one instance, inspect whether the data layer has claim, status:false, source, valid_until, or equivalent machine-readable controls and whether the record is isolated. Then examine retrieval or evaluation filters and tests, including paraphrased queries. Finally, identify citation, validity, refusal, and human-approval requirements at the output boundary. Do not write files, rebuild an index, trigger external actions, or handle account, payment, personal-data, medical, legal, or permission changes.

If no relevant project material is accessible, ask at most one precise question for a read-only location. End with exactly one decision—“proceed,” “limited trial,” or “pause”—supported by direct evidence. Include unresolved items, one reversible action that can be taken now, and the boundary where a named human owner must review or approve further work.

A kitchen test for warning-only controls

Figure: four-panel comic in which Maya stops before using two similar white kitchen powders, switches containers, stores them separately, and asks another person to verify them.

Maya faces two look-alike containers of white powder in the kitchen and cannot safely tell which one to use.
She pauses before scooping, knowing that a small warning note is easy to overlook.
Maya moves the powders into clearly different containers and stores them apart.
A second person verifies the contents, making separation plus human confirmation the memory cue for keeping bad data out of a model.

References

arXiv: “Negation Neglect: When Models Fail to Learn Negations in Training.” https://arxiv.org/abs/2605.13829 (2026-05-19)
Ars Technica: “LLMs believe false statements even after explicit warnings that they’re false.” https://arstechnica.com/ai/2026/05/llms-believe-false-statements-even-after-explicit-warnings-that-theyre-false/ (2026-05-21)
ACL Anthology: García-Ferrero et al., “This is not a Dataset: A Large Negation Benchmark to Challenge Large Language Models,” EMNLP 2023, DOI 10.18653/v1/2023.emnlp-main.531. https://aclanthology.org/2023.emnlp-main.531/ (2023-12)
ACL Anthology: Varshney et al., “Investigating and Addressing Hallucinations of LLMs in Tasks Involving Negation,” Proceedings of TrustNLP 2025, DOI 10.18653/v1/2025.trustnlp-main.37. https://aclanthology.org/2025.trustnlp-main.37/ (2025-05)

