Methodology

How Evidtrace Works

360 experiments. 6 verification layers. Zero tolerance for unverifiable claims.

Evidtrace is an independent AI vendor credibility intelligence platform. Every claim made by AI vendors is extracted, verified against independent evidence, and scored. The methodology behind the engine was developed through 360 structured experiments across real-world news and vendor claims.

Why AI Vendor Claims Need a Different Approach

AI vendors routinely make claims about model capabilities — benchmark scores, reasoning ability, safety metrics — that cannot be independently verified. Press releases are treated as evidence. Self-reported benchmarks are cited as fact.

Traditional fact-checking was built for politics and health. AI vendor claims require a fundamentally different approach: one that understands benchmarks, technical specifications, and the difference between self-reported and independently verified evidence.

24%
of vendor capability claims in our latest edition were independently supported by evidence

The 6-Layer Verification Pipeline

Every claim passes through six independent verification layers before receiving a final credibility verdict. Each layer was calibrated through dozens of structured experiments.

A

Claim Extraction

Layer 1 of 6
  • Automated decomposition of vendor claims into atomic, verifiable statements (a record shape is sketched below)
  • Identifies the specific claim, the claimed metric, the vendor, and the evidence source
  • Maps claim dependencies — when one claim relies on another being true
1,222
Claims extracted in the latest edition
567
Assessments processed
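
To make the decomposition concrete, below is a minimal sketch of what an atomic claim record could look like after extraction. The shape and field names (ClaimRecord, evidenceSource, dependsOn) are illustrative assumptions, not Evidtrace's published schema.

```typescript
// Illustrative shape of an atomic claim record after decomposition.
// All field names here are hypothetical, not Evidtrace's actual schema.
interface ClaimRecord {
  id: string;
  vendor: string;                 // who made the claim
  statement: string;              // the atomic, verifiable assertion
  metric?: { name: string; value: number; unit: string };
  evidenceSource: "self-reported" | "independent" | "unknown";
  dependsOn: string[];            // IDs of claims this one relies on
}

// A compound vendor sentence decomposed into two atomic claims,
// with the dependency between them made explicit.
const claims: ClaimRecord[] = [
  {
    id: "c-001",
    vendor: "ExampleAI",          // hypothetical vendor
    statement: "Model X scores 92.1% on Benchmark Y",
    metric: { name: "Benchmark Y", value: 92.1, unit: "%" },
    evidenceSource: "self-reported",
    dependsOn: [],
  },
  {
    id: "c-002",
    vendor: "ExampleAI",
    statement: "Model X outperforms competitors on reasoning",
    evidenceSource: "unknown",
    dependsOn: ["c-001"],         // only holds if the benchmark score holds
  },
];
```
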
B

Source Calibration

Layer 2 of 6
Engine insight: 54% of stories had wire-rewrite duplicates — outlets rewriting AP/Reuters copy appear as independent reporting but aren't.
  • Collapses duplicate sources — stories with more than 70% sentence overlap are treated as a single evidential source (see the overlap sketch below)
  • Detects PR pass-through content where press release language is republished with minimal editorial change
  • Identifies self-citation loops where a vendor's own claims circle back as "independent" sources
  • Maps anonymous source dependency and applies credibility discounts accordingly
35%
of stories contained near-verbatim press release text
19%
of stories had self-citation loops
31%
of claims anonymously sourced → 15% credibility discount
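
One plausible implementation of the 70% rule, sketched below: treat "sentence overlap" as the share of one story's normalised sentences that reappear in another, and collapse anything above the threshold into a single evidential source. The normalisation and the overlap measure are assumptions; only the 70% threshold comes from the methodology above.

```typescript
// Split an article into normalised sentences for overlap comparison.
function sentences(text: string): Set<string> {
  return new Set(
    text
      .split(/(?<=[.!?])\s+/)
      .map((s) => s.toLowerCase().replace(/[^a-z0-9 ]/g, "").trim())
      .filter((s) => s.length > 0),
  );
}

// Fraction of the shorter story's sentences that also appear in the other.
function sentenceOverlap(a: string, b: string): number {
  const sa = sentences(a);
  const sb = sentences(b);
  const [small, large] = sa.size <= sb.size ? [sa, sb] : [sb, sa];
  let shared = 0;
  for (const s of small) if (large.has(s)) shared++;
  return small.size === 0 ? 0 : shared / small.size;
}

// Collapse stories above the 70% threshold into one evidential source:
// keep the first occurrence, drop the wire rewrites.
function collapseDuplicates(stories: string[], threshold = 0.7): string[] {
  const kept: string[] = [];
  for (const story of stories) {
    if (!kept.some((k) => sentenceOverlap(k, story) > threshold)) {
      kept.push(story);
    }
  }
  return kept;
}
```
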
C

Confidence Intervals

Layer 3 of 6
  • Cross-checks headline figures against body text for numeric consistency (see the consistency-check sketch below)
  • Requires both relative and absolute figures for percentage claims — context-free percentages are flagged
  • Currency normalisation to prevent false numeric disagreements across international sources
  • Temporal scoring: provisional verdicts carry time-confidence markers
86%
of headline rounding was in the dramatic direction
42%
of stories had context-free percentages
73%
of 1-hour provisional verdicts matched the 24-hour final verdict
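
A minimal sketch of the headline-versus-body consistency check. How the engine actually parses figures is not published; this simply flags headline numbers that never appear in the body, which is where dramatic-direction rounding surfaces.

```typescript
// Pull every number (integers and decimals) out of a piece of text.
function extractNumbers(text: string): number[] {
  return (text.match(/\d+(?:\.\d+)?/g) ?? []).map(Number);
}

// Flag headline figures with no exact match in the body text.
function headlineBodyMismatches(headline: string, body: string): number[] {
  const bodyNumbers = new Set(extractNumbers(body));
  return extractNumbers(headline).filter((n) => !bodyNumbers.has(n));
}

// A headline rounded in the dramatic direction gets flagged:
console.log(headlineBodyMismatches(
  "Model X is 90% faster",
  "Benchmarks showed Model X completing tasks 87.4% faster.",
)); // -> [90]
```
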
D

Independence Check

Layer 4 of 6
  • Maps funding relationships for think tank and research sources cited in vendor claims
  • Flags expertise mismatches — when cited experts are commenting outside their domain
  • Detects official source monopolies where claims rely entirely on a single official statement
  • Screenshot evidence is capped at "supported" — never elevated to "confirmed" (see the verdict-cap sketch below)
15%
cited potentially conflicted think tanks
12%
had out-of-domain expert citations
68%
official-source dependency in Defence stories
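
The screenshot rule can be expressed as a cap over an ordered verdict scale, as sketched below. Only "supported" and "confirmed" appear in the methodology above; the other two labels and the scale itself are assumptions for illustration.

```typescript
// Ordered verdict scale, weakest to strongest. Only "supported" and
// "confirmed" come from the methodology; the rest are assumed labels.
const VERDICTS = ["unverifiable", "contested", "supported", "confirmed"] as const;
type Verdict = (typeof VERDICTS)[number];

// Screenshots can be fabricated, so when they are the only evidence
// the verdict is capped at "supported" and never reaches "confirmed".
function capForScreenshotEvidence(verdict: Verdict, onlyScreenshots: boolean): Verdict {
  if (!onlyScreenshots) return verdict;
  const cap = VERDICTS.indexOf("supported");
  return VERDICTS[Math.min(VERDICTS.indexOf(verdict), cap)];
}

console.log(capForScreenshotEvidence("confirmed", true));  // "supported"
console.log(capForScreenshotEvidence("confirmed", false)); // "confirmed"
```
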
E

Decision Engine

Layer 5 of 6
  • Tracks claim language escalation across news cycles — detecting scope creep without new supporting evidence (see the scope-creep sketch below)
  • Cross-references partial quotes against full transcripts for context-altering truncation
  • Distinguishes correlation from causation in science reporting and benchmark claims
  • Treats denials as evidence of position only — never as confirmation of the underlying claim
15%
showed scope creep without new evidence
12%
had context-altering quote truncation
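
Scope-creep detection might look like the sketch below: score the breadth of a claim's wording by counting strong quantifiers, then flag any later snapshot whose wording strengthens while its evidence set has not grown. The marker list and ClaimSnapshot shape are hypothetical simplifications.

```typescript
// How a claim is worded at one point in the news cycle, together with
// the evidence available at that time. A hypothetical shape.
interface ClaimSnapshot {
  wording: string;
  evidenceIds: string[];
  observedAt: Date;
}

// Crude breadth score: strong quantifiers widen a claim's scope.
const ESCALATION_MARKERS = ["all", "every", "always", "best", "first", "only"];
function scopeScore(wording: string): number {
  return wording
    .toLowerCase()
    .split(/\s+/)
    .filter((w) => ESCALATION_MARKERS.includes(w)).length;
}

// Scope creep: stronger wording with no new supporting evidence.
function hasScopeCreep(earlier: ClaimSnapshot, later: ClaimSnapshot): boolean {
  const noNewEvidence = later.evidenceIds.every((id) =>
    earlier.evidenceIds.includes(id),
  );
  return noNewEvidence && scopeScore(later.wording) > scopeScore(earlier.wording);
}
```
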
F

Shadow Evaluation

Layer 6 of 6
  • Measures correction propagation — tracking whether corrections reach downstream outlets
  • Detects high-emotion framing and raises the evidentiary threshold accordingly (see the threshold sketch below)
  • Evaluates claims independently against primary evidence, not prevailing narrative
  • Cross-language verification for international stories to detect translation-induced discrepancies
71%
of corrections were never carried by downstream outlets
r=0.67
correlation between emotion and spread speed
23%
had narrative lock-in filtering contradictory evidence
30%
had cross-language discrepancies
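
The emotion-adjusted threshold might work as sketched below: as an emotion score rises from 0 to 1, the number of independent sources required before a claim counts as supported scales up. The baseline of two sources and the linear scaling are illustrative, not Evidtrace's calibrated values.

```typescript
// Raise the evidentiary bar as emotional framing rises: a high-emotion
// story needs more independent sources before a claim is "supported".
// Baseline and scaling are assumed values, not calibrated ones.
function requiredIndependentSources(emotionScore: number): number {
  const base = 2;                           // baseline for neutral framing
  const clamped = Math.max(0, Math.min(1, emotionScore));
  return Math.ceil(base * (1 + clamped));   // up to double at maximum emotion
}

console.log(requiredIndependentSources(0.0)); // 2
console.log(requiredIndependentSources(0.9)); // 4
```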

Built on 360 Structured Experiments

Every rule in the verification engine traces back to a specific experiment. Five research phases, from foundational calibration to live engine testing.

Phase I, Foundation (EXP-001 → 070): Evaluator rubric, claim extraction, source evaluation
Phase II, Deepen (EXP-071 → 130): Deepened findings across all foundation areas
Phase III, Expand (EXP-131 → 200): Real-world systems analysis, competitor review
Phase IV, Build (EXP-201 → 270): Technical specs, MVP roadmap, architecture
Phase V, Engine Testing (EXP-271 → 360): Live verification against real editions

Experiment Categories

Methodology: 142
Source analysis: 41
Claim validation: 39
Narrative tracking: 29
UK calibration: 29
Edge cases: 22
Source weighting: 22
Numeric verification: 19
Virality analysis: 17

Featured Experiments

Six experiments that shaped the verification engine's core logic.

Source Analysis

The Wire Rewrite

54%

of stories had wire-rewrite duplicates. Outlets rewriting AP/Reuters copy appeared as independent reporting — inflating apparent source diversity without adding evidential weight.

Numeric Verification

The Headline Number

86%

of headline rounding was in the dramatic direction. Numbers were consistently rounded to make stories appear more significant, distorting the claims being verified.

Virality Analysis

The Outrage Multiplier

r=0.67

correlation between emotional framing and spread speed. High-emotion stories travelled faster but carried weaker evidence — requiring elevated evidentiary thresholds.

Narrative Tracking

The Correction Deficit

71%

of corrections were never carried by downstream outlets. Once a claim enters the information ecosystem, corrections rarely propagate — the original claim persists.

Source Weighting

The Anonymous Source Discount

31%

of claims were anonymously sourced. The engine applies a systematic 15% credibility discount to anonymous claims — calibrated against cases where sources were later identified.
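
As a worked example, a flat 15% discount composes multiplicatively with a base credibility score. The 15% figure comes from the experiment above; the base score and the function shape are illustrative.

```typescript
// Apply the flat 15% anonymous-sourcing discount to a base score.
const ANONYMOUS_DISCOUNT = 0.15;

function discountedCredibility(base: number, anonymous: boolean): number {
  return anonymous ? base * (1 - ANONYMOUS_DISCOUNT) : base;
}

console.log(discountedCredibility(0.8, true));  // ~0.68
console.log(discountedCredibility(0.8, false)); // 0.8
```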

Claim Validation

The Narrative Lock-In

23%

of cases showed narrative lock-in — once a dominant interpretation formed, contradictory evidence was filtered out or underweighted by subsequent coverage.


Edition at a Glance

567
Articles Assessed
1,222
Claims Extracted
357
Sources Verified
67
Providers Tracked
12
Categories Covered
24%
Supported
Verdict breakdown: 24% Supported, 25% Misleading, 51% Unverifiable

API Reference

Programmatic access to Evidtrace intelligence. All endpoints return JSON with full provenance metadata.

GET
/api/articles

Query assessed articles with filtering by date, category, provider, and verdict. Returns full article metadata with linked claim IDs.

GET
/api/claims

Search verified claims by keyword, provider, or verdict status. Each claim includes its full verification trail and confidence score.

GET
/api/providers

Provider credibility profiles with historical scores, claim counts, category presence, and trend data across editions.

GET
/api/meta

Edition metadata including assessment counts, date range, methodology version, and aggregate statistics.
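
A minimal usage sketch against the claims endpoint. The path matches the reference above, but the query parameter names, base URL, and response shape are assumptions.

```typescript
// Query verified claims for one provider, filtered by verdict status.
// Parameter names and response shape are assumed, not documented here.
async function fetchClaims(provider: string, verdict: string): Promise<unknown> {
  const url = new URL("https://evidtrace.com/api/claims");
  url.searchParams.set("provider", provider);
  url.searchParams.set("verdict", verdict);

  const res = await fetch(url);
  if (!res.ok) throw new Error(`Evidtrace API error: ${res.status}`);
  return res.json(); // JSON with full provenance metadata
}

fetchClaims("ExampleAI", "unverifiable").then((claims) => console.log(claims));
```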

EU AI Act Relevance

How Evidtrace supports deployer obligations under Regulation (EU) 2024/1689

Important

Evidtrace is supplementary vendor intelligence designed to inform procurement decisions. It does not satisfy deployer obligations under Article 26 on its own and should be used alongside task-specific validation, operational controls, and legal review.

Article 26 — Deployer Obligations

The EU AI Act imposes specific obligations on deployers of high-risk AI systems. Article 26 requires deployers to:

  • Take appropriate technical and organisational measures to ensure use in accordance with instructions
  • Assign human oversight to natural persons with necessary competence, training, and authority
  • Ensure input data is relevant and sufficiently representative for the intended purpose
  • Monitor operation and report to providers and market surveillance authorities where appropriate
  • Keep logs automatically generated by the AI system for at least six months
  • Inform affected employees and workers’ representatives when high-risk AI systems are used in the workplace
  • Inform affected persons when the system assists decisions about them
  • Accommodate the right to explanation in relevant cases
  • Comply with database registration obligations (for public authority deployers)

Fundamental Rights Impact Assessment

Certain deployers — notably public authorities and organisations providing public services — must conduct fundamental rights impact assessments prior to first use of high-risk AI systems.

Enforcement Timeline

February 2025 — Prohibited AI practices enforcement begins
August 2025 — General-purpose AI model obligations apply
August 2026 — High-risk system requirements (Annex III) and most remaining provisions
August 2027 — High-risk AI embedded in regulated products (Annex I/II)

Source: EU AI Act — Article 26, Regulation (EU) 2024/1689

Correction Policy

When errors are identified in Evidtrace assessments — whether by internal review, external scrutiny, or vendor response — they are corrected in the next edition. A changelog is maintained for each edition documenting material corrections, additions, and removals.

Vendors may submit corrections, context, or rebuttals via hello@evidtrace.com. Vendor responses are reviewed and, where appropriate, noted in subsequent assessments.

Previous editions remain available for audit trail purposes. Correction history is part of the assessment record.