Methodology

How Evidtrace Works

360 experiments. 6 verification layers. Zero tolerance for unverifiable claims.

Evidtrace is an independent AI vendor credibility intelligence platform. Every claim made by AI vendors is extracted, verified against independent evidence, and scored. The methodology behind the engine was developed through 360 structured experiments across real-world news and vendor claims.

Why AI Vendor Claims Need a Different Approach

AI vendors routinely make claims about model capabilities — benchmark scores, reasoning ability, safety metrics — that cannot be independently verified. Press releases are treated as evidence. Self-reported benchmarks are cited as fact.

Traditional fact-checking was built for politics and health. AI vendor claims require a fundamentally different approach: one that understands benchmarks, technical specifications, and the difference between self-reported and independently verified evidence.

24%
of vendor capability claims in our latest edition were independently supported by evidence

The 6-Layer Verification Pipeline

Every claim passes through six independent verification layers before receiving a final credibility verdict. Each layer was calibrated through dozens of structured experiments.

A

Claim Extraction

Layer 1 of 6
  • Automated decomposition of vendor claims into atomic, verifiable statements (a record shape is sketched below)
  • Identifies the specific claim, the claimed metric, the vendor, and the evidence source
  • Maps claim dependencies — when one claim relies on another being true
1,222
Claims extracted in the latest edition
567
Assessments processed
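
To make the decomposition concrete, below is a minimal sketch of what an atomic claim record could look like after extraction. The shape and field names (ClaimRecord, evidenceSource, dependsOn) are illustrative assumptions, not Evidtrace's published schema.

```typescript
// Illustrative shape of an atomic claim record after decomposition.
// All field names here are hypothetical, not Evidtrace's actual schema.
interface ClaimRecord {
  id: string;
  vendor: string;                 // who made the claim
  statement: string;              // the atomic, verifiable assertion
  metric?: { name: string; value: number; unit: string };
  evidenceSource: "self-reported" | "independent" | "unknown";
  dependsOn: string[];            // IDs of claims this one relies on
}

// A compound vendor sentence decomposed into two atomic claims,
// with the dependency between them made explicit.
const claims: ClaimRecord[] = [
  {
    id: "c-001",
    vendor: "ExampleAI",          // hypothetical vendor
    statement: "Model X scores 92.1% on Benchmark Y",
    metric: { name: "Benchmark Y", value: 92.1, unit: "%" },
    evidenceSource: "self-reported",
    dependsOn: [],
  },
  {
    id: "c-002",
    vendor: "ExampleAI",
    statement: "Model X outperforms competitors on reasoning",
    evidenceSource: "unknown",
    dependsOn: ["c-001"],         // only holds if the benchmark score holds
  },
];
```
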
B

Source Calibration

Layer 2 of 6
Engine insight: 54% of stories had wire-rewrite duplicates — outlets rewriting AP/Reuters copy appear as independent reporting but aren't.
  • Collapses duplicate sources — stories with more than 70% sentence overlap are treated as a single evidential source (see the overlap sketch below)
  • Detects PR pass-through content where press release language is republished with minimal editorial change
  • Identifies self-citation loops where a vendor's own claims circle back as "independent" sources
  • Maps anonymous source dependency and applies credibility discounts accordingly
35%
of stories contained near-verbatim press release text
19%
of stories had self-citation loops
31%
of claims anonymously sourced → 15% credibility discount
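
One plausible implementation of the 70% rule, sketched below: treat "sentence overlap" as the share of one story's normalised sentences that reappear in another, and collapse anything above the threshold into a single evidential source. The normalisation and the overlap measure are assumptions; only the 70% threshold comes from the methodology above.

```typescript
// Split an article into normalised sentences for overlap comparison.
function sentences(text: string): Set<string> {
  return new Set(
    text
      .split(/(?<=[.!?])\s+/)
      .map((s) => s.toLowerCase().replace(/[^a-z0-9 ]/g, "").trim())
      .filter((s) => s.length > 0),
  );
}

// Fraction of the shorter story's sentences that also appear in the other.
function sentenceOverlap(a: string, b: string): number {
  const sa = sentences(a);
  const sb = sentences(b);
  const [small, large] = sa.size <= sb.size ? [sa, sb] : [sb, sa];
  let shared = 0;
  for (const s of small) if (large.has(s)) shared++;
  return small.size === 0 ? 0 : shared / small.size;
}

// Collapse stories above the 70% threshold into one evidential source:
// keep the first occurrence, drop the wire rewrites.
function collapseDuplicates(stories: string[], threshold = 0.7): string[] {
  const kept: string[] = [];
  for (const story of stories) {
    if (!kept.some((k) => sentenceOverlap(k, story) > threshold)) {
      kept.push(story);
    }
  }
  return kept;
}
```
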
C

Confidence Intervals

Layer 3 of 6
  • Cross-checks headline figures against body text for numeric consistency (see the consistency-check sketch below)
  • Requires both relative and absolute figures for percentage claims — context-free percentages are flagged
  • Currency normalisation to prevent false numeric disagreements across international sources
  • Temporal scoring: provisional verdicts carry time-confidence markers
86%
of headline rounding was in the dramatic direction
42%
of stories had context-free percentages
73%
of 1-hour provisional verdicts matched the 24-hour final verdict
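
A minimal sketch of the headline-versus-body consistency check. How the engine actually parses figures is not published; this simply flags headline numbers that never appear in the body, which is where dramatic-direction rounding surfaces.

```typescript
// Pull every number (integers and decimals) out of a piece of text.
function extractNumbers(text: string): number[] {
  return (text.match(/\d+(?:\.\d+)?/g) ?? []).map(Number);
}

// Flag headline figures with no exact match in the body text.
function headlineBodyMismatches(headline: string, body: string): number[] {
  const bodyNumbers = new Set(extractNumbers(body));
  return extractNumbers(headline).filter((n) => !bodyNumbers.has(n));
}

// A headline rounded in the dramatic direction gets flagged:
console.log(headlineBodyMismatches(
  "Model X is 90% faster",
  "Benchmarks showed Model X completing tasks 87.4% faster.",
)); // -> [90]
```
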
D

Independence Check

Layer 4 of 6
  • Maps funding relationships for think tank and research sources cited in vendor claims
  • Flags expertise mismatches — when cited experts are commenting outside their domain
  • Detects official source monopolies where claims rely entirely on a single official statement
  • Screenshot evidence is capped at "supported" — never elevated to "confirmed" (see the verdict-cap sketch below)
15%
cited potentially conflicted think tanks
12%
had out-of-domain expert citations
68%
official-source dependency in Defence stories
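
The screenshot rule can be expressed as a cap over an ordered verdict scale, as sketched below. Only "supported" and "confirmed" appear in the methodology above; the other two labels and the scale itself are assumptions for illustration.

```typescript
// Ordered verdict scale, weakest to strongest. Only "supported" and
// "confirmed" come from the methodology; the rest are assumed labels.
const VERDICTS = ["unverifiable", "contested", "supported", "confirmed"] as const;
type Verdict = (typeof VERDICTS)[number];

// Screenshots can be fabricated, so when they are the only evidence
// the verdict is capped at "supported" and never reaches "confirmed".
function capForScreenshotEvidence(verdict: Verdict, onlyScreenshots: boolean): Verdict {
  if (!onlyScreenshots) return verdict;
  const cap = VERDICTS.indexOf("supported");
  return VERDICTS[Math.min(VERDICTS.indexOf(verdict), cap)];
}

console.log(capForScreenshotEvidence("confirmed", true));  // "supported"
console.log(capForScreenshotEvidence("confirmed", false)); // "confirmed"
```
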
E

Decision Engine

Layer 5 of 6
  • Tracks claim language escalation across news cycles — detecting scope creep without new supporting evidence (see the scope-creep sketch below)
  • Cross-references partial quotes against full transcripts for context-altering truncation
  • Distinguishes correlation from causation in science reporting and benchmark claims
  • Treats denials as evidence of position only — never as confirmation of the underlying claim
15%
showed scope creep without new evidence
12%
had context-altering quote truncation
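
Scope-creep detection might look like the sketch below: score the breadth of a claim's wording by counting strong quantifiers, then flag any later snapshot whose wording strengthens while its evidence set has not grown. The marker list and ClaimSnapshot shape are hypothetical simplifications.

```typescript
// How a claim is worded at one point in the news cycle, together with
// the evidence available at that time. A hypothetical shape.
interface ClaimSnapshot {
  wording: string;
  evidenceIds: string[];
  observedAt: Date;
}

// Crude breadth score: strong quantifiers widen a claim's scope.
const ESCALATION_MARKERS = ["all", "every", "always", "best", "first", "only"];
function scopeScore(wording: string): number {
  return wording
    .toLowerCase()
    .split(/\s+/)
    .filter((w) => ESCALATION_MARKERS.includes(w)).length;
}

// Scope creep: stronger wording with no new supporting evidence.
function hasScopeCreep(earlier: ClaimSnapshot, later: ClaimSnapshot): boolean {
  const noNewEvidence = later.evidenceIds.every((id) =>
    earlier.evidenceIds.includes(id),
  );
  return noNewEvidence && scopeScore(later.wording) > scopeScore(earlier.wording);
}
```
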
F

Shadow Evaluation

Layer 6 of 6
  • Measures correction propagation — tracking whether corrections reach downstream outlets
  • Detects high-emotion framing and raises the evidentiary threshold accordingly (see the threshold sketch below)
  • Evaluates claims independently against primary evidence, not prevailing narrative
  • Cross-language verification for international stories to detect translation-induced discrepancies
71%
of corrections were never carried by downstream outlets
r=0.67
correlation between emotion and spread speed
23%
had narrative lock-in filtering contradictory evidence
30%
had cross-language discrepancies
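
The emotion-adjusted threshold might work as sketched below: as an emotion score rises from 0 to 1, the number of independent sources required before a claim counts as supported scales up. The baseline of two sources and the linear scaling are illustrative, not Evidtrace's calibrated values.

```typescript
// Raise the evidentiary bar as emotional framing rises: a high-emotion
// story needs more independent sources before a claim is "supported".
// Baseline and scaling are assumed values, not calibrated ones.
function requiredIndependentSources(emotionScore: number): number {
  const base = 2;                           // baseline for neutral framing
  const clamped = Math.max(0, Math.min(1, emotionScore));
  return Math.ceil(base * (1 + clamped));   // up to double at maximum emotion
}

console.log(requiredIndependentSources(0.0)); // 2
console.log(requiredIndependentSources(0.9)); // 4
```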

Built on 360 Structured Experiments

Every rule in the verification engine traces back to a specific experiment. Five research phases, from foundational calibration to live engine testing.

Phase I, Foundation (EXP-001 → 070): Evaluator rubric, claim extraction, source evaluation
Phase II, Deepen (EXP-071 → 130): Deepened findings across all foundation areas
Phase III, Expand (EXP-131 → 200): Real-world systems analysis, competitor review
Phase IV, Build (EXP-201 → 270): Technical specs, MVP roadmap, architecture
Phase V, Engine Testing (EXP-271 → 360): Live verification against real editions

Experiment Categories

Methodology: 142
Source analysis: 41
Claim validation: 39
Narrative tracking: 29
UK calibration: 29
Edge cases: 22
Source weighting: 22
Numeric verification: 19
Virality analysis: 17

Featured Experiments

Six experiments that shaped the verification engine's core logic.

Source Analysis

The Wire Rewrite

54%

of stories had wire-rewrite duplicates. Outlets rewriting AP/Reuters copy appeared as independent reporting — inflating apparent source diversity without adding evidential weight.

Numeric Verification

The Headline Number

86%

of headline rounding was in the dramatic direction. Numbers were consistently rounded to make stories appear more significant, distorting the claims being verified.

Virality Analysis

The Outrage Multiplier

r=0.67

correlation between emotional framing and spread speed. High-emotion stories travelled faster but carried weaker evidence — requiring elevated evidentiary thresholds.

Narrative Tracking

The Correction Deficit

71%

of corrections were never carried by downstream outlets. Once a claim enters the information ecosystem, corrections rarely propagate — the original claim persists.

Source Weighting

The Anonymous Source Discount

31%

of claims were anonymously sourced. The engine applies a systematic 15% credibility discount to anonymous claims — calibrated against cases where sources were later identified.
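
As a worked example, a flat 15% discount composes multiplicatively with a base credibility score. The 15% figure comes from the experiment above; the base score and the function shape are illustrative.

```typescript
// Apply the flat 15% anonymous-sourcing discount to a base score.
const ANONYMOUS_DISCOUNT = 0.15;

function discountedCredibility(base: number, anonymous: boolean): number {
  return anonymous ? base * (1 - ANONYMOUS_DISCOUNT) : base;
}

console.log(discountedCredibility(0.8, true));  // ~0.68
console.log(discountedCredibility(0.8, false)); // 0.8
```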

Claim Validation

The Narrative Lock-In

23%

of cases showed narrative lock-in — once a dominant interpretation formed, contradictory evidence was filtered out or underweighted by subsequent coverage.


Edition at a Glance

567
Articles Assessed
1,222
Claims Extracted
357
Sources Verified
67
Providers Tracked
12
Categories Covered
24%
Supported
Verdict breakdown: 24% Supported, 25% Misleading, 51% Unverifiable

API Reference

Programmatic access to Evidtrace intelligence. All endpoints return JSON with full provenance metadata.

GET
/api/articles

Query assessed articles with filtering by date, category, provider, and verdict. Returns full article metadata with linked claim IDs.

GET
/api/claims

Search verified claims by keyword, provider, or verdict status. Each claim includes its full verification trail and confidence score.

GET
/api/providers

Provider credibility profiles with historical scores, claim counts, category presence, and trend data across editions.

GET
/api/meta

Edition metadata including assessment counts, date range, methodology version, and aggregate statistics.
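
A minimal usage sketch against the claims endpoint. The path matches the reference above, but the query parameter names, base URL, and response shape are assumptions.

```typescript
// Query verified claims for one provider, filtered by verdict status.
// Parameter names and response shape are assumed, not documented here.
async function fetchClaims(provider: string, verdict: string): Promise<unknown> {
  const url = new URL("https://evidtrace.com/api/claims");
  url.searchParams.set("provider", provider);
  url.searchParams.set("verdict", verdict);

  const res = await fetch(url);
  if (!res.ok) throw new Error(`Evidtrace API error: ${res.status}`);
  return res.json(); // JSON with full provenance metadata
}

fetchClaims("ExampleAI", "unverifiable").then((claims) => console.log(claims));
```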

EU AI Act Relevance

How Evidtrace supports deployer obligations under Regulation (EU) 2024/1689

Important

Evidtrace is supplementary vendor intelligence designed to inform procurement decisions. It does not satisfy deployer obligations under Article 26 on its own and should be used alongside task-specific validation, operational controls, and legal review.

Article 26 — Deployer Obligations

The EU AI Act imposes specific obligations on deployers of high-risk AI systems. Article 26 requires deployers to:

  • Take appropriate technical and organisational measures to ensure use in accordance with instructions
  • Assign human oversight to natural persons with necessary competence, training, and authority
  • Ensure input data is relevant and sufficiently representative for the intended purpose
  • Monitor operation and report to providers and market surveillance authorities where appropriate
  • Keep logs automatically generated by the AI system for at least six months
  • Inform affected employees and workers’ representatives when high-risk AI systems are used in the workplace
  • Inform affected persons when the system assists decisions about them
  • Accommodate the right to explanation in relevant cases
  • Comply with database registration obligations (for public authority deployers)

Fundamental Rights Impact Assessment

Certain deployers — notably public authorities and organisations providing public services — must conduct fundamental rights impact assessments prior to first use of high-risk AI systems.

Enforcement Timeline

February 2025 — Prohibited AI practices enforcement begins
August 2025 — General-purpose AI model obligations apply
August 2026 — High-risk system requirements (Annex III) and most remaining provisions
August 2027 — High-risk AI embedded in regulated products (Annex I/II)

Source: EU AI Act — Article 26, Regulation (EU) 2024/1689

Correction Policy

When errors are identified in Evidtrace assessments — whether by internal review, external scrutiny, or vendor response — they are corrected in the next edition. A changelog is maintained for each edition documenting material corrections, additions, and removals.

Vendors may submit corrections, context, or rebuttals via hello@evidtrace.com. Vendor responses are reviewed and, where appropriate, noted in subsequent assessments.

Previous editions remain available for audit trail purposes. Correction history is part of the assessment record.