Methodology

SCAND.Ai is an automated monitoring system that detects, scores, and tracks AI-industry controversies around the clock. This page documents exactly how the numbers are produced. The factor weights below are rendered from the same configuration file the scoring engine reads — what you see here is what runs.

The noise score (0–100)

Every tracked controversy carries a noise score from 0 to 100 — a weighted composite of seven measured factors, recalculated hourly:

Factor	Weight	What it measures
Reach	`0.20`	Audience size reached across platforms (log10 scale — 10× the reach ≈ one step up, not ten)
Engagement	`0.20`	Likes, reposts, comments, upvotes — normalized per platform (log10 scale)
Star power	`0.15`	Influence tier of the public figures involved (S/A/B/C tiers)
Duration	`0.10`	How long the story has sustained coverage (capped at 72h)
Cross-platform spread	`0.15`	Number of independent platforms covering the story
Polarity	`0.10`	How divided the reaction is (one-sided vs. contested)
Industry impact	`0.10`	Model-assessed consequence for the AI industry (jobs, policy, products)
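
The weighted composite in the table can be sketched as follows. This is a minimal illustration, not the production engine: it assumes each factor has already been normalized to a subscore between 0 and 1, and the `log_subscore` helper with its `ceiling` value is a hypothetical stand-in for the log10 scaling of reach and engagement. Only the weights come from the table.

```python
import math

# Weights copied from the factor table; they sum to 1.0.
WEIGHTS = {
    "reach": 0.20,
    "engagement": 0.20,
    "star_power": 0.15,
    "duration": 0.10,
    "cross_platform": 0.15,
    "polarity": 0.10,
    "industry_impact": 0.10,
}


def log_subscore(raw: float, ceiling: float = 1e7) -> float:
    """Map a raw count to [0, 1] on a log10 scale.

    The ceiling is an assumed cap chosen for illustration only.
    """
    if raw <= 1:
        return 0.0
    return min(1.0, math.log10(raw) / math.log10(ceiling))


def noise_score(subscores: dict[str, float]) -> float:
    """Weighted composite of per-factor subscores in [0, 1], scaled to 0-100."""
    total = sum(
        WEIGHTS[name] * min(1.0, max(0.0, subscores[name])) for name in WEIGHTS
    )
    return round(100 * total, 1)
```

On this assumed scale, multiplying reach by ten moves the reach subscore up by one fixed step (1/7 of the range when the ceiling is ten million), which is the behavior the table describes for log10 factors.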

Star power contributes tier-scaled points: S = 30, A = 20, B = 10, C = 5. Scores decay with a 7-day half-life: a story that stops generating new coverage loses half its remaining noise every 7 days. A noise score is therefore a measurement of current loudness, not historical importance.
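
The half-life rule corresponds to standard exponential decay. The sketch below assumes decay is measured in days since the last new coverage; the function and parameter names are illustrative, not taken from the scoring engine.

```python
def decayed_noise(
    score: float, days_without_coverage: float, half_life_days: float = 7.0
) -> float:
    """Apply half-life decay: the score halves every `half_life_days`."""
    return score * 0.5 ** (days_without_coverage / half_life_days)
```

Under this formula a story at 80 that goes quiet sits at 40 after one week and at 20 after two.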

The AI Controversy Index ranking

The AI Controversy Index ranks controversies by peak noise: the highest score a story reached during the period, not its current (decayed) noise. Peak captures historical significance. Resolved controversies are included, so a resolved scandal that peaked at 90 outranks an ongoing story at 45.

Scope: a controversy appears in a period (year, quarter, or month) if it was first detected within it; it is ranked by the peak noise it reached in that window (or by its current noise score, when no in-window measurement exists).
Floor: entries below a peak of 25/100 are excluded as a sparse-period backstop. The real quality gate is that every ranked item has a published analysis and has passed safety review.
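Tie-break: when peaks are equal, entries with a measured in-window peak rank ahead of those relying on the current-score fallback, and remaining ties go to the earliest detection. The ordering is deterministic, so the ranking never reshuffles arbitrarily.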
Editions: living annual (top 25), quarterly (top 20), monthly (top 15).
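
The scope, floor, tie-break, and edition rules above can be sketched as one sort. This is an illustration under stated assumptions: the `Entry` type and its field names are hypothetical, the caller is assumed to have already restricted the list to controversies first detected in the period that have a published analysis and passed safety review, and "measured-peak first" is read as "entries with an in-window measurement rank ahead of fallback entries".

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

PEAK_FLOOR = 25.0  # entries below this peak are excluded
EDITION_SIZE = {"annual": 25, "quarterly": 20, "monthly": 15}


@dataclass
class Entry:
    slug: str
    first_detected: datetime
    measured_peak: Optional[float]  # None when no in-window measurement exists
    current_noise: float

    @property
    def ranking_value(self) -> float:
        """Measured peak when available, otherwise the current noise score."""
        if self.measured_peak is not None:
            return self.measured_peak
        return self.current_noise


def rank(entries: list[Entry], edition: str) -> list[Entry]:
    """Apply the floor, sort deterministically, and keep the edition's top N."""
    eligible = [e for e in entries if e.ranking_value >= PEAK_FLOOR]
    eligible.sort(
        key=lambda e: (
            -e.ranking_value,         # highest peak first
            e.measured_peak is None,  # measured peaks ahead of fallbacks
            e.first_detected,         # then earliest detection
        )
    )
    return eligible[: EDITION_SIZE[edition]]
```

Because every component of the sort key is fixed once a period closes, repeated runs over the same data produce the same order.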

The state machine

Each topic moves through five states: trending → spike → controversy → scandal → resolved. Transitions are rule-based, not editorial: a spike requires sustained content velocity, a controversy requires coverage on at least 2 independent platforms, and a story already at controversy escalates to scandal on extreme velocity — at least 75 new content items flowing within a single hour. Each state has a cooldown (12–48h) and hysteresis decay, so topics do not flap between states on noise fluctuations.
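
The escalation rules can be sketched as a small transition function. Several details here are assumptions made for illustration: the "sustained content velocity" test is passed in as a precomputed boolean because its threshold is not stated, a single 12-hour cooldown stands in for the 12 to 48 hour per-state range, the cooldown is assumed to block escalation as well as de-escalation, and the rule for reaching resolved (along with hysteresis decay) is left out.

```python
from enum import Enum


class State(Enum):
    TRENDING = "trending"
    SPIKE = "spike"
    CONTROVERSY = "controversy"
    SCANDAL = "scandal"
    RESOLVED = "resolved"


MIN_PLATFORMS = 2            # from the text: at least 2 independent platforms
SCANDAL_ITEMS_PER_HOUR = 75  # from the text: extreme velocity threshold
COOLDOWN_HOURS = 12          # assumed: low end of the 12-48h range


def next_state(
    state: State,
    sustained_velocity: bool,
    platforms: int,
    items_last_hour: int,
    hours_in_state: float,
) -> State:
    """Return the state after applying the escalation rules once."""
    if hours_in_state < COOLDOWN_HOURS:
        return state  # still cooling down, so no transition
    if state is State.TRENDING and sustained_velocity:
        return State.SPIKE
    if state is State.SPIKE and platforms >= MIN_PLATFORMS:
        return State.CONTROVERSY
    if state is State.CONTROVERSY and items_last_hour >= SCANDAL_ITEMS_PER_HOUR:
        return State.SCANDAL
    return state
```

Each call advances a topic by at most one state, so a story cannot jump from trending to scandal in a single evaluation under this sketch.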

Sources

Content is ingested continuously from public sources:

Source	Coverage	Crawl cadence
Hacker News	front page + new, AI-filtered	every 15 min
Google News (via search API)	rotating AI-controversy queries	every 15 min
Reddit	29 AI and tech subreddits	every 30 min
RSS / Atom	26 feeds: tech press, AI labs, arXiv	every 30 min
Bluesky	AI-discourse search	every 30 min
Polymarket	AI prediction markets	every 30 min

Every story page lists the specific source items it is grounded in — see the "Sources" block on any topic.

Deduplication (six layers)

The same story arrives many times from many platforms. Before anything is published, content passes a six-layer dedup pipeline:

Exact fingerprint — SHA-256 content hash.
URL identity — cross-source URL hash.
Semantic similarity — 384-dimension embeddings with calibrated cosine thresholds (auto-merge only above 0.99).
Lexical near-duplicate — SimHash with Hamming distance ≤ 6.
Entity-window dedup — same entities within a 6-hour clustering window.
LLM judge — borderline pairs are adjudicated by a language model; uncertain cases escalate to human review rather than auto-merging.
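
Three of the six layers (exact fingerprint, semantic similarity, and lexical near-duplicate) can be sketched with the standard library alone. The thresholds come from the list above; everything else is an assumption for illustration, including the 64-bit SimHash width, whitespace tokenization, and the use of SHA-256 as the per-token hash. Embeddings are taken as precomputed vectors, and the URL, entity-window, and LLM-judge layers are omitted.

```python
import hashlib
import math

SIMHASH_BITS = 64         # assumed width; the text does not state it
MAX_HAMMING = 6           # from the text
AUTO_MERGE_COSINE = 0.99  # from the text


def fingerprint(text: str) -> str:
    """Exact fingerprint: SHA-256 hex digest of the content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def simhash(text: str) -> int:
    """SimHash over whitespace tokens (tokenization is an assumption)."""
    counts = [0] * SIMHASH_BITS
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for bit in range(SIMHASH_BITS):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(SIMHASH_BITS) if counts[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def is_duplicate(
    a_text: str, b_text: str, a_vec: list[float], b_vec: list[float]
) -> bool:
    """Run the three sketched layers in the order listed above."""
    if fingerprint(a_text) == fingerprint(b_text):
        return True
    if cosine(a_vec, b_vec) > AUTO_MERGE_COSINE:
        return True
    return hamming(simhash(a_text), simhash(b_text)) <= MAX_HAMMING
```

Running the cheap exact check first means most repeats never reach the embedding or SimHash comparisons.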

Analysis & safety

Clustered stories are analyzed by a language model (Google Gemini) that produces the summary, key points, parties, and forecast shown on each page. All generated text passes an allegation-language validator before publication: unadjudicated conduct must be attributed to the party alleging it ("accused of X by Y", "according to the complaint") and may never be stated as established fact. Analyses that fail validation are not published. Stories carry a safety version stamp, and only validated stories enter our feeds and machine-readable exports. Errors are handled under our corrections policy: substantiated factual errors are corrected or removed within 48 hours.

Update cadence

Ingestion: every 15–30 minutes per source (table above).
Noise recalculation: hourly, across all active topics.
Analysis: triggered when a new cluster forms or an existing story escalates.
Decay & state transitions: evaluated hourly with the cooldowns described above.

Citing SCAND.Ai

Cite the topic URL and the date — noise scores change over time by design, so a citation should always carry its "as of" date. Machine-liftable statistics live on /stats; the downloadable dataset and its license are described on /data.