
Methodology

SCAND.Ai is an automated monitoring system that detects, scores, and tracks AI-industry controversies around the clock. This page documents exactly how the numbers are produced. The factor weights below are rendered from the same configuration file the scoring engine reads — what you see here is what runs.

The noise score (0–100)

Every tracked controversy carries a noise score from 0 to 100 — a weighted composite of seven measured factors, recalculated hourly:

Factor | Weight | What it measures
Reach | 0.20 | Audience size reached across platforms (log10 scale: 10× the reach ≈ one step up, not ten)
Engagement | 0.20 | Likes, reposts, comments, upvotes, normalized per platform (log10 scale)
Star power | 0.15 | Influence tier of the public figures involved (S/A/B/C tiers)
Duration | 0.10 | How long the story has sustained coverage (capped at 72h)
Cross-platform spread | 0.15 | Number of independent platforms covering the story
Polarity | 0.10 | How divided the reaction is (one-sided vs. contested)
Industry impact | 0.10 | Model-assessed consequence for the AI industry (jobs, policy, products)
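
As a rough illustration of the weighted composite described above, the sketch below combines seven per-factor sub-scores using the listed weights. It is not the production scoring engine. The weights come from the table; everything else is an assumption made for the example: that each factor is first normalized to a 0–100 sub-score, the `log_subscore` helper, and its `max_value` ceiling of 10^8.

```python
import math

# Weights as listed in the table above; they sum to 1.0.
WEIGHTS = {
    "reach": 0.20,
    "engagement": 0.20,
    "star_power": 0.15,
    "duration": 0.10,
    "cross_platform_spread": 0.15,
    "polarity": 0.10,
    "industry_impact": 0.10,
}


def log_subscore(value: float, max_value: float = 1e8) -> float:
    """Map a raw count onto 0-100 on a log10 scale.

    Hypothetical helper: the ceiling (max_value) is an assumption,
    not a documented parameter of the scoring engine.
    """
    if value <= 1:
        return 0.0
    return min(100.0, 100.0 * math.log10(value) / math.log10(max_value))


def noise_score(subscores: dict[str, float]) -> float:
    """Weighted sum of per-factor sub-scores, each assumed to be on 0-100."""
    if set(subscores) != set(WEIGHTS):
        raise ValueError("expected exactly the seven factors in WEIGHTS")
    total = sum(WEIGHTS[name] * subscores[name] for name in WEIGHTS)
    # Clamp to guard against floating-point drift at the boundaries.
    return max(0.0, min(100.0, total))
```

With this assumed ceiling, each 10× increase in reach adds the same fixed increment to the sub-score (12.5 points), which is the behavior the log10 note in the table describes.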

Scores decay with a 7-day half-life: a story that stops generating new coverage loses half its remaining noise every 7 days. A noise score is therefore a measurement of current loudness, not historical importance. The star power factor is scored in tier-scaled points: S = 30, A = 20, B = 10, C = 5.
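
The 7-day half-life can be written as a one-line exponential. This is a minimal sketch that assumes decay is applied to the score as a function of time since the last new coverage; the function and parameter names are illustrative, not taken from the scoring engine.

```python
HALF_LIFE_DAYS = 7.0  # from the text: half the remaining noise is lost every 7 days


def decayed_noise(score: float, days_without_coverage: float) -> float:
    """Apply exponential half-life decay to a noise score."""
    if days_without_coverage < 0:
        raise ValueError("days_without_coverage must be non-negative")
    return score * 0.5 ** (days_without_coverage / HALF_LIFE_DAYS)


# A story at 80 that goes quiet: 40 after one week, 20 after two.
print(decayed_noise(80.0, 7.0))
print(decayed_noise(80.0, 14.0))
```

Under the same assumption, an hourly recalculation would multiply the score by 0.5 ** (1 / 168), since a 7-day half-life spans 168 hours.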

The state machine

Each topic moves through five states: trending → spike → controversy → scandal → resolved. Transitions are rule-based, not editorial: a spike requires sustained content velocity, a controversy requires coverage on at least 2 independent platforms, and a scandal requires noise ≥ 75 held for at least 4 hours. Each state has a cooldown (12–48h) and hysteresis decay, so topics do not flap between states when the noise score fluctuates around a threshold.
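
The escalation rules above can be sketched as a small transition function. Only the thresholds stated in the text (2 independent platforms, noise ≥ 75 held for 4 hours) come from the source. The `TopicSnapshot` fields, the boolean `sustained_velocity` flag, and the choice to block any transition while a cooldown is active are assumptions for illustration. Hysteresis decay and the rule for entering "resolved" are not specified on this page and are left out.

```python
from dataclasses import dataclass

STATES = ["trending", "spike", "controversy", "scandal", "resolved"]


@dataclass
class TopicSnapshot:
    """Hypothetical view of a topic at evaluation time."""
    state: str
    sustained_velocity: bool           # stand-in for the unspecified velocity rule
    independent_platforms: int
    hours_noise_at_or_above_75: float
    hours_in_state: float
    cooldown_hours: float              # 12-48h per the text; per-state values not given


def next_state(topic: TopicSnapshot) -> str:
    """Return the state after one evaluation, escalating at most one step."""
    if topic.state not in STATES:
        raise ValueError(f"unknown state: {topic.state}")
    # Assumption: no transition while the current state's cooldown is running.
    if topic.hours_in_state < topic.cooldown_hours:
        return topic.state
    if topic.state == "trending" and topic.sustained_velocity:
        return "spike"
    if topic.state == "spike" and topic.independent_platforms >= 2:
        return "controversy"
    if topic.state == "controversy" and topic.hours_noise_at_or_above_75 >= 4:
        return "scandal"
    return topic.state
```

Escalating one step per evaluation keeps the sketch aligned with the linear state order in the text; whether the real engine can skip states is not documented here.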

Sources

Content is ingested continuously from public sources:

Source | Coverage | Crawl cadence
Hacker News | front page + new, AI-filtered | every 15 min
Google News (via search API) | rotating AI-controversy queries | every 15 min
Reddit | 29 AI and tech subreddits | every 30 min
RSS / Atom | 26 feeds: tech press, AI labs, arXiv | every 30 min
Bluesky | AI-discourse search | every 30 min
Polymarket | AI prediction markets | every 30 min

Every story page lists the specific source items it is grounded in — see the "Sources" block on any topic.

Deduplication (six layers)

The same story arrives many times from many platforms. Before anything is published, content passes a six-layer dedup pipeline:

  1. Exact fingerprint — SHA-256 content hash.
  2. URL identity — cross-source URL hash.
  3. Semantic similarity — 384-dimension embeddings with calibrated cosine thresholds (auto-merge only above 0.99).
  4. Lexical near-duplicate — SimHash with Hamming distance ≤ 6.
  5. Entity-window dedup — same entities within a 6-hour clustering window.
  6. LLM judge — borderline pairs are adjudicated by a language model; uncertain cases escalate to human review rather than auto-merging.
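
Three of the layers above reduce to simple comparisons, sketched below with the standard library only. The thresholds (Hamming distance ≤ 6, cosine similarity above 0.99) are the ones listed; the function names are illustrative, hashing the raw text without normalization is an assumption, and the SimHash fingerprints and embeddings themselves are taken as given inputs rather than computed here.

```python
import hashlib
import math


def content_fingerprint(text: str) -> str:
    """Layer 1: SHA-256 hash of the content (no normalization assumed)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")


def is_lexical_near_duplicate(simhash_a: int, simhash_b: int, max_distance: int = 6) -> bool:
    """Layer 4: SimHash fingerprints within Hamming distance 6 count as near-duplicates."""
    return hamming_distance(simhash_a, simhash_b) <= max_distance


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def should_auto_merge(u: list[float], v: list[float], threshold: float = 0.99) -> bool:
    """Layer 3: auto-merge only when similarity is strictly above the threshold."""
    return cosine_similarity(u, v) > threshold
```

Pairs that fall below the auto-merge threshold but still look similar would go on to the later layers, ending with the LLM judge and human review described in step 6.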

Analysis & safety

Clustered stories are analyzed by a language model (Google Gemini) that produces the summary, key points, parties, and forecast shown on each page. All generated text passes an allegation-language validator before publication: unadjudicated conduct must be attributed to the party alleging it ("accused of X by Y", "according to the complaint") and may never be stated as established fact. Analyses that fail validation are not published. Stories carry a safety version stamp, and only validated stories enter our feeds and machine-readable exports. Errors are handled under our corrections policy: substantiated factual errors are corrected or removed within 48 hours.

Update cadence

  • Ingestion: every 15–30 minutes per source (table above).
  • Noise recalculation: hourly, across all active topics.
  • Analysis: triggered when a new cluster forms or an existing story escalates.
  • Decay & state transitions: evaluated hourly with the cooldowns described above.

Citing SCAND.Ai

Cite the topic URL and the date — noise scores change over time by design, so a citation should always carry its "as of" date. Machine-liftable statistics live on /stats; the downloadable dataset and its license are described on /data.

See also: About & editorial process · Corrections policy · Live feed