Methodology
SCAND.Ai is an automated monitoring system that detects, scores, and tracks AI-industry controversies around the clock. This page documents exactly how the numbers are produced. The factor weights below are rendered from the same configuration file the scoring engine reads — what you see here is what runs.
The noise score (0–100)
Every tracked controversy carries a noise score from 0 to 100 — a weighted composite of seven measured factors, recalculated hourly:
| Factor | Weight | What it measures |
|---|---|---|
| Reach | 0.20 | Audience size reached across platforms (log10 scale — 10× the reach ≈ one step up, not ten) |
| Engagement | 0.20 | Likes, reposts, comments, upvotes — normalized per platform (log10 scale) |
| Star power | 0.15 | Influence tier of the public figures involved (S/A/B/C tiers) |
| Duration | 0.10 | How long the story has sustained coverage (capped at 72h) |
| Cross-platform spread | 0.15 | Number of independent platforms covering the story |
| Polarity | 0.10 | How divided the reaction is (one-sided vs. contested) |
| Industry impact | 0.10 | Model-assessed consequence for the AI industry (jobs, policy, products) |
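The weighted composite described by the table above can be sketched roughly as follows. This is a minimal illustration, not the engine's actual code: the weights are copied from the table, but the function name `noise_score`, the factor keys, and the assumption that every factor arrives already normalized to the range 0 to 1 are hypothetical. The page does not describe the normalization step itself.

```python
# Weights copied from the factor table; they sum to 1.0.
WEIGHTS = {
    "reach": 0.20,
    "engagement": 0.20,
    "star_power": 0.15,
    "duration": 0.10,
    "cross_platform_spread": 0.15,
    "polarity": 0.10,
    "industry_impact": 0.10,
}


def noise_score(factors: dict[str, float]) -> float:
    """Return a weighted composite on a 0-100 scale.

    Assumption (not stated on this page): each factor value has
    already been normalized to the range 0..1 before weighting.
    """
    missing = WEIGHTS.keys() - factors.keys()
    if missing:
        raise ValueError(f"missing factors: {sorted(missing)}")
    # Clamp each factor to 0..1 so a bad input cannot push the score out of range.
    total = sum(
        WEIGHTS[name] * min(max(factors[name], 0.0), 1.0) for name in WEIGHTS
    )
    return round(100.0 * total, 2)
```

Under these assumptions, a topic that maxes out every factor scores 100, and one that sits at the midpoint of every factor scores 50.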
Star power contributes tier-scaled points according to the highest-level description in the table: S = 30, A = 20, B = 10, C = 5. Scores decay with a 7-day half-life: a story that stops generating new coverage loses half its remaining noise every 7 days. A noise score is therefore a measurement of current loudness, not historical importance.
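The 7-day half-life can be illustrated with a short sketch. It assumes a continuous exponential decay measured from the moment new coverage stops; the function name `decayed_noise` and its parameters are hypothetical, and the real engine may apply decay in hourly steps instead.

```python
HALF_LIFE_DAYS = 7.0  # half-life stated in the text above


def decayed_noise(noise: float, days_without_coverage: float) -> float:
    """Apply exponential decay with a 7-day half-life.

    Assumption: decay is continuous and starts when new coverage stops.
    """
    if days_without_coverage < 0:
        raise ValueError("days_without_coverage must be non-negative")
    return noise * 0.5 ** (days_without_coverage / HALF_LIFE_DAYS)
```

Under this model a story at noise 80 falls to 40 after one quiet week and to 20 after two.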
The state machine
Each topic moves through five states: trending → spike → controversy → scandal → resolved. Transitions are rule-based, not editorial: a spike requires sustained content velocity, a controversy requires coverage on at least 2 independent platforms, and a scandal requires noise ≥ 75 held for at least 4 hours. Each state has a cooldown (12–48h) and hysteresis decay, so topics do not flap between states on noise fluctuations.
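Two of the transition rules above have stated thresholds and can be sketched directly. The following is an illustrative sketch only: the function names and the shape of the `samples` input (timestamped noise readings, sorted ascending) are assumptions, and cooldowns, hysteresis, and the spike rule (whose velocity threshold is not given here) are left out.

```python
from datetime import datetime, timedelta

SCANDAL_THRESHOLD = 75.0            # noise >= 75, from the text above
SCANDAL_HOLD = timedelta(hours=4)   # held for at least 4 hours
MIN_CONTROVERSY_PLATFORMS = 2       # at least 2 independent platforms


def qualifies_as_controversy(platforms: set[str]) -> bool:
    """True when coverage spans enough independent platforms."""
    return len(platforms) >= MIN_CONTROVERSY_PLATFORMS


def qualifies_as_scandal(
    samples: list[tuple[datetime, float]], now: datetime
) -> bool:
    """True when noise has stayed at or above the threshold for the hold period.

    Assumption: `samples` holds (timestamp, noise) pairs sorted by time,
    for example one reading per hourly recalculation.
    """
    held_since = None
    for timestamp, noise in samples:
        if timestamp > now:
            break
        if noise >= SCANDAL_THRESHOLD:
            if held_since is None:
                held_since = timestamp  # start of the current unbroken run
        else:
            held_since = None  # any dip below the threshold resets the run
    return held_since is not None and now - held_since >= SCANDAL_HOLD
```

In this sketch a single reading below 75 resets the clock, which is one plausible reading of "held for at least 4 hours".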
Sources
Content is ingested continuously from public sources:
| Source | Coverage | Crawl cadence |
|---|---|---|
| Hacker News | front page + new, AI-filtered | every 15 min |
| Google News (via search API) | rotating AI-controversy queries | every 15 min |
| Reddit | 29 AI and tech subreddits | every 30 min |
| RSS / Atom | 26 feeds: tech press, AI labs, ArXiv | every 30 min |
| Bluesky | AI-discourse search | every 30 min |
| Polymarket | AI prediction markets | every 30 min |
Every story page lists the specific source items it is grounded in — see the "Sources" block on any topic.
Deduplication (six layers)
The same story arrives many times from many platforms. Before anything is published, content passes a six-layer dedup pipeline:
- Exact fingerprint — SHA-256 content hash.
- URL identity — cross-source URL hash.
- Semantic similarity — 384-dimension embeddings with calibrated cosine thresholds (auto-merge only above 0.99).
- Lexical near-duplicate — SimHash with Hamming distance ≤ 6.
- Entity-window dedup — same entities within a 6-hour clustering window.
- LLM judge — borderline pairs are adjudicated by a language model; uncertain cases escalate to human review rather than auto-merging.
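Three of the layers above reduce to simple comparisons, sketched below using only the Python standard library. The thresholds come from the list; the function names are hypothetical, the fingerprint hashes the raw text without any normalization (the page does not say whether the pipeline normalizes first), and the sketch assumes SimHash values and embeddings are computed elsewhere.

```python
import hashlib
import math

SIMHASH_MAX_DISTANCE = 6      # Hamming distance <= 6, from the list above
AUTO_MERGE_THRESHOLD = 0.99   # auto-merge only above 0.99 cosine similarity


def content_fingerprint(text: str) -> str:
    """Exact fingerprint: SHA-256 hex digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def is_lexical_near_duplicate(simhash_a: int, simhash_b: int) -> bool:
    """Compare two precomputed SimHash values against the stated threshold."""
    return hamming_distance(simhash_a, simhash_b) <= SIMHASH_MAX_DISTANCE


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    if norm == 0.0:
        raise ValueError("zero-length vector")
    return dot / norm


def should_auto_merge(u: list[float], v: list[float]) -> bool:
    """Semantic layer: merge automatically only above the 0.99 threshold."""
    return cosine_similarity(u, v) > AUTO_MERGE_THRESHOLD
```

Pairs that fall below the auto-merge threshold but still look similar would, per the list above, move on to the later layers rather than being merged here.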
Analysis & safety
Clustered stories are analyzed by a language model (Google Gemini) that produces the summary, key points, parties, and forecast shown on each page. Every generated text passes an allegation-language validator before publication: unadjudicated conduct must be attributed to the party alleging it ("accused of X by Y", "according to the complaint") and may never be stated as established fact. Analyses that fail validation are not published. Stories carry a safety version stamp, and only validated stories enter our feeds and machine-readable exports. Errors are handled under our corrections policy — substantiated factual errors are corrected or removed within 48 hours.
Update cadence
- Ingestion: every 15–30 minutes per source (table above).
- Noise recalculation: hourly, across all active topics.
- Analysis: triggered when a new cluster forms or an existing story escalates.
- Decay & state transitions: evaluated hourly with the cooldowns described above.
Citing SCAND.Ai
Cite the topic URL and the date: noise scores change over time by design, so a citation should always carry its "as of" date. Machine-readable statistics live on /stats; the downloadable dataset and its license are described on /data.
See also: About & editorial process · Corrections policy · Live feed