Methodology
How Stateside Daily turns ~100 headlines an hour from 105+ publishers into the comparison view you see on the home page. This document covers ingestion, deduplication, classification, ranking, and the mechanics that protect the integrity of each.
1. Ingestion pipeline
Every 15 minutes, an ingestion worker fetches the RSS feed of every active publisher in our tenant source list. Each fetch sends a recognizable browser user-agent, since some publishers put anti-bot walls in front of their public RSS, and enforces a 12-second timeout so one slow feed can't block the rest.
Items returned by each feed go through a small parser that extracts:
- Title — stripped of HTML, normalized whitespace, truncated to 280 characters.
- Canonical URL — the publisher's permalink for the article. We follow redirects so tracking-parameter variations don't create duplicates.
- Summary — the first ~500 characters of the item's description, HTML-stripped. Used as a short fair-use excerpt on cards.
- Image URL — extracted from `media:content`, `enclosure`, or `og:image` in the article's HTML head if RSS didn't carry one.
- Published timestamp — parsed from `pubDate` (RSS) or `updated`/`published` (Atom). Falls back to the fetch time if the publisher doesn't supply one.
- Category — assigned by mapping the source's declared topic in our config to one of our taxonomy slugs (politics, business, markets, technology, etc.).
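A minimal sketch of the normalization steps above. The field names and helpers are illustrative, not our production code, and the timestamp fallback mirrors the rule described for `pubDate`/`updated`:

```python
import html
import re
from datetime import datetime, timezone

def normalize_item(raw: dict, fetched_at: datetime) -> dict:
    """Normalize one RSS/Atom item into the stored shape (sketch)."""
    # Title: strip tags, collapse whitespace, truncate to 280 chars.
    title = re.sub(r"<[^>]+>", "", raw.get("title", ""))
    title = re.sub(r"\s+", " ", html.unescape(title)).strip()[:280]

    # Summary: HTML-stripped first ~500 chars of the description.
    summary = re.sub(r"<[^>]+>", "", raw.get("description", ""))
    summary = re.sub(r"\s+", " ", html.unescape(summary)).strip()[:500]

    # Published timestamp: publisher-supplied, else fall back to fetch time.
    published = raw.get("pubDate") or raw.get("updated") or fetched_at

    return {"title": title, "summary": summary, "published": published}

item = normalize_item(
    {"title": "  <b>Court&nbsp;rejects</b>  appeal ", "description": "<p>Details</p>"},
    fetched_at=datetime.now(timezone.utc),
)
```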
Items already present in the database (matched by canonical URL) are updated in place — title corrections and image swaps from the publisher propagate within 15 minutes.
We never store article bodies. We store only the title, short summary, image URL, and link. Click-through goes to the publisher.
2. Source classification
Every publisher in our list carries two ratings, both reviewed quarterly:
Political lean (5-point scale)
left, lean-left, center, lean-right, right — derived from the consensus of independent media-bias monitors (AllSides, Ad Fontes, NewsGuard) where they agree, and averaged where they don't. We do not assign a lean by reading any single article; we use the publisher-level consensus rating that these monitors derive over months of coverage.
Cases where monitors disagree significantly — typically where one rates a source as lean-right and another as right — are flagged and resolved by sampling 30 randomly-selected stories from the source's last 90 days, scoring on framing language and topic-selection bias, and breaking the tie.
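One way to picture the consensus step. The numeric mapping and rounding below are assumptions for illustration; the document only says ratings are taken where monitors agree and averaged where they don't:

```python
# Illustrative only: place the 5-point scale on a number line and average.
LEAN_SCALE = ["left", "lean-left", "center", "lean-right", "right"]

def consensus_lean(monitor_ratings: list[str]) -> str:
    """Average independent monitors' publisher-level ratings (sketch)."""
    scores = [LEAN_SCALE.index(r) for r in monitor_ratings]
    avg = round(sum(scores) / len(scores))  # snap to the nearest bucket
    return LEAN_SCALE[avg]

# Two monitors say lean-right, one says right: the average lands on lean-right.
result = consensus_lean(["lean-right", "right", "lean-right"])
```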
Reliability (3-point scale)
high, mixed, low — based on factual accuracy track record, sourcing discipline, and willingness to correct. Sources with documented patterns of fabrication or persistent editorial-as-news framing are rated low and downweighted in ranking. They are not excluded — we include the full picture so readers can see the spectrum — but they contribute less to hot-score and are visibly labeled.
3. Cluster algorithm
When 3+ publishers cover the same event, we group their coverage into a single cluster page that displays every source side-by-side. Clustering happens in two passes:
Pass 1 — Cluster-ID hash match
For each headline we extract the lead noun phrase + key entity (named person, organization, or place) and hash them. Headlines with matching hashes group immediately. This catches the easy case: most publishers covering the same event lead with the same subject and verb.
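Pass 1 can be sketched as follows. Real entity extraction would need an NER step, which is elided here; the caller supplies the lead phrase and key entity already extracted:

```python
import hashlib

def cluster_id(lead_phrase: str, key_entity: str) -> str:
    """Pass 1 (sketch): hash the lead noun phrase + key entity.

    Normalization (lowercasing, trimming) happens before hashing so
    superficial variations still collide into the same cluster.
    """
    key = f"{lead_phrase.strip().lower()}|{key_entity.strip().lower()}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]

# Two headlines leading with the same subject hash identically:
a = cluster_id("fed raises rates", "federal reserve")
b = cluster_id("Fed raises rates ", "Federal Reserve")
```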
Pass 2 — Title-signature merge
Headlines that didn't match in pass 1 go through a second pass that extracts the top-5 significant tokens (excluding stop words and topic markers like "BREAKING"), sorts them alphabetically, and joins them. Two headlines whose signatures share ≥80% of tokens are merged into the same cluster.
This catches the harder case: when the New York Times leads with "Court rejects" and Reuters leads with "Judge denies" on the same ruling, the entity tokens (the case name, the agency, the date) overlap enough to merge.
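A sketch of the pass-2 signature comparison. Token selection here is simplified (alphabetical first five after stop-word removal) and the ≥80% threshold is computed as Jaccard-style overlap, which is our reading of "share ≥80% of tokens":

```python
# Stop words and topic markers are illustrative, not the production list.
STOP = {"the", "a", "an", "of", "to", "in", "on", "and", "breaking", "live"}

def signature(title: str) -> set[str]:
    """Top-5 significant tokens; stored as a set since order doesn't matter."""
    tokens = [t for t in title.lower().split() if t.isalpha() and t not in STOP]
    return set(sorted(tokens)[:5])

def should_merge(t1: str, t2: str, threshold: float = 0.8) -> bool:
    s1, s2 = signature(t1), signature(t2)
    overlap = len(s1 & s2) / max(len(s1 | s2), 1)  # Jaccard-style overlap
    return overlap >= threshold
```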
We deliberately err on the side of not merging: false positives (joining unrelated stories) are worse than false negatives (showing the same story in two clusters) because the comparison view is meaningless if the cluster contains stories about different events.
4. Hot-score ranking
The home page is ordered by a hot-score, not by an editor's pick. Each cluster's score is computed as:
score = ln(sources)
      + ln(1 + clicks) × 0.3
      − hours_since_first_seen × 0.4
      + reliability_weight

In plain language: more independent sources covering it boosts the cluster, click-through is a soft secondary signal, recency decay pulls older stories down, and the average reliability of the contributing sources nudges the score up or down. The coefficients are calibrated to keep the home page lively while not letting low-reliability viral content dominate.
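The formula transcribes directly into code. The reliability weights below are invented for illustration (the document doesn't publish exact values), and averaging them across contributing sources is an assumption about the mechanics:

```python
import math

# Illustrative weights; actual values are not published.
RELIABILITY_WEIGHT = {"high": 0.5, "mixed": 0.0, "low": -0.5}

def hot_score(sources: int, clicks: int, hours_since_first_seen: float,
              reliabilities: list[str]) -> float:
    """Hot-score as given above; the last term averages source reliability."""
    rel = sum(RELIABILITY_WEIGHT[r] for r in reliabilities) / len(reliabilities)
    return (math.log(sources)                      # ln(sources)
            + math.log(1 + clicks) * 0.3           # soft click signal
            - hours_since_first_seen * 0.4         # recency decay
            + rel)                                 # reliability nudge

# A fresh 6-source cluster outranks a day-old 2-source one despite fewer clicks:
fresh = hot_score(6, 0, 1.0, ["high"] * 6)
stale = hot_score(2, 100, 24.0, ["mixed"] * 2)
```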
We do not rerank for political balance. The spectrum bar on each cluster page exists to make the distribution of coverage visible — it is the transparency mechanism, not a thumb on the scale.
5. Cluster-page composition
On a cluster page (/cluster/[id]), readers see:
- The lead headline and image, picked from the highest-reliability source in the cluster.
- A spectrum bar visualizing how the {N} sources in the cluster distribute across left / center / right.
- Every contributing source, sorted left → center → right, with their headline, lean chip, reliability rating, and `publishedAt` timestamp.
- Each source row links out to the publisher's article via our `/go/[id]` redirect, which records the click anonymously and 302s to the canonical URL.
- A "more from {category}" section linking to four related clusters in the same topic.
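The spectrum bar reduces to a bucket count over the contributing sources' leans. How the 5-point scale collapses into the bar's three buckets is an assumption here (lean-left/lean-right could plausibly be shown separately):

```python
from collections import Counter

def spectrum(leans: list[str]) -> dict[str, int]:
    """Bucket per-source leans into the bar's left / center / right (sketch)."""
    bucket = {"left": "left", "lean-left": "left", "center": "center",
              "lean-right": "right", "right": "right"}
    return dict(Counter(bucket[l] for l in leans))

bar = spectrum(["left", "lean-left", "center", "right"])
```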
6. Updates and corrections
Source classifications are reviewed quarterly. Reader-submitted disputes — with evidence, ideally citing a recent monitor update or a specific batch of misclassified articles — are reviewed within 7 business days. Sustained dispute patterns trigger a full re-audit of the source.
Cluster errors (joined unrelated stories, split related ones) get fixed quickly: the cluster is rebuilt from the underlying headlines and the algorithm tuned if a class of errors keeps appearing.
Submit disputes to corrections@statesidedaily.com. See the corrections policy for the full SLA.
7. What we are not doing
For the avoidance of doubt: Stateside Daily is not generating article bodies, summaries, headlines, or analysis with AI from scratch. Our work is the analysis layer over publisher RSS — the clustering, classification, and ranking documented above. The text on cluster pages is publisher-authored and excerpted under fair use; readers click through to read the full article on the publisher's site.
8. Open questions
Methodology is never finished. Things we're actively working on:
- Better clustering for opinion vs. news. Currently opinion pieces about an event can join the news cluster. We're building a separate opinion lane.
- Per-story bias signal. Beyond publisher-level lean, individual stories can deviate (a center-left publisher can run a center-right op-ed). Adding a story-level signal is on the roadmap.
- Coverage gap detection. When a major story is covered by all-left or all-right sources only, we want to surface that asymmetry as a reader signal. Currently we just show the spectrum bar; the asymmetry has to be inferred.
If you have questions or methodology feedback, write to feedback@statesidedaily.com.