Content Audit

Glossary

Definitions, methodology, and behavior reference for the audit pipeline.

Quality Scoring

Quality scoring is a two-step hybrid — deterministic Python metrics, then a tightly-constrained AI verdict.

Step 1: Algorithmic metrics (structural_score)

Computed locally with no AI — fully reproducible run-to-run:

  • word_count — Total words in the extracted main content
  • freshness_days — Days since the page's modified_date (or published_date if no modified)
  • internal_link_count / external_link_count — Links inside the body
  • headings_by_level — Count of H1/H2/H3/etc. headings
  • paragraph_count / avg_paragraph_words — Rough readability proxy
  • readability_flesch — Flesch Reading Ease score (0–100, higher = easier)
  • thin_content — True when word_count < 300
  • structural_completeness — 0–1 score: % of {title, meta_description, H1, canonical, breadcrumbs} present
  • detected_language — ISO 639-1 code from lingua-language-detector run on title + H1 + first 2000 chars of body
  • language_mismatch — True when declared hreflang differs from detected (confidence ≥ 0.85), OR no hreflang and detected is not English

Stored as a JSONB blob in pages.structural_score.
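The thin-content and completeness metrics above reduce to a few lines. This is a minimal sketch assuming a flat page dict; the field names are illustrative, not the pipeline's real schema.

```python
# Hypothetical sketch of two Step 1 metrics; field names are illustrative.
def structural_metrics(page: dict) -> dict:
    words = len(page.get("main_content", "").split())
    elements = ["title", "meta_description", "h1", "canonical", "breadcrumbs"]
    present = sum(1 for e in elements if page.get(e))
    return {
        "word_count": words,
        "thin_content": words < 300,                        # thin when under 300 words
        "structural_completeness": present / len(elements)  # fraction of the 5 elements present
    }

m = structural_metrics({"main_content": "word " * 250, "title": "T", "h1": "H"})
# 250 words → thin; 2 of 5 elements present → completeness 0.4
```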

Step 2: AI verdict (quality_score + quality_verdict)

Claude Sonnet 4.5 receives:

  • The full structural_score JSON above
  • The first 4,000 characters of the page's main content
  • The page's primary tag (first extracted tag, if any)

The prompt explicitly tells the model:

  • Penalize: thin content (<300 words), missing structural elements, outdated content (>365 days), low readability (<30 Flesch), promotional padding, off-topic content.
  • Reward: depth, completeness (title + meta + h1 + canonical + breadcrumbs), recent updates, clear structure, on-topic relevance.

It must return exactly two lines:

8
This page has comprehensive coverage with clear structure and recent updates,
but the meta description could better highlight the unique angle.

The integer becomes pages.quality_score (clamped to 1–10), the second line becomes pages.quality_verdict. Temperature is 0 to maximize consistency between runs.
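A minimal parser for that two-line contract might look like the following; parse_verdict and the response handling are assumptions, not the shipped code.

```python
# Illustrative parser for the two-line AI response; names are assumptions.
def parse_verdict(response: str) -> tuple[int, str]:
    lines = response.strip().split("\n", 1)
    score = max(1, min(10, int(lines[0].strip())))   # clamp to 1-10
    verdict = lines[1].strip() if len(lines) > 1 else ""
    return score, verdict

parse_verdict("12\nGreat page overall.")  # out-of-range integer clamps to 10
```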

Language mismatch — hard cap on score

When language_mismatch=true, the AI score is capped at 3 regardless of other signals, and the verdict text gets a "Language mismatch" prefix. This makes mismatched pages float to the top of the Quality Check view (Pages → Quality) so they get triaged first. A page in the wrong language is functionally broken — fixing it before scoring other dimensions is the right order.

Caching

Once a page has a quality_score, the Score (cached) button skips it. Use Force re-score all to re-run regardless. Re-crawls automatically reset scores when the page's content_hash changes (handled by the cache layer).

Fallback (when AI is unavailable)

If the Claude call fails, a pure heuristic fills in:

  • Start at 5
  • −2 if thin content
  • +1 if structural_completeness ≥ 0.8
  • +1 if freshness < 365 days
  • −1 if freshness > 730 days
  • Clamped to 1–10

Verdict text: "Heuristic fallback (AI unavailable)."
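The bullet rules above translate directly into code. A sketch, assuming a heuristic_score signature that is illustrative only:

```python
# Sketch of the documented heuristic fallback; signature is an assumption.
def heuristic_score(thin: bool, completeness: float, freshness_days: int) -> int:
    score = 5                       # start at 5
    if thin:
        score -= 2                  # thin content penalty
    if completeness >= 0.8:
        score += 1                  # structurally complete bonus
    if freshness_days < 365:
        score += 1                  # recently updated bonus
    elif freshness_days > 730:
        score -= 1                  # very stale penalty
    return max(1, min(10, score))   # clamp to 1-10

heuristic_score(False, 0.9, 100)    # fresh, complete, not thin → 7
```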

Where it shows up

  • Pages table on the site detail page — Quality column shows the integer
  • Excel export — both score and verdict in the Pages sheet
  • Similarity outcomes — quality_score ≥ 7 routes low-overlap pages to no_action_quality; lower scores route to no_action_improve

What it does NOT consider (yet)

  • GSC traffic (intentionally separate signal — gsc_clicks_12mo is not a quality input)
  • GA4 engagement / bounce
  • Sentiment, brand voice, factual accuracy

Similarity & Cannibalization

Identifies duplicate, near-duplicate, and cannibalizing pages using five independent signals computed in parallel: title embedding, headings/outline embedding, chunk-level embeddings, GSC keyword overlap, and tag-set overlap. Any combination can flag a problem; together they route each pair into one of 8 outcomes.

Mental model: Similarity vs Quality (Sprint 39)

As of Sprint 39 the Similarity page is a pure merge-decision surface. Pages appear here only when they have meaningful overlap with at least one other page. Pages with good quality and no overlap — or weak quality and no overlap — don't clutter this view; they live on the Quality page instead. The Similarity summary shows a line like "N other URLs analyzed but not flagged as merge candidates" so it's clear those pages aren't forgotten — they're just out-of-scope for merge work.

The classifier still emits no_action_quality and no_action_improve outcomes for pairs where overlap is below the actionable floor — these remain in page_similarity rows for future analysis. They just aren't rendered as tabs on the Similarity page anymore.

A URL filter sits on the Similarity page controls (since Sprint 39): type a URL or title substring to find a specific page and see which cluster it landed in — useful for spot-checking why a particular page wasn't flagged for merge, or confirming its peer set.

Quality-page tabs: Protect + Purpose or Purge (Sprint 39.2)

Protect surfaces pages worth defending: quality_score ≥ 7 AND has traffic (gsc_clicks_12mo > 0 OR ga_pageviews_12mo > 0) AND not in any merge cluster. These are the wins — pages earning real visits with no consolidation concern. Used during planning to identify content to build on or protect from accidental redirects.

Purpose or Purge surfaces pages to review: zero GSC clicks AND zero GA4 pageviews AND not in any merge cluster. Nobody reaches them through any channel. The analyst decides per page: does it serve a deliberate purpose (legal, brand, product docs, regulatory), or should it be purged (archived, consolidated, redirected, deleted)? Removing low-value pages lifts the overall site quality signal and reduces crawl budget waste.

A per-page boolean pages.is_archival (set during Score runs) caches the traffic-below-archival_floor verdict for downstream tools. The default floor is 50 combined clicks + pageviews over the 12-month lookback, tunable via similarity_config.archival_floor. The Quality page's Purpose-or-Purge filter uses the live GSC + GA4 columns directly (not the cached is_archival) so it stays accurate between Score runs if traffic data syncs independently.

Embedding model

We use BAAI/bge-small-en-v1.5 (384-dim) — pinned to a specific revision hash so embeddings are reproducible across deploys. Each page produces three vectors:

  • Title vector — embedding of the page title only
  • Headings vector — embedding of title + " | " + each in-body H1/H2/H3 (the page's topical outline)
  • Chunk vectors — body text split into ~400-word chunks with 50-word overlap (recursive splitter that prefers paragraph → sentence → word boundaries). Each chunk gets its own embedding.

Contextual chunking: chunk embeddings are encoded from {title}\n{meta_description}\n{heading}\n{content}, not just the chunk text. The title + description prefix anchors every chunk vector in page intent, so the bi-encoder sees "this section, of this article" rather than judging chunks as standalone text. Catches cases where two pages share a generic intro chunk but cover different subjects. After upgrading existing data, run Re-embed chunks with context from the Advanced dropdown to backfill.
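As a rough illustration, a fixed word window can stand in for the recursive splitter, and the contextual prefix is a plain string join. Every name here is hypothetical, and the real splitter prefers paragraph and sentence boundaries rather than hard word cuts.

```python
# Simplified sketch: fixed ~400-word windows with 50-word overlap stand in
# for the recursive splitter; all names are illustrative.
def chunk_words(words: list, size: int = 400, overlap: int = 50) -> list:
    step = size - overlap
    return [words[i:i + size] for i in range(0, max(len(words) - overlap, 1), step)]

def contextual_chunk_text(title, meta_description, heading, chunk_text):
    # the text actually embedded, anchoring the chunk vector in page intent
    return f"{title}\n{meta_description}\n{heading}\n{chunk_text}"

chunks = chunk_words("lorem ipsum ".split() * 500)   # 1000 words → 3 chunks
t = contextual_chunk_text("Guide to SEO", "Learn the basics",
                          "Keyword research", "Start with seed terms.")
```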

The three vector types cover three levels of granularity: headline, structural outline, and passage-level content. This is conceptually similar to multi-vector retrieval (ColBERT-style) — useful because two pages can share an outline but diverge on body, or vice versa.

Boilerplate auto-quarantine

Site-wide chunks (sidebar widgets, "Related Posts" lists, footer blurbs) get flagged as is_boilerplate=true when their content_hash appears on ≥30% of the site's pages (configurable). Boilerplate chunks are excluded from similarity comparisons, idea-checker matching, and quality scoring. Without this, every blog post on a site looks similar because they share "Further Reading" widgets.
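The ≥30% rule can be sketched as a hash-frequency pass. The data shape here (URL mapped to chunk hashes) is an assumption for illustration.

```python
from collections import Counter

# Sketch of boilerplate auto-quarantine: flag chunk hashes that appear on
# >= 30% of the site's pages. Data shapes are assumptions.
def quarantine_boilerplate(pages: dict, threshold: float = 0.30) -> set:
    """pages maps URL -> list of chunk content_hashes; returns boilerplate hashes."""
    page_counts = Counter(h for chunks in pages.values() for h in set(chunks))
    floor = threshold * len(pages)
    return {h for h, n in page_counts.items() if n >= floor}

site = {f"/p{i}": ["footer-hash", f"body-{i}"] for i in range(10)}
quarantine_boilerplate(site)  # only the sitewide footer chunk is flagged
```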

Pairwise signals

For each candidate pair we compute:

  • title_similarity — cosine of title vectors (0–1).
  • headings_similarity — cosine of headings vectors (0–1). Independent signal; high here + low chunk overlap = same outline, different writeups.
  • chunk_overlap_score — fraction of A's non-boilerplate chunks that have at least one B chunk above chunk_threshold (default 0.80 cosine).
  • tag_overlap_score — Jaccard of extracted_tags sets (0–1). Catches editorial intent overlap when content phrasing diverges.
  • impression_overlap_score (primary cannibalization signal — Sprint 14.1) — impressions-weighted Jaccard on shared GSC queries. Google showing both pages for the same query is the cleaner signal that they compete; clicks just say which one users picked.
  • shared_impressions — sum of min(A_impr, B_impr) across shared queries. Filtered by cannibalization_impressions_floor (default 500/month).
  • query_overlap_score — clicks-weighted Jaccard, kept as a reinforcing signal. High click overlap with low impression overlap means users strongly prefer one page.
  • cannibalization_clicks — absolute shared-query clicks; informs verdict text and primary picker.
  • position_competition_flag — true when both pages rank top-20 on ≥5 shared queries.
  • analytics_confidence — full (both have GSC) / partial / embedding_only.
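The impressions-weighted Jaccard and shared_impressions math can be sketched under the assumption that each page exposes a query-to-impressions map; the exact production formula may differ.

```python
# Sketch of the impressions-weighted overlap signals; shapes are assumptions.
def impression_overlap(a: dict, b: dict) -> tuple:
    """a, b map GSC query -> 12-month impressions. Returns (score, shared_impressions)."""
    queries = set(a) | set(b)
    inter = sum(min(a.get(q, 0), b.get(q, 0)) for q in queries)   # weighted intersection
    union = sum(max(a.get(q, 0), b.get(q, 0)) for q in queries)   # weighted union
    shared = sum(min(a[q], b[q]) for q in set(a) & set(b))        # shared_impressions
    return (inter / union if union else 0.0), shared

score, shared = impression_overlap({"seo tips": 800, "seo guide": 200},
                                   {"seo tips": 400, "link building": 600})
# inter = 400, union = 1600 → score 0.25; shared = 400
```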

Brand-keyword exclusion (Sprint 14.3)

Each site has a brand_terms list (Settings → Analytics → Brand keywords). Any GSC query whose text contains a brand term is marked is_brand_query=true at sync time and excluded from the cannibalization computation. Brand traffic naturally pulls all branded pages — that's expected user behavior, not cannibalization, and including it would falsely flag every branded landing page as competing with every other one.

Candidate generation (how pairs get picked)

A naive O(n²) scan would compare every page to every other page. Instead we:

  1. Build a single (N × 384) numpy matrix of all title embeddings.
  2. Compute the full pairwise cosine matrix in one vectorized operation.
  3. For each page, keep its top-50 nearest neighbors (candidate pool) above a minimum cosine of 0.50.
  4. Also include any pair where both pages have GSC queries — cannibalization can hit visually different titles.
  5. Compute the full per-pair signals (chunk, query, tag, headings) only for those candidates.

For a 1,000-page site this is ~50,000 pair evaluations instead of ~500,000 — fast and scales linearly with site size after the matrix multiply.
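Steps 1-3 above can be sketched with numpy; candidate_pairs and its parameters are illustrative, not the real implementation.

```python
import numpy as np

# Sketch of candidate generation: one vectorized cosine matrix, then top-k
# neighbors per page above a cosine floor. Names are illustrative.
def candidate_pairs(title_vecs: np.ndarray, k: int = 50, floor: float = 0.50) -> set:
    normed = title_vecs / np.linalg.norm(title_vecs, axis=1, keepdims=True)
    sims = normed @ normed.T                  # full pairwise cosine matrix
    np.fill_diagonal(sims, -1.0)              # exclude self-pairs
    pairs = set()
    for i in range(sims.shape[0]):
        for j in np.argsort(sims[i])[::-1][:k]:   # top-k nearest neighbors
            if sims[i, j] >= floor:
                pairs.add((int(min(i, j)), int(max(i, j))))
    return pairs

vecs = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]])
candidate_pairs(vecs, k=2)  # only the near-parallel pair (0, 1) survives the floor
```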

8-outcome decision matrix

Each pair routes on embedding similarity plus query/tag overlap:

  • Embedding HIGH (≥0.8) + overlap HIGH (GSC cannibalization) → strong_consolidate. Active cannibalization. Merge urgently, redirect, fold unique content.
  • Embedding HIGH + overlap NOT HIGH → content_dupe_no_seo. Same content, no active cannibalization. Lower-priority housekeeping merge.
  • Embedding LOW (<0.4) + overlap HIGH (GSC cannibalization) → topical_confusion. Google ranks both pages for the same queries despite different content. Refocus intent or consolidate.
  • Embedding MEDIUM (0.4–0.8) + overlap ANY → partial_dupe_split. Fold each overlapping chunk into its strongest match; keep unique chunks. (Requires title OR headings sim ≥ the partial_dupe_*_gate AND the source page doesn't fan out to too many peers — see Peer-spread filter below.)
  • Embedding LOW + overlap LOW + title sim ≥ 0.85 + zero traffic both sides → near_dupe_title. Same headline, dead traffic — consolidate archival/legacy duplicates.
  • Embedding ANY (thin page) + any title / content / tag signal → thin_content_candidate. Sprint 36 — thin-page pairs with real signal route to Quality → Thin content with "potential parents", not the merge queue. Fold or expand the thin page instead of redirecting.
  • Embedding LOW + overlap LOW + quality ≥ 7 → no_action_quality. Good and unique, leave alone. (Sprint 7.2: absorbed the former unique_standout outcome — has_unique_chunks didn't carry independent signal.)
  • Embedding LOW + overlap LOW + quality < 7 → no_action_improve. No duplication, but quality is poor — improve standalone.

Tags: additive only, never a trigger (Sprint 37)

Tag overlap is metadata — it describes what a page is about, not whether it competes in SERPs. Two posts tagged social-media are about social media; that does not mean Google ranks both for the same queries, nor that they duplicate each other's content. On blog archives (Hive Digital, agency sites) tags are sitewide and nearly every post shares them, which would fire endless false-positive topical_confusion / strong_consolidate pairs.

Pre-Sprint-37 the classifier let tag_high (tag_overlap ≥ tag_overlap_threshold) trigger outcomes by itself. As of Sprint 37, tags are never a standalone trigger. They can appear in detail strings as corroborating commentary when a real signal has already fired, but the decision is always driven by content overlap + GSC cannibalization. Urgency = real cannibalization; housekeeping = duplicated content without SERP competition.

Peer-spread filter (pillar-page detection)

The outcome matrix evaluates one pair at a time — it can't tell whether a page shows up as a partial_dupe_split source against 1 peer or 15 peers. A page that fans out to many partial-dupe peers is almost always a pillar / roll-up (table-of-contents over 5+ subtopics, each with its own deep-dive), not a realistic split candidate. You can't fold one page's content cleanly into 6 destinations.

After classification we count each page's partial_dupe_split peer set across the site. Any page exceeding partial_dupe_max_peers (default 3) is marked a pillar, and every partial-dupe pair touching it gets demoted to no_action_improve so those hours of human audit time don't get wasted. Set to 0 in settings to disable.

Minimum-chunks guard (thin-page protection)

chunk_overlap_score = matched_chunks / total_chunks_A. If page A has only 1 or 2 non-boilerplate chunks and one of them matches a sitewide widget chunk on 80 other pages, overlap hits 1.0 (100%) and the pair falsely fires as content_dupe_no_seo. This is exactly how the Hive Digital "Learning" sidebar was producing 81-peer false positives.

Guard: when either page has fewer than min_chunks_for_dupe non-boilerplate chunks (default 3), effective chunk overlap is zeroed UNLESS title or headings gate also corroborate. Thin pages can still flag as near_dupe_title or other title-driven outcomes — they just can't fire duplicate flags on chunk count alone.

Sprint 36 — Thin-content intercept. Before the guard zeroes overlap, we now check whether the thin pair still has real signal: ≥ min_matched_chunks_for_dupe matched chunks with chunk_overlap ≥ 0.5, or title similarity ≥ 0.7, or tag overlap ≥ tag_overlap_threshold. When any of those fires on a thin pair, the outcome becomes thin_content_candidate — surfaced in Quality → Thin content with a "potential parents" list instead of dumping the pair into the merge queue. Thin pages are rarely redirect targets and rarely redirect sources; the right call is fold into a stronger parent or expand the thin page, both of which happen on the Quality page, not Similarity.

Path affinity (editorial context check)

URL structure encodes editorial intent cheaply. Two /blog/* posts competing on the same topic is meaningfully different from /services/seo "competing" with /blog/how-seo-works — the latter is expected supporting content, not a merge candidate. Content cosine doesn't know this; path affinity does.

Two signals combine into one 0-1 score per pair:

  • Section affinity — structural path-prefix comparison. Identical = 1.0, hierarchical (one path is ancestor of the other) = 0.7, same first segment = 0.6-0.9, different first segment = 0.3.
  • Slug affinity — Jaccard on leaf-segment tokens (split on - and _, stop words + numerics filtered). /blog/foo-bar vs /blog/lorem-ipsum-foo-bar shares {foo, bar} and scores higher than /blog/foo-bar vs /blog/lorem-ipsum.

When the combined score is below path_affinity_floor (default 0.5), the classifier demotes — content_dupe_no_seo → partial_dupe_split, partial_dupe_split → no_action_*. Never promotes, only dampens. Flat sites (all URLs at root) always compare same-section so the signal is a no-op there. We use literal token Jaccard, not embeddings, because title cosine already captures semantic slug similarity; this is the cheap deterministic editorial-context check.
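The slug-affinity half can be sketched as token Jaccard on the leaf segment. The stop-word list here is a tiny stand-in for whatever the pipeline actually filters.

```python
import re

# Sketch of slug affinity: token Jaccard on the URL leaf segment,
# split on - and _, with stop words and numerics dropped.
STOP = {"the", "a", "an", "and", "of", "to", "how"}  # illustrative stand-in list

def slug_affinity(url_a: str, url_b: str) -> float:
    def tokens(url):
        leaf = url.rstrip("/").rsplit("/", 1)[-1]
        return {t for t in re.split(r"[-_]", leaf)
                if t and t not in STOP and not t.isdigit()}
    ta, tb = tokens(url_a), tokens(url_b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

slug_affinity("/blog/foo-bar", "/blog/lorem-ipsum-foo-bar")  # shares {foo, bar} → 0.5
```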

Query impression floor

Old behavior kept the top 50 GSC queries per page by clicks. That dropped low-volume queries below the cutoff — but a query with 1 impression/month × 12 months = 12 impressions is still a legitimate cannibalization candidate if another page also ranks for it. Floor is now applied AFTER monthly roll-up, on aggregated impressions, and any query above the floor is stored (no top-N cap).

Tunable via query_impression_floor (default 5). Lower = more candidate queries stored + more noise. Higher = only established traffic queries qualify.

Cluster formation

Pairs with outcome strong_consolidate, content_dupe_no_seo, partial_dupe_split, or near_dupe_title are linked together via union-find — transitive closure groups them into clusters. A page that's in a partial dupe with both X and Y will end up in the same cluster as X, Y, and X's other partial-dupe siblings.
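A minimal union-find sketch of that transitive closure (function name hypothetical):

```python
# Minimal union-find: linkable pairs collapse into transitive clusters.
def form_clusters(pairs: list) -> list:
    parent: dict = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)           # union the two roots

    clusters: dict = {}
    for page in parent:
        clusters.setdefault(find(page), set()).add(page)
    return list(clusters.values())

form_clusters([("A", "X"), ("A", "Y")])     # A, X, Y land in one cluster
```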

Primary-page picker (for merges) — v3 weights (Sprint 38)

When consolidating a cluster, a weighted score chooses the survivor:

  • 30% normalized GSC clicks (12-month)
  • 20% normalized avg position (lower position = higher score)
  • 10% normalized unique-query count (more distinctive topics)
  • 15% normalized CTR on the cluster's shared queries (Sprint 14.2) — pages users prefer for the contested queries
  • 10% normalized quality score
  • 15% normalized GA4 pageviews (12-month) (Sprint 38) — captures non-organic traffic (email, social, direct) so email-campaign-driven pages don't lose merge ties to pages that rank but don't convert clicks

Falls back to highest word count if no signals are available. The survivor is the redirect target; everything else folds into it.

Dynamic rebalance: when a site is missing a signal (no GA4 sync, no GSC data, no quality scoring, etc.) that signal's weight drops and the remaining weights renormalize to sum to 1.0. A GSC-only site uses the pre-Sprint-38 weights exactly; a full-signal site gets the new 30/20/10/15/10/15 split; partial-signal sites fall somewhere in between. No dead weight ever.
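The dynamic rebalance is simple to sketch: drop the weights for missing signals and renormalize the survivors to 1.0. The signal keys here are illustrative, and each value is assumed to be already normalized 0-1.

```python
# Sketch of the v3 picker with dynamic rebalance; keys are illustrative.
WEIGHTS = {"gsc_clicks": 0.30, "avg_position": 0.20, "unique_queries": 0.10,
           "shared_ctr": 0.15, "quality": 0.10, "ga4_pageviews": 0.15}

def picker_score(signals: dict) -> float:
    """signals holds normalized 0-1 values; absent keys are missing signals."""
    active = {k: w for k, w in WEIGHTS.items() if k in signals}
    total = sum(active.values())            # renormalize remaining weights to 1.0
    return sum(signals[k] * w / total for k, w in active.items()) if total else 0.0

# GSC-only site: ctr/quality/ga4 weights drop out and the rest renormalize
picker_score({"gsc_clicks": 1.0, "avg_position": 0.5, "unique_queries": 0.5})
```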

Merge-protection badge (Sprint 38)

After the picker chooses a primary, we compare combined traffic (GSC clicks + GA4 pageviews, 12-month) across every page in the cluster. If any non-primary page has combined traffic at least merge_protection_gap_pct higher (default 20%) than the chosen primary — AND exceeds merge_protection_min_traffic (default 50) — the pair gets a ⚠ Merge protection badge on the Similarity page.

The badge is a warning, not a demotion: the outcome and cluster are unchanged. It tells the analyst "the picker chose page A, but page B in this cluster actually has more traffic — confirm before redirecting, you might be losing an active email/direct/social channel." Human validation workflow can resolve.

Tunable thresholds

All configurable per-site (Settings → Advanced or the gear icon next to Analyze):

  • title_threshold — minimum title cosine to compare pages (default 0.75)
  • chunk_threshold — minimum chunk cosine to count as overlapping (default 0.80)
  • chunk_overlap_threshold — fraction of chunks that must overlap to flag duplicate (default 0.60)
  • tag_overlap_threshold — Jaccard threshold for tag overlap to reinforce/shift outcomes (default 0.5)
  • cannibalization_clicks_floor — minimum shared clicks/month before flagging (default 50)
  • impression_overlap_threshold — impressions-weighted Jaccard floor for cannibalization (default 0.20)
  • cannibalization_impressions_floor — minimum shared impressions before flagging (default 500)
  • partial_dupe_title_gate — partial_dupe_split needs title sim ≥ this OR headings sim ≥ headings_gate (default 0.5)
  • partial_dupe_headings_gate — paired gate on headings outline similarity (default 0.5)
  • partial_dupe_max_peers — peer-spread filter: pillar pages demoted when > this (default 3, set 0 to disable)
  • query_impression_floor — min aggregated impressions to store a GSC query for cannibalization math (default 5)

Embedding model upgrades

When the model name (or pinned revision) changes, existing stored vectors are in the old model's space. Cross-comparing old and new vectors gives meaningless results. The Re-embed site action (Advanced dropdown) re-encodes every title, headings, and chunk vector with the current model. pages.embedding_model tracks each row's model so the job is resumable — re-embed can be killed and restarted without redoing work.

Link extraction: editorial vs template

Every crawled page has its outbound links captured two ways:

  • body_links — editorial links from the extracted main_content (after strip-selectors remove nav/footer/sidebar). This is the "this page chose to link there from real content" signal.
  • template_links — everything else (nav, footer, sidebar widgets, related-posts blocks, etc.). Captured from a pristine HTML parse before strip runs, then set-differenced against body_links by href.

Inbound counts follow the same split:

  • inbound_link_count — full graph (template + editorial). This is what you want for site-wide reachability and most visualizations. If a URL appears only in the global footer, it still shows positive inbound here.
  • inbound_main_content_link_count — editorial-only subset. This is the "how many pages CHOSE to link from real content" signal. Zero = orphan in the editorial sense even if global nav covers it.

Quality scoring uses body_links directly for its internal/external link count (editorial "depth"). Primary-page picker doesn't use inbound counts today; could factor in editorial count later for authority weighting.

Where it shows up

  • Clusters page — grouped by primary URL with expandable peers showing all four signal scores
  • Page detail panel → Similarity tab — peers for one page with Dominant badges on cluster primaries
  • Merge Clusters sheet (XLSX) — groups of pages with primary-page recommendation
  • Similarity Pairs sheet (XLSX) — every flagged pair with all signal scores

What it does NOT consider (yet)

  • Lexical/BM25 signal — purely neural right now, so rare technical jargon shared between pages might not surface (planned: Sprint 13, parked).
  • Per-page topic extraction — an LLM-assisted "what is this page about" signal that would feed topic-overlap math (planned: Sprint 16, cost-gated).
  • Internal link graph — inbound-link counts not yet factored into primary-page scoring.
  • Backlink authority (no external SEO data).

AI Validation

A targeted AI pass that flags title / tag / content mismatches on a per-page basis. Where the quality scorer asks "is this page well-built?" and similarity asks "does this page collide with another?", validation asks "does this page deliver what its title and tags promise?"

What it actually does

For each page, sends Claude a constrained prompt with the page title, extracted_tags, and a sample of main_content. Asks: does the body match what the title + tags promise, or is there a meaningful divergence? Output is parsed into structured rows in ai_validation_flags — never free-form prose, so the results are filterable.

Typical flags

  • Title mismatch — title promises X, body delivers Y (e.g. "Complete guide to…" on a 200-word stub).
  • Tag mismatch — tagged as belonging to topic A but content is mostly about topic B. Pollutes tag landing pages and confuses the topical model.
  • Topic drift — body wanders off the subject the title set up.

When to run it

  • After Quality + Similarity, on the pages you're actually planning to keep. No point validating pages you're about to consolidate or delete.
  • On a sample first if cost matters — Validate is the most expensive recurring pass.
  • Before recommending tag-page consolidation to a client — surfaces the "wrong tag" cases that would make a tag merge worse, not better.

Cost

~$0.003 per page (one Claude call per page, no caching across pages). On an 800-page site that's ~$2.40. Tracked in the History page's estimated_cost column.

Content Ideas

The Ideas tab takes a proposed content idea (title or topic) and compares it against the site's existing inventory via title + chunk embeddings. Each idea gets a list of existing pages ranked by semantic similarity, and a bucket label that summarizes what to do about it.

Classification tiers (per match)

  • already_covered — similarity ≥ 0.85. Existing page substantially covers the topic.
  • closely_related — similarity 0.60 – 0.85. Existing page is topic-adjacent but not the same angle.
  • loose_match — similarity < 0.60. Tangential; the idea is mostly net-new.

Buckets (Sprint 42) — one label per idea, derived from the top match

  • Subtle Revision — top match is already_covered. Refine the existing page; don't create a duplicate.
  • Expand or Create — top match is closely_related. Extend the existing page OR ship complementary new content with clear differentiation.
  • New Content — top match is loose_match (or no matches). No existing page has strong alignment; write a new piece.

Top-match-wins: the single highest-scoring match drives the bucket. The full match list stays exposed in the row detail so the analyst sees every related page even when the bucket is headline-level.
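Top-match-wins bucketing is a three-way threshold check; this sketch assumes the tier cutoffs above and a hypothetical function name.

```python
# Sketch of top-match-wins bucketing using the documented tier cutoffs.
def bucket(match_scores: list) -> str:
    top = max(match_scores, default=0.0)   # single highest-scoring match drives it
    if top >= 0.85:
        return "Subtle Revision"           # top match is already_covered
    if top >= 0.60:
        return "Expand or Create"          # top match is closely_related
    return "New Content"                   # loose_match, or no matches at all

bucket([0.91, 0.42])   # the 0.91 match wins → "Subtle Revision"
```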

Sources + freshness

Each idea tracks its sources — where it came from. user means submitted via the tool's Check Coverage form. Future imports from prompt-clustering / prompt-generator land with their respective source tag. An idea can carry multiple sources over time (e.g. user-entered then later confirmed via a clustering run) so the row stays canonical instead of duplicating.

last_compared_at tracks when the idea was last scored against the current inventory. On inventory change (new pages added, old pages edited) the match scores drift. A future Refresh idea scoring button (Sprint backlog) will re-run every idea through the matcher without requiring re-import.

Views

  • Group by idea (default) — each idea is its own card with its ranked match list.
  • Group by URL — pivots to show ideas stacked under their top-match URL. Answers "how many ideas are aiming at this page?" at a glance. New-Content bucket ideas (no strong URL alignment) appear in a separate group at the bottom.
  • PageDetailPanel → Ideas tab — from any page's detail view, see ideas that matched this page. Default filter: already_covered + closely_related. Loose matches behind an expander.

Phased Analysis (Fast / Standard / Deep)

Analysis is tiered so you can stop at the cheapest level that gives good enough results, and only pay (in time or LLM cost) for deeper passes if you need them. Each phase enriches the same pair rows rather than recomputing — running Phase 2 doesn't throw away Phase 1's work.

  • Phase 1 — Fast. Adds: title cosine, headings cosine, chunk overlap (boilerplate excluded), tag overlap, GSC impression + click overlap (with brand exclusion). Time (1k pages): ~30-60 sec. Cost: $0 (local).
  • Phase 2 — Standard. Adds: cross-encoder reranker score on candidate pairs (cross-encoder/quora-roberta-base, trained on Quora Question Pairs duplicate detection). Joint attention over both texts — decouples "coincidental cosine overlap" from "actually near-duplicate". Promotes or demotes pairs based on reranker confidence. Time: +30-60 sec. Cost: $0 (local model).
  • Phase 3 — Deep. Adds: BM25 lexical signal + per-page topic extraction + topic overlap (Sprints 13 + 16 — planned). Time: +2-5 min. Cost: ~$5-10 per 1k pages (LLM).

How to invoke

On the site detail page, the Analyze button is a split button. The main click runs a complete analysis — Phase 1 followed automatically by Phase 2 (Sprint 30 auto-chain). The chevron opens a troubleshooting dropdown where you can run any single phase without chaining, for debugging or re-running one phase in isolation.

"Re-analyze all" always starts from Phase 1 with force=true (deletes existing similarity rows + clusters, recomputes from scratch) and then auto-chains Phase 2. Use this after changing thresholds or classifier logic — it's the only way to guarantee every pair is re-evaluated against the new rules.

  • Phase 3 prompts for cost confirmation the first time you run it on a site (one-time opt-in stored on sites.deep_analysis_opted_in). Phase 3 is dropdown-only and does not auto-chain — user always picks it explicitly.
  • The status line under the button shows last: Phase X — which depth you've reached.
  • Each pair row tracks page_similarity.phase so you can see in the Detail panel which depth produced the current outcome.
  • Auto-chain is implemented backend-side: the Phase 1 job calls _run_phase2_reranker inline once Phase 1 completes successfully, producing two site_jobs rows (analyze_phase1 + analyze_phase2) and a contiguous progress story.

When to use which

  • Phase 1 — default for most audits. Catches obvious dupes + cannibalization. Run after every crawl.
  • Phase 2 — when Phase 1 has too many partial_dupe_split or borderline pairs you can't confidently action. The reranker pushes ambiguous cases off the fence.
  • Phase 3 — for technical clients with jargon-heavy content (BM25 catches term overlap that neural cosine blurs), or for high-stakes consolidation decisions where you want topic-level confirmation. Skip otherwise — costs real money.

Currently shipped

Phase 1 and Phase 2 are fully active. Phase 3 button currently marks rows as having reached that phase but doesn't yet add new signals — becomes functional as Sprints 13 + 16 ship (see the roadmap section below).

Phase 2 deep-dive — why a cross-encoder beats cosine

Phase 1 uses a bi-encoder: every title and chunk gets encoded to a 384-dim vector independently, and we compare vectors with cosine distance. Fast (microseconds per pair) and general-purpose, but the model never sees both texts together. It can only answer "are these two vectors close in embedding space?" — not "are these two specific texts about the same thing?" Two pages with shared boilerplate, shared intros, or identical headlines but different bodies all score "similar" under cosine.

Phase 2 uses a cross-encoder (cross-encoder/quora-roberta-base): both pages are fed into the transformer together, and attention layers let every word in page A look at every word in page B while forming the judgment. Output is a single 0–1 duplicate-confidence score. The model is trained on Quora Question Pairs — explicitly a duplicate-classification task, not retrieval reranking — so its scores answer "are these the same article?" directly.

Score interpretation: 1.0 = the model is maximally confident the two texts are duplicates; 0.0 = maximally confident they're not. Calibration anchors: identical text scores ~0.84, distinct topics in the same domain score 0.00–0.06. The thresholds below (0.75 / 0.55 / 0.10) are the bands the classifier acts on. Values in the middle (0.11 – 0.74) keep the Phase 1 verdict — the reranker isn't confident enough to overrule.

Model history: we previously used BAAI/bge-reranker-base (an MS MARCO retrieval reranker). It scored topical relevance ("would B be a relevant answer to query A?") instead of duplication, producing 0.85+ false-positives on any two SEO blog posts that shared generic "answer-passage" framing. Replaced with quora-roberta-base in Sprint 24 after the bge model produced wildly inflated strong_consolidate clusters on Hive Digital test data.

The cost is ~100× slower per comparison (10–50 ms vs. microseconds), so we only run it on candidate pairs Phase 1 has already flagged — specifically partial_dupe_split and near_dupe_title outcomes, where bi-encoder cosine is ambiguous enough that the extra signal actually changes decisions. For a 1,000-page site that's a few hundred pairs and ~30–60 seconds total.

Phase 2 runs on every actionable outcome (strong_consolidate, content_dupe_no_seo, topical_confusion, partial_dupe_split, near_dupe_title) — not just borderline ones. High reranker is corroboration (gets us closer to "100% confident, act on this"). Low reranker catches false positives across all outcomes, which matters especially for content_dupe_no_seo where rotating sidebar widgets can produce 100% chunk_overlap on thin pages and slip past the boilerplate detector.

Reclassification rules (by Phase 1 outcome):

  • strong_consolidate + reranker < reranker_strong_confirm (default 0.65) → content_dupe_no_seo. Phase 1 found query/tag + content signals but the reranker didn't reaffirm — lower-priority review, not urgent merge. Tunable in Settings → Advanced.
  • content_dupe_no_seo + reranker ≥ 0.75 AND query_high → strong_consolidate. Reranker + query signal corroborate active competition.
  • content_dupe_no_seo + reranker ≤ 0.10 → no_action_*. Boilerplate/widget chunk slipped past the auto-detector; pages are actually distinct.
  • topical_confusion + reranker ≥ 0.75 → strong_consolidate. Bodies ARE near-duplicates after all — Phase 1 was fooled by aggressive content stripping.
  • topical_confusion + reranker < 0.40 + zero Phase 1 overlap signal (chunk_overlap, query_overlap, AND impression_overlap all effectively zero) → no_action_*. Pair was classified entirely on tag overlap; without content corroboration from the reranker either, almost certainly noise. Fires before the 0.10 demotion to catch the middle band too.
  • topical_confusion + reranker ≤ 0.10 → no_action_*. Bodies differ; shared queries are likely template-driven double-listings or Google noise.
  • partial_dupe_split + reranker ≥ 0.75 → content_dupe_no_seo (or strong_consolidate if query_high).
  • partial_dupe_split + reranker ≤ 0.10 → no_action_*.
  • near_dupe_title + reranker ≥ 0.55 → content_dupe_no_seo (or strong_consolidate if query_high).
  • near_dupe_title + reranker ≤ 0.10 → no_action_*.
  • Middle-band scores: keep the Phase 1 verdict, annotate the pair with the reranker number.
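Assuming illustrative field and function names (the real pipeline's identifiers may differ), the reclassification rules above can be sketched as:

```python
def reclassify(outcome, reranker, query_high=False,
               chunk_overlap=0.0, query_overlap=0.0, impression_overlap=0.0,
               strong_confirm=0.65):
    """Sketch of the Phase 2 reclassification rules. Returns "no_action" where
    the text says no_action_* (the real system picks a subtype)."""
    zero_overlap = chunk_overlap == 0 and query_overlap == 0 and impression_overlap == 0
    if outcome == "strong_consolidate":
        if reranker < strong_confirm:        # tunable: reranker_strong_confirm
            return "content_dupe_no_seo"
    elif outcome == "content_dupe_no_seo":
        if reranker >= 0.75 and query_high:
            return "strong_consolidate"
        if reranker <= 0.10:
            return "no_action"
    elif outcome == "topical_confusion":
        if reranker >= 0.75:
            return "strong_consolidate"
        if reranker < 0.40 and zero_overlap:  # fires before the 0.10 demotion
            return "no_action"
        if reranker <= 0.10:
            return "no_action"
    elif outcome == "partial_dupe_split":
        if reranker >= 0.75:
            return "strong_consolidate" if query_high else "content_dupe_no_seo"
        if reranker <= 0.10:
            return "no_action"
    elif outcome == "near_dupe_title":
        if reranker >= 0.55:
            return "strong_consolidate" if query_high else "content_dupe_no_seo"
        if reranker <= 0.10:
            return "no_action"
    return outcome  # middle band: keep the Phase 1 verdict
```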

When a pair is reclassified, Phase 1's verdict is preserved in page_similarity.phase_outcomes so the UI can show "Phase 1 said X; reranker reclassified to Y." Running Phase 2 a second time is idempotent — it rescans but doesn't double-demote.

Within-cluster peer ordering: on the Similarity page, peers inside an expanded cluster are sorted by reranker_score (descending), with cannibalization clicks as tiebreaker. Peers without a reranker score (Phase 1-only sites or non-rerankable outcomes) sink to the bottom. Peers with reranker_score below 0.60 are hidden by default behind a "Show N lower-confidence peers" toggle — flip the page-level checkbox to reveal them everywhere at once. Note: the 0.60 frontend hide threshold is calibrated for the quora-roberta model; under the old bge model it would have been ~0.85.
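A minimal sketch of that ordering, assuming dict-shaped peers with illustrative field names (the treatment of unscored peers as visible-at-the-bottom is an assumption drawn from "sink to the bottom"):

```python
def order_peers(peers, hide_below=0.60, show_all=False):
    """Sort cluster peers: reranker_score descending, cannibalization clicks
    as tiebreaker; peers without a reranker score sink to the bottom.
    `hide_below` mirrors the 0.60 frontend threshold calibrated for the
    quora-roberta model."""
    def key(p):
        s = p.get("reranker_score")
        return (s is None,                          # scoreless peers last
                -(s if s is not None else 0.0),     # higher score first
                -p.get("cannibalization_clicks", 0))  # tiebreaker
    ordered = sorted(peers, key=key)
    if show_all:   # the page-level "show lower-confidence peers" checkbox
        return ordered
    return [p for p in ordered
            if p.get("reranker_score") is None or p["reranker_score"] >= hide_below]
```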

Methodology — how comparisons work

The audit's primary goal is to identify pages that are (a) loosely related — internal-link opportunities, (b) heavily related — consolidation candidates, or (c) functional duplicates — merge + redirect. This section explains how we do that and how the approach maps to modern retrieval/RAG techniques.

Multi-vector representation

Modern retrieval systems (Google, ColBERT, Splade, etc.) don't represent a document as a single embedding — they use multiple vectors at multiple granularities. We do the same:

Three granularities, stored per page:

  • title_embedding — one vector for the page title.
  • headings_embedding — one vector for the title + body H1/H2/H3 outline.
  • page_chunks.embedding — one vector per ~400-word body passage.
Two pages with totally different titles but identical "How to do X" instructions fail title-only matching but get caught by chunk overlap. Two pages with the same title but different bodies match on title while the chunks reveal the divergence. Headings sit between — outline alignment without phrasing alignment.
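A minimal sketch of comparing two pages at each granularity, assuming each page is a dict of pre-computed vectors with illustrative field names; the chunk_overlap definition (each chunk's best match in the other page, averaged) is a common formulation, not necessarily the pipeline's exact one:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def compare(page_a, page_b):
    """Compare two pages at three granularities. 'title' and 'headings' are
    1-D arrays; 'chunks' is 2-D with one row per ~400-word passage."""
    title_sim = cosine(page_a["title"], page_b["title"])
    heading_sim = cosine(page_a["headings"], page_b["headings"])
    # For each chunk in A, its best cosine match in B, averaged — this is
    # what catches shared passages even when the titles diverge.
    A = page_a["chunks"] / np.linalg.norm(page_a["chunks"], axis=1, keepdims=True)
    B = page_b["chunks"] / np.linalg.norm(page_b["chunks"], axis=1, keepdims=True)
    chunk_overlap = float((A @ B.T).max(axis=1).mean())
    return {"title": title_sim, "headings": heading_sim, "chunk_overlap": chunk_overlap}
```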

Two-stage retrieval

Industry-standard pattern: fast candidate generation, then careful re-scoring. We do this with the title-vector matrix, taking each page's top-50 nearest neighbors as candidates rather than scoring all N² pairs. Comparing every page to every other one would be wasteful and slow; the cosine-by-matrix-multiply step scales as O(N²) but with a tiny constant, then per-pair signal computation only runs on the K most plausible neighbors.
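The first stage — cosine over all title vectors in one matrix multiply, then each page's top-k neighbors — can be sketched as follows (a simplified illustration, not the pipeline's actual code):

```python
import numpy as np

def top_k_candidates(title_embeddings: np.ndarray, k: int = 50) -> np.ndarray:
    """Stage one: all-pairs cosine via a single matrix multiply, then each
    page's top-k nearest neighbors as candidate pairs. Per-pair signal
    computation (stage two) runs only on these."""
    E = title_embeddings / np.linalg.norm(title_embeddings, axis=1, keepdims=True)
    sims = E @ E.T                      # O(N^2), but one tight BLAS call
    np.fill_diagonal(sims, -np.inf)     # a page is not its own candidate
    k = min(k, sims.shape[0] - 1)
    return np.argsort(-sims, axis=1)[:, :k]  # shape (N, k): neighbor indices
```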

Google's search retrieval uses a similar pattern: massive ANN-style first-stage retrieval over the index, then deep neural re-ranking on the top few hundred. Our approach is the same shape at smaller scale.

Multi-signal fusion

A single similarity number is too brittle. Two pages can be:

  • Same content, different titles — chunk overlap catches it, title doesn't.
  • Same title, different content — title matches, chunks don't. Often archival pages where the original argument was rewritten.
  • Different content, same Google queries — pure cannibalization (Google sees them as the same topic). Only catchable with GSC data.
  • Different everything but same editorial tags — the editor classified them as covering the same beat. We catch this with tag overlap.

The 8-outcome matrix combines all four signals to produce an actionable, investigatory, or safe-to-ignore label rather than a single score that hides the reason.

Eliminating noise

Two systematic noise sources kill naive comparisons:

  • Boilerplate. Sidebar widgets, footer text, and "Related Posts" lists appear on every page — naive chunk comparison would say every blog post is identical. We auto-detect these by content-hash frequency (≥30% of pages) and skip them.
  • Long-tail GSC queries. Two pages might share a single click on "x" — not cannibalization, just noise. We require cannibalization_clicks ≥ 50/month before flagging.
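The boilerplate auto-detection by content-hash frequency can be sketched like this (function names and the exact hash normalization are illustrative; the ≥30%-of-pages threshold is from the text):

```python
import hashlib
from collections import Counter

def find_boilerplate(pages, threshold=0.30):
    """Return the set of chunk hashes appearing on >= `threshold` of pages
    (default 30%) — e.g. sidebar widgets, footers, "Related Posts" lists.
    `pages` is a list of per-page chunk-text lists."""
    def h(text):
        return hashlib.sha1(text.strip().lower().encode()).hexdigest()
    freq = Counter()
    for chunks in pages:
        for digest in {h(c) for c in chunks}:  # count each hash once per page
            freq[digest] += 1
    return {d for d, n in freq.items() if n / len(pages) >= threshold}

def strip_boilerplate(chunks, boilerplate):
    """Drop flagged chunks before any similarity comparison."""
    h = lambda t: hashlib.sha1(t.strip().lower().encode()).hexdigest()
    return [c for c in chunks if h(c) not in boilerplate]
```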

Determinism

Same input → same output. The embedding model is pinned to a specific revision hash. The chunker is purely algorithmic. The decision matrix is deterministic. The only AI step is the quality verdict text, which uses temperature=0 and constrained output format. This means improvements between rounds are real signal, not LLM variance.

Where we differ from Google's actual retrieval

Google's retrieval is opaque, but their approach almost certainly includes signals we don't have:

  • Click and dwell engagement at scale — they have it, we have only what GSC reports.
  • Backlink authority + brand signals — we'd need a third-party SEO API to add this.
  • Cross-encoder re-ranking at index scale — they use it heavily across the whole corpus; we run it only as Phase 2, on the pairs Phase 1 flags.
  • Lexical (BM25) overlap as a fallback signal — purely neural now; rare technical terms can be a blind spot.

For "find the next-best page to redirect to"

When you're killing a page for quality reasons and need to pick a redirect target: open the page's detail panel → Similarity tab. The peer list is sorted with the strongest match first (by recommendation priority + cannibalization clicks). Among peers in the same cluster, the one with the green Dominant badge is the cluster primary — the page our weighted scoring chose as the survivor based on traffic, position, distinctive query count, and quality. That's usually your redirect target.

What's coming next

A live look at what we're building next, in priority order. Items move to "shipped" as they land in production.

Active queue

  • Sprint 11.5 — Cross-encoder reranker (shipped). A cross-encoder adjudicates borderline pairs from Phase 1, wired as Phase 2 (originally BAAI/bge-reranker-base; swapped for cross-encoder/quora-roberta-base in Sprint 24). Sharper "is this really a duplicate?" judgments than cosine alone — the biggest false-positive killer for the price.
  • Sprint 16 — Per-page topic extraction + topic overlap. LLM-extracts 3–7 topics per page; new topic_overlap_score signal, plugged into Phase 3. Catches "same things covered, different writeups" — the case chunk cosine misses, and the clearest signal for "should we consolidate?".
  • Sprint 13 — BM25 lexical signal. Pure-Python BM25 over chunks; new lexical_overlap_score, plugged into Phase 3. Direct value for technical clients (electronics, drug compounds, legal); catches term-overlap blind spots in pure neural embeddings.
  • Sprint 15 — GA4 engagement weighting. Adds 10% weight on an engagement composite (avg time × inverse bounce) to the primary picker. Pages users actually engage with should win as the redirect target; median imputation for missing-data pages.
  • Sprint 19 — AI grading sidequest (QA tool). Sample 50 pairs across outcomes; ask Claude "consolidate / cross-link / leave?" and compare to our classification, reporting precision/recall per outcome. Validates that thresholds are tuned correctly; used after major signal changes.
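For the Sprint 13 item, a minimal pure-Python BM25 sketch (standard Okapi BM25 formula; the planned lexical_overlap_score will likely differ in tokenization and aggregation details):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each doc (a list of tokens) against the query terms with
    Okapi BM25. Returns one score per doc; 0.0 means no term overlap."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()                      # document frequency per term
    for d in docs:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

Rare technical terms (part numbers, compound names) get high idf, which is exactly the blind spot of pure neural embeddings this signal is meant to cover.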

Backlog (parked, may activate later)

  • Late Chunking experiment — switch to a long-context embedder (jina-embeddings-v3) so chunk vectors carry document context. Big infra change; promote out of backlog only if topics + reranker don't close the gap.
  • Custom tag extractor — sandboxed Python execution for non-standard tag markup. Parked for safety review.
  • Backlink authority signal — third-party SEO API integration (Ahrefs / Moz / Majestic). Would feed both quality scoring and primary-page picker.
  • Full BERTopic / NMF topic modeling — heavier alternative to LLM-extracted topics; better for sites with consistent jargon vocabularies.
  • Semantic chunking — split where embedding distance between sentence windows spikes. Marginal win over current recursive fixed-size; not a priority.
  • Notes / status preservation across ignore → re-add round-trip — when a page moves out of inventory and is later re-added, restore its action_status + action_notes.

Have a request? Add it to the conversation — we maintain a roadmap that gets updated as we agree on priorities.

More glossary entries will be added as the tool grows.