User guide
A task-by-task reference for the two common roles that use this tool. Treat these pages as current; if something looks stale, ping the team so the guide gets refreshed.
Last updated: 2026-04-18 · Sprint 27 (resume behavior)
1. Overview
This tool builds a content inventory for a site, enriches it with Google Search Console + Google Analytics 4 data, then runs deterministic and AI analyses to surface quality issues, duplicate/cannibalization risks, and tag/taxonomy mismatches. Everything measurable is computed with deterministic Python so results are reproducible run-to-run; AI is used only for quality verdicts and validation narratives, with results cached by a content hash so re-runs don't re-bill for unchanged pages.
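The content-hash cache amounts to roughly this (a minimal sketch; the function and cache names are illustrative, not the product's actual internals):

```python
import hashlib

# Hypothetical in-memory cache keyed by a hash of the page content.
_verdict_cache: dict[str, str] = {}

def quality_verdict(page_content: str, ai_judge) -> str:
    """Return the cached AI verdict when the page content is unchanged,
    so re-runs don't re-bill for identical pages."""
    key = hashlib.sha256(page_content.encode("utf-8")).hexdigest()
    if key not in _verdict_cache:  # only changed/new content triggers an AI call
        _verdict_cache[key] = ai_judge(page_content)
    return _verdict_cache[key]
```

Because the key is derived from the content itself, a page whose text hasn't changed hits the cache no matter how many times the audit is re-run.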
Your job as a consultant is to get a site configured so the audit output is trustworthy. That means picking the right ingestion method (crawl vs CMS import), connecting analytics, setting up tag handling, and tuning thresholds where the defaults don't match the client's content shape.
2. Create a new site
Click New Site in the left sidebar. You only need three things to start:
- Name — human-friendly, shown in the site switcher and history log.
- Domain — the site's canonical domain, e.g.
https://www.example.com. Don't include a path. - Target taxonomy — optional at this stage; you can add tags later via Populate from CMS once the CMS is connected.
After creating the site you land on its overview page. From here you'll open Settings (gear icon on any workflow block, or Settings tab in the page nav) to configure the rest.
Allowed domains — if the client owns multiple domains/subdomains that share content (e.g. www.example.com and docs.example.com), add each to the Allowed domains list in Settings → Overview. This scopes GSC/GA4 sync and crawl discovery to content you actually own.
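The scoping rule is simple host matching; a sketch (the helper name is hypothetical, but this is the behavior to expect):

```python
from urllib.parse import urlparse

def in_scope(url: str, allowed_domains: list[str]) -> bool:
    """Keep only URLs whose hostname appears in the Allowed domains list."""
    host = (urlparse(url).hostname or "").lower()
    return host in {d.lower() for d in allowed_domains}
```

So with `www.example.com` and `docs.example.com` both listed, docs pages stay in scope while third-party hosts are dropped from GSC/GA4 sync and crawl discovery.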
3. Connect GSC + GA4
Click the gear on the Enrich block to land in Settings → Analytics. You'll see a Connect Google button if no OAuth has been linked yet. Click it, sign in with a Google account that has access to the client's GSC + GA4 properties, and approve the requested scopes.
Once connected, use the property pickers to select the GSC property and GA4 property for this site. The GSC picker shows every property your Google account has read access to; same for GA4. Make sure the property actually matches the site's domain — mismatches are the #1 cause of "why is GSC showing 0 clicks?".
Lookback window — default 12 months. You can override per site under Settings → Analytics → Lookback. Longer windows see more data but take longer to sync and use more GSC quota.
After configuring, click Sync GSC and Sync GA4 on the Enrich block. These are background jobs — you can close the page and come back. Progress shows in the banner at the top of the site overview.
4. CMS integration
Most sites are connected via a CMS API instead of being crawled. The CMS returns clean, editorial HTML without template chrome (nav, footer, sidebar), which makes similarity analysis much more reliable.
WordPress
In Settings → Crawl Behavior → CMS Import, set CMS Type to wordpress and enter the site's public endpoint (e.g. https://www.example.com). The importer hits the standard /wp-json/wp/v2/posts endpoints. If the site protects the REST API, supply basic auth credentials in the Auth fields.
Post types — by default we pull posts, pages, and any custom post types you list. Use Exclude post types to skip ones you don't want in the audit (e.g. product_variation, shop_order).
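For reference, paging through the standard posts collection looks roughly like this. The `per_page`/`page` query parameters and the `X-WP-TotalPages` response header are standard WordPress REST API behavior; the function itself is an illustrative sketch (the HTTP call is injected so the paging logic is visible on its own):

```python
from typing import Callable, Iterator

def iter_wp_posts(fetch: Callable[[str], tuple[list[dict], int]],
                  endpoint: str, per_page: int = 100) -> Iterator[dict]:
    """Page through /wp-json/wp/v2/posts.

    `fetch` is a thin wrapper over your HTTP client that returns
    (posts, total_pages); WordPress reports the page count in the
    X-WP-TotalPages response header.
    """
    page = 1
    while True:
        posts, total_pages = fetch(
            f"{endpoint}/wp-json/wp/v2/posts?per_page={per_page}&page={page}")
        yield from posts
        if page >= total_pages:
            break
        page += 1
```

Requesting a page beyond the last one returns an error rather than an empty list, which is why the loop checks `total_pages` instead of fetching until empty.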
Drupal
Set CMS Type to drupal. The importer uses JSON:API (/jsonapi/node/*). If the site uses non-standard taxonomy field names or wants nested paragraphs expanded, you'll need to configure those fields explicitly.
Coming soon (Sprint 26): Drupal integration hardening — configurable taxonomy field names, configurable include paths for nested paragraphs/media, rate-limited HTTP client to play nicely with throttled CMS hosts, plus a Test Connection + schema discovery endpoint. This unblocks sites like resources.altium.com that use non-default field names.
Full import vs Light Refresh
The Import button on the Ingest block defaults to Light Refresh — hash-cached, only re-processes pages whose content changed since the last run. Fast, cheap, idempotent. Use the dropdown to switch to Full Import when you need to re-embed every page (for example, after changing extraction selectors).
5. Crawler settings
For sites without a CMS integration, use the HTML crawler. Configure it in Settings → Crawl Behavior.
- User-Agent — pick a preset from the dropdown (Chrome on Windows is the default) or choose Custom to supply your own. The "Sent as:" line below the dropdown shows the exact string that will be sent — copy it into server-log checks when debugging whether a target site is blocking us.
- Rate limit (req/sec) — default 2 (120 req/min). Lower for fragile hosts; raise only if the target site explicitly allows it.
- Respect robots.txt — default on. Turn off only for clients who have authorized a full crawl on their own site.
- Include / exclude patterns — regex patterns matched against URLs. Use to scope the crawl to editorial content and skip admin, search, archive, and pagination endpoints.
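Taken together, scope filtering plus the rate limit behave roughly like this sketch (names are illustrative; the real crawler's scheduling is more sophisticated than a fixed sleep):

```python
import re
import time

def crawl_plan(urls, include, exclude, rps=2.0):
    """Yield in-scope URLs no faster than `rps` requests per second.

    A URL must match at least one include pattern (when any are set)
    and no exclude pattern. Patterns are regexes matched against the URL.
    """
    inc = [re.compile(p) for p in include]
    exc = [re.compile(p) for p in exclude]
    delay = 1.0 / rps  # default 2 rps -> 0.5 s between requests
    for url in urls:
        if inc and not any(p.search(url) for p in inc):
            continue
        if any(p.search(url) for p in exc):
            continue
        yield url
        time.sleep(delay)
```

For example, `include=[r"/blog/"]` with `exclude=[r"wp-admin", r"[?&]page="]` keeps editorial posts while skipping admin pages and paginated archives.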
7. Similarity thresholds
Defaults are tuned for long-form editorial content on marketing sites. Check Settings → Advanced to adjust per site. Hover the ℹ icons next to each threshold for in-context explanations. Full methodology lives on the Glossary page.
Common tweaks:
- Documentation-heavy sites — drop `chunk_threshold` from 0.80 to ~0.75 if legitimate duplication between spec/reference pages is being flagged.
- News/blog sites — raise `title_threshold` from 0.75 to ~0.80 if near-duplicate headline wording is creating noise (e.g. "5 tips for X" vs "Top 5 X tips").
- Technical taxonomies — lower `tag_overlap_threshold` from 0.5 to ~0.3 when tag vocabularies are narrow and even 1–2 shared tags are a strong topical signal.
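To make the tag-overlap tweak concrete, here is one plausible overlap metric (an assumption for illustration — the product's exact formula lives on the Glossary page):

```python
def tag_overlap(tags_a: set[str], tags_b: set[str]) -> float:
    """Fraction of the smaller tag set that both pages share.
    Illustrative metric; the product's actual formula may differ."""
    if not tags_a or not tags_b:
        return 0.0
    return len(tags_a & tags_b) / min(len(tags_a), len(tags_b))
```

Under this metric, two pages with three tags each and one tag in common score ~0.33: below the default 0.5 threshold, but above a lowered 0.3, which is exactly the narrow-vocabulary case the tweak targets.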
Coming soon (Sprint 23): Reranker-confidence-aware peer ordering on the Similarity page and a no-signal demotion gate on topical_confusion. These don't require any setting changes — they just make existing Phase 2 output more useful.
8. Run your first audit
The happy path, step by step:
- Ingest — click Import on the Ingest block (Light Refresh). Wait for the Round summary to show "ready." If you're using the HTML crawler instead, paste URLs into Crawl from list.
- Enrich — click Sync GSC and Sync GA4. These can run in parallel with each other on the same site. Wait for both to complete.
- Score — click Score on the Analyze block. This is a deterministic structural pass followed by an AI quality verdict per page. Skips pages whose content hasn't changed since the last score.
- Analyze — click Analyze. Runs Phase 1 (fast, free, deterministic) + Phase 2 (cross-encoder reranker, fast, free). Produces the clusters + pairs that populate the Similarity page.
- Validate — click Validate. AI pass that flags title/meta/content mismatches against the target taxonomy.
Total time for a ~1000-page site: ~5–10 minutes for ingest, ~2 minutes for enrich, ~3–5 minutes for score, ~5–15 minutes for analyze, ~5 minutes for validate. Everything is backgrounded — you can leave the page.
Resume behavior for long-running jobs — Score, Analyze, and Validate are all resume-friendly as of Sprint 27. The primary button skips pages or pairs that are already up-to-date (same content hash). When a page's content changes, the associated analyze + validate state is invalidated automatically. Each of Score / Analyze / Validate also has a Re-score all / Re-analyze all / Re-validate all outline button that forces a clean recompute — use these when you've changed thresholds, want a deterministic baseline, or are rebuilding after a config change.
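The resume semantics reduce to a hash comparison plus downstream invalidation, roughly like this sketch (record layout and names are illustrative, not the backend's actual schema):

```python
import hashlib

def refresh_page(page_id: str, new_content: str, state: dict) -> str:
    """Skip when the stored content hash matches; otherwise store the new
    hash and clear score/analyze/validate state so the next runs recompute."""
    h = hashlib.sha256(new_content.encode("utf-8")).hexdigest()
    rec = state.get(page_id)
    if rec and rec["hash"] == h:
        return "skipped"  # up-to-date: the primary button passes over it
    state[page_id] = {"hash": h, "score": None,
                      "analyze": None, "validate": None}
    return "invalidated"
```

The Re-score all / Re-analyze all / Re-validate all buttons are the equivalent of ignoring the stored hash entirely and recomputing every record.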
9. Troubleshooting
❗ Last import failed / Last crawl failed
The Ingest or Crawl section shows a red error pill + Retry button. Click Retry — for imports, this routes to Light Refresh (hash-cached, so already-imported pages skip on re-run). If the same error keeps happening at roughly the same page count, you're probably being rate-limited or blocked by the CMS host. Lower Rate limit (req/sec) in Settings, or ask the host to whitelist the backend IP.
GSC or GA4 sync says "Server disconnected"
Historically caused by HTTP/2 connection drops on the Supabase write leg. Fixed as of Sprint 27 — if you still see this, something is genuinely wrong; check the History page for the full error traceback and escalate.
Site shows "paused" and won't resume
A Reset Processing State (Advanced dropdown on the site overview) leaves the round paused. The Resume button next to the workflow block picks it up — for CMS imports this creates a fresh round (the backend auto-abandons the stuck one), for crawls it re-uses the stored URL list.
Tags aren't extracting
Open the Tag selector tester under Settings → Crawl Behavior → Tag selector. Paste a URL, run the test, and compare what the site actually returns vs what your selectors expect. Remember to HTML-decode (the extractor does this automatically, but if you're debugging raw HTML the test output is post-decode).
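If you're comparing the tester's output against raw page source by hand, remember the decode step sits between them — in Python terms, roughly (the helper name is hypothetical):

```python
import html

def decode_tags(raw_tags: list[str]) -> list[str]:
    """Entity-decode and trim extracted tag strings, mirroring the
    post-decode output the tester shows."""
    return [html.unescape(t).strip() for t in raw_tags]
```

So raw markup like `AI &amp; ML` is shown by the tester as `AI & ML`; matching against the entity-encoded form will never line up.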
Multiple sites syncing at once
Known limitation: kicking off GSC/GA syncs on multiple sites simultaneously can exhaust the backend's httpx connection pool and cause "Server disconnected" on the loser. Stagger syncs between sites (finish one, then start the next). A cross-site job queue is on the roadmap.
10. Per-site settings reference
Field-by-field list of every per-site configuration. Each field has an ℹ hover in the actual Settings UI with examples — this reference is for skimming.
Overview
- `name` — display name
- `domain` — canonical site domain
- `allowed_domains` — list of domains the audit should scope to (GSC/GA filtering, crawl scope)
- `brand_terms` — keywords that mark a GSC query as brand-driven; excluded from cannibalization signal
Crawl Behavior
- `crawler_config.user_agent` — HTTP header sent on crawls
- `crawler_config.rate_limit_rps` — max requests/second
- `crawler_config.respect_robots` — honor robots.txt
- `url_include_patterns` / `url_exclude_patterns` — regex scope filters
- `extraction_selectors` — per-site CSS selector overrides for title, meta, tags, main content, and strip-selectors (boilerplate to remove)
- `cms_type`, `cms_endpoint`, `cms_auth`, `cms_config` — CMS integration (WordPress/Drupal)
Tag Handling
- `target_taxonomy` — canonical tag list
- `taxonomy_metadata` — friendly names, parent relationships, source
- `extraction_selectors.use_taxonomy_as_whitelist` — enforce whitelist mode
Analytics
- `gsc_property` — GSC property URL (sc-domain or https://)
- `ga4_property_id` — GA4 numeric property ID
- `analytics_url_normalization` — lookback_days/months, keep_query_params, strip_all_query_params
Advanced
- `similarity_config.title_threshold`, `chunk_threshold`, `chunk_overlap_threshold`
- `similarity_config.query_overlap_threshold`, `impression_overlap_threshold`, `cannibalization_*_floor`
- `similarity_config.tag_overlap_threshold`
- `similarity_config.partial_dupe_title_gate`, `partial_dupe_headings_gate`
- `auto_score_after_crawl` — enqueue a score job automatically when a crawl or import finishes
- `deep_analysis_opted_in` — skip Phase 3 cost-confirmation prompt