Walsh-Research Bot Compliance Specification

1. Status
- 1.1. Interactive REPL use
2. How to use this document
3. Conformance language
4. Versioning and identity
5. Canonical resources
6. Normative constants
7. Pre-request gate
8. Identity — R1
- 8.1. R1 — User-Agent (MUST)
9. Pre-request gates — R2, R3, R4, R5
- 9.1. R2 — robots.txt, RFC 9309 (MUST)
- 9.2. R3 — Operator blocklist (MUST)
- 9.3. R4 — Rate limiting and Crawl-delay (MUST)
- 9.4. R5 — Backoff and Retry-After (MUST)
10. Scope and data-handling limits — R6, R7, R8, R9
- 10.1. R6 — Stay in scope (MUST / MUST NOT)
- 10.2. R7 — Cache-friendly and de-duplicating (SHOULD)
- 10.3. R8 — Frequency (SHOULD)
- 10.4. R9 — Prefer structured formats; never scrape HTML (SHOULD / MUST NOT)
11. Caching and persistence — R10, R11, R12
- 11.1. R10 — Blocklist schema: inline-first, external fallback (MUST)
- 11.2. R11 — Persistent caches (SHOULD)
- 11.3. R12 — Cache-miss behavior (clarification)
12. Conformance test vectors
- 12.1. robots.txt group selection (token Walsh-Research)
- 12.2. robots.txt path matching (within the selected group)
- 12.3. Blocklist domain matching (listed domain example.com)
- 12.4. URL canonicalization (for dedup)
- 12.5. Retry-After parsing (now = Sat, 23 May 2026 00:00:00 GMT)
- 12.6. ISO-8601 refresh / TTL durations
13. Conformance fixtures
14. Attestation
15. Conformance checklist
- 15.1. Reference-implementation notes
16. Self-test canaries (dogfood)
17. Opt-out workflow (operator side)
18. Build recipe
19. Deferred (out of scope for v1.2)
20. Porting notes
21. References
22. Changelog
- 22.1. R13: Implementation tagging (SHOULD)

1. Status

Spec version: walsh-research-compliance/v1.3 (2026-05-24). Supersedes v1.2.
Bot identity governed: Walsh-Research/MAJOR.MINOR (currently 1.2).
Applies to: every tool that fetches third-party resources under the Walsh-Research identity — crawlers, agents, one-off scripts — in any language.
Reference implementations: the Clojure crawler jwalsh/tech-crawler and a Python conformance harness (see 15.1). References, not the contract; this document is the contract.
Authoritative public surface: https://wal.sh/bot/ and the /.well-known/walsh-research/ documents (5).

This is the single, authoritative specification for the Walsh-Research bot contract. It is written to be build-complete: a competent implementer (human or LLM) given only this document and the URLs it names can produce a conformant Walsh-Research/x.y tool in any language, then verify it against the conformance test vectors (12) and the live conformance fixtures (13). It specifies behavior and data contracts, never a particular implementation.

A note on what "conformant" means: the conformant entity is the tool a publisher runs under the Walsh-Research/1.2 identity, bound to this contract. A general-purpose LLM that fetches a URL while reasoning is not a Walsh-Research bot, and generating code that could be such a tool does not make the generator one. Conformance is a property of the bound, deployed tool.

1.1. Interactive REPL use

A human driving a Walsh-Research implementation interactively at a REPL is not an automated crawler. RFC 9309 governs automated access; a person typing (http-get "https://example.com") and reading the output is interactive research, not bot activity. The UA string identifies the tool lineage, not the automation level.

Interactive REPL sessions that use the Walsh-Research UA:

Are not bound by R4 (throttle) or R5 (backoff) — human typing speed is the natural rate limiter.
SHOULD still honor R2 (robots.txt) and R3 (blocklist) as a matter of operational courtesy, but a human researcher exercising a single URL at a prompt is not the scenario these gates were designed to protect against.
SHOULD tag requests as described in R13, using sha=dev or sha=repl to distinguish interactive exploration from automated runs.
MUST use the same Walsh-Research UA (R1) so the operator can identify the tool family in access logs.

The distinction: an automated gmake crawl is a bot. A human at ghci> or user> or >>> is a researcher using the same code.

If this spec and https://wal.sh/bot/ ever disagree, https://wal.sh/bot/ is authoritative for what we promise the public; this spec is authoritative for how a conformant tool must behave.

2. How to use this document

Read Conformance language (3), Versioning and identity (4), and the Normative constants table (6).
Implement the pre-request gate pipeline (7) and requirements R1–R12. Each states its inputs, outputs, ordering, edge cases, and exact constants: everything needed to implement it with no further reference.
Verify against the conformance test vectors (12; offline input -> expected pairs) and the live conformance fixtures (13; URLs you can fetch).
Self-audit with the conformance checklist (15); optionally emit an attestation document (14).
Follow the build recipe (18) for a suggested order of work.

A note on the code. Any code in this document is Clojure, shown for demonstration only (Clojure is the reference implementation's language). It is illustrative, not prescriptive: conformance is defined by the behavior, contracts, constants, and test vectors (12) here, never by these snippets. The compliance implementation team chooses its own language and design; the porting notes (20) map the handful of primitives across languages.

3. Conformance language

The key words MUST, MUST NOT, SHOULD, SHOULD NOT, and MAY are used as in RFC 2119 / RFC 8174. A tool is conformant with a spec version when it satisfies every MUST and MUST NOT of that version. SHOULD-level items are expected; their absence must be deliberate and justifiable.

"A tool" means the software acting under the Walsh-Research identity. "A host" means a registered domain name plus optional port. "A target" means a single URL the tool has been explicitly configured to fetch.

4. Versioning and identity

Three independently-versioned contracts:

Bot product version — Walsh-Research/MAJOR.MINOR, carried in the User-Agent (R1). It identifies the deployed bot line, not the spec version. robots.txt matching uses only the bare product token Walsh-Research (R2), never the version.
Spec version — walsh-research-compliance/vN[.M]. A breaking change to a MUST bumps the major; additive MUSTs / promoted SHOULDs bump the minor.
Data contracts — walsh-research-blocklist/vN, walsh-research-test-fixtures/vN, walsh-research-attestation/vN, each versioned on its own.

A conformant tool MUST be able to state which spec version it targets (e.g. in its attestation, 14). A tool targeting an older spec version remains conformant against that version. The product version and the spec version move independently: a tool may bump its product version (Walsh-Research/1.2) or its targeted spec revision without the two being locked together, and a future Walsh-Research/1.4 could still target walsh-research-compliance/v1.3.

A tool MAY additionally annotate the spec version it targets out of band in its attestation document. A tool MUST NOT alter the exact R1 User-Agent to carry it (R1 is an exact-match MUST); any extra UA comment token MUST NOT affect the Walsh-Research robots token used for matching.

5. Canonical resources

A tool MUST treat these URLs as the live source of truth (no bundled copy may override a successfully fetched live document):

Resource	URL
Public policy	https://wal.sh/bot/
This spec	https://wal.sh/research/bots/compliance-spec
Blocklist data	https://wal.sh/.well-known/walsh-research/blocklist.json
Blocklist schema	https://wal.sh/.well-known/walsh-research/blocklist.schema.json
Test fixtures	https://wal.sh/.well-known/walsh-research/test-fixtures.json

The blocklist document is self-describing: it carries its own JSON Schema inline under a top-level schema key (R3c), so a tool can fetch one URL and validate it without a second request. The standalone blocklist.schema.json remains the canonical copy (for external tooling and as the schema's $id) and is identical.

From https://wal.sh/bot/ alone a tool can discover this spec; from the spec, the blocklist (schema included) and fixtures; from the fixtures, every URL needed to self-verify. The whole bootstrap is reachable from one starting URL.

The blocklist and fixtures contracts and the schema are published under no restrictions and may be reused by other operators.

6. Normative constants

These are the only tunables. Values are defaults a tool SHOULD use unless a fetched document dictates otherwise (e.g. refresh, Crawl-delay, Retry-After).

Constant	Value	Requirement
`USER_AGENT`	`Mozilla/5.0 (compatible; Walsh-Research/1.2; +https://wal.sh/bot/)`	R1 (exact)
`UA_TOKEN`	`Walsh-Research` (matched case-insensitively)	R1, R2
`ROBOTS_TTL_SECONDS`	`86400` (24h)	R2
`BLOCKLIST_REFRESH`	document `refresh` field; default `PT6H` (`21600` s)	R3
`SCHEMA_TTL_SECONDS`	`604800` (7d)	R10
`MIN_REQUEST_INTERVAL`	`1.0` s per host	R4
`DEFAULT_MAX_RETRIES`	`5`	R5
`DEFAULT_BACKOFF_BASE`	`1.0` s (exponential)	R5
`BACKOFF_CAP`	`60` s	R5
`STALE_HARD_CUTOFF`	`2 ×` the relevant TTL (only if stale-while-revalidate is used)	R12

7. Pre-request gate

For every target URL, a tool MUST evaluate the gates below in this exact order. The first gate that returns DENY stops the request; no later gate or fetch runs. The order is normative and non-commutative: R3 precedes R2 because an operator opt-out is absolute and may apply to a host whose robots.txt is itself unreachable; R4 follows both because pacing a request we will deny wastes wall clock.

Listing 1: Pre-request gate: the R3 -> R2 -> R4 decision flow; the first gate to deny stops the request.

// Walsh-Research pre-request gate: blocklist -> robots.txt -> throttle -> fetch
digraph pre_request_gate {
    rankdir=TB;
    bgcolor=white;
    node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10];
    edge [color="#888888", fontcolor="#555555", fontname="Helvetica", fontsize=9];

    url [label="Target URL", fillcolor="#dbeafe", color="#1d4ed8", fontcolor="#1d4ed8"];

    // Gate 1: Operator blocklist (R3)
    subgraph cluster_r3 {
        label="R3: Operator Blocklist"; style="rounded"; color="#b91c1c"; fontcolor="#b91c1c";
        fontname="Helvetica"; fontsize=11;
        blocklist [label="Fetch blocklist.json\n(cached, PT6H refresh)",
                   fillcolor="#fee2e2", color="#b91c1c", fontcolor="#b91c1c"];
        bl_check [label="Domain in\nblocked[]?",
                  shape=diamond, fillcolor="#fee2e2", color="#b91c1c", fontcolor="#b91c1c"];
    }

    // Gate 2: robots.txt (R2)
    subgraph cluster_r2 {
        label="R2: robots.txt (RFC 9309)"; style="rounded"; color="#6b21a8"; fontcolor="#6b21a8";
        fontname="Helvetica"; fontsize=11;
        robots [label="Fetch /robots.txt\n(cached, 24h TTL)",
                fillcolor="#ede9fe", color="#6b21a8", fontcolor="#6b21a8"];
        group [label="Select group:\nWalsh-Research > *",
               fillcolor="#ede9fe", color="#6b21a8", fontcolor="#6b21a8"];
        disallow [label="Path\ndisallowed?",
                  shape=diamond, fillcolor="#ede9fe", color="#6b21a8", fontcolor="#6b21a8"];
    }

    // Gate 3: Rate limit (R4)
    subgraph cluster_r4 {
        label="R4: Rate Limit"; style="rounded"; color="#b45309"; fontcolor="#b45309";
        fontname="Helvetica"; fontsize=11;
        throttle [label="Throttle\nmax(1s, Crawl-delay)",
                  fillcolor="#fef3c7", color="#b45309", fontcolor="#b45309"];
    }

    // Outcomes
    fetch [label="FETCH\n(with backoff R5)",
           fillcolor="#dcfce7", color="#15803d", fontcolor="#15803d"];
    deny [label="DENY\n(skip URL)",
          fillcolor="#fee2e2", color="#b91c1c", fontcolor="#b91c1c"];

    // Flow
    url -> blocklist;
    blocklist -> bl_check;
    bl_check -> deny [label="yes", color="#b91c1c", fontcolor="#b91c1c"];
    bl_check -> robots [label="no"];
    robots -> group;
    group -> disallow;
    disallow -> deny [label="yes", color="#b91c1c", fontcolor="#b91c1c"];
    disallow -> throttle [label="no"];
    throttle -> fetch;
}

;; Demo (Clojure). The gate order is normative; the language is not.
(defn may-fetch? [url]
  (cond
    (operator-blocklist-blocks? url) :deny   ; R3, checked FIRST
    (not (robots-allows? url))       :deny   ; R2, RFC 9309
    :else (do (throttle! url)                ; R4, blocks until the slot is free
              :allow)))

If may-fetch? returns :allow, the tool issues the request with conditional headers (R7) and applies backoff (R5) to 429/503 responses.

The same lifecycle as a sequence over time — one target URL, the gates in order, and the participants each step talks to:

Listing 2: Request lifecycle as a sequence: one target URL, the gates in order, and the participant each step talks to.

// Walsh-Research request lifecycle sequence (R3 -> R2 -> R4 -> fetch -> R5/R7)
digraph request_sequence {
    rankdir=LR;
    graph [bgcolor="white", fontname="Helvetica", fontsize=11,
           pad="0.3", nodesep="0.4", ranksep="0.8"];
    node  [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10];
    edge  [fontname="Helvetica", fontsize=9, color="#888888"];

    bot       [label="Walsh-Research\nbot",      fillcolor="#dbeafe", color="#1d4ed8", fontcolor="#1d4ed8"];
    blocklist [label="wal.sh\nblocklist.json",   fillcolor="#fee2e2", color="#b91c1c", fontcolor="#b91c1c"];
    robots    [label="target host\n/robots.txt", fillcolor="#ede9fe", color="#6b21a8", fontcolor="#6b21a8"];
    target    [label="target\nURL",              fillcolor="#dcfce7", color="#15803d", fontcolor="#15803d"];

    // R3: operator blocklist, checked first
    bot -> blocklist [label="1. GET blocklist.json\n(cached PT6H)", color="#b91c1c"];
    blocklist -> bot [label="2. list + inline schema\nvalidate; host blocked?", style=dashed, color="#b91c1c"];

    // R2: robots.txt (RFC 9309)
    bot -> robots [label="3. GET /robots.txt\n(cached 24h)", color="#6b21a8"];
    robots -> bot [label="4. select Walsh-Research\ngroup; longest-match\nallow/deny", style=dashed, color="#6b21a8"];

    // R4 throttle, conditional fetch (R7), backoff (R5)
    bot -> target [label="5. throttle max(1s, Crawl-delay)\nthen GET (If-None-Match /\nIf-Modified-Since)", color="#b45309"];
    target -> bot [label="6. 200 +ETag / 304\n(429/503 -> R5 backoff)", style=dashed, color="#15803d"];

    note1 [label="R3 before R2:\nopt-out is absolute,\nhonored even if\nrobots is unreachable",
           shape=note, fillcolor="#fef3c7", color="#b45309", fontcolor="#b45309", fontsize=9, style="filled"];
    note2 [label="metadata only;\nno recursion,\nno sub-resources (R6)",
           shape=note, fillcolor="#dcfce7", color="#15803d", fontcolor="#15803d", fontsize=9, style="filled"];

    { rank=same; blocklist; note1; }
    { rank=same; target; note2; }
    note1 -> blocklist [style=dotted, color="#b45309", arrowhead=none, constraint=false];
    note2 -> target    [style=dotted, color="#15803d", arrowhead=none, constraint=false];
}

Listing 3: Backoff state machine (R5): retry 429/503 with Retry-After or exponential backoff plus jitter, bounded by max retries.

// Walsh-Research backoff state machine (R5): retry on 429/503
digraph backoff_states {
    rankdir=LR;
    bgcolor=white;
    node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10];
    edge [color="#888888", fontcolor="#555555", fontname="Helvetica", fontsize=9];

    fetch [label="Fetch URL",
           fillcolor="#dbeafe", color="#1d4ed8", fontcolor="#1d4ed8"];
    check [label="Status?",
           shape=diamond, fillcolor="#fef3c7", color="#b45309", fontcolor="#b45309"];
    success [label="200/304\nProcess",
             fillcolor="#dcfce7", color="#15803d", fontcolor="#15803d"];
    wait [label="Wait\nRetry-After or\n2^n x base x rand",
          fillcolor="#fef9c3", color="#a16207", fontcolor="#a16207"];
    retry_check [label="n < max?",
                 shape=diamond, fillcolor="#fef3c7", color="#b45309", fontcolor="#b45309"];
    abort [label="Abort\nlog + wait 1h",
           fillcolor="#fee2e2", color="#b91c1c", fontcolor="#b91c1c"];

    fetch -> check;
    check -> success [label="2xx/3xx"];
    check -> wait [label="429/503"];
    check -> abort [label="4xx/5xx\n(other)"];
    wait -> retry_check;
    retry_check -> fetch [label="yes\nn++"];
    retry_check -> abort [label="no\n(max retries)"];
}

8. Identity — R1

8.1. R1 — User-Agent (MUST)

Every request MUST carry the exact USER_AGENT header value from the Normative constants table (6).

The product token for robots.txt matching is Walsh-Research, matched case-insensitively (R2).
The token MUST be followed by a MAJOR.MINOR product version, so the declared identity is Walsh-Research/<major>.<minor> (currently Walsh-Research/1.2). The version is part of the exact UA string but is not used for robots matching (R2 keys on the bare Walsh-Research token); it lets operators read the bot's product line from their access logs and lets the product version move independently of the spec version (4).
The + URL MUST resolve to a public policy page describing the bot and how to opt out.
A tool MUST NOT impersonate a browser or another crawler, or omit the token, or rotate/randomize the User-Agent to evade rules.

9. Pre-request gates — R2, R3, R4, R5

9.1. R2 — robots.txt, RFC 9309 (MUST)

A tool MUST fetch /robots.txt over the target's scheme+host, parse it per RFC 9309, and obey the rules selected for the Walsh-Research token before fetching any target on that host. It SHOULD cache robots.txt per host for ROBOTS_TTL_SECONDS.

9.1.1. Fetch and error handling

Fetch SCHEME://HOST[:PORT]/robots.txt.
200 -> parse the body.
4xx (including 404) -> treat as no rules: everything allowed.
5xx, network error, or timeout -> a tool SHOULD fail open (allow) but MAY defer the host; it MUST NOT treat an unreachable robots.txt as a blanket disallow that silently drops a host forever.
A body over 500 KiB MAY be truncated to 500 KiB before parsing.
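
The fetch outcomes above can be sketched as a small decision function. This is a minimal illustration, not the reference implementation: the names robots-outcome and truncate-body and the outcome keywords are hypothetical, and treating any status outside 200/4xx like an error (fail open) is an assumption made here for anything the list does not name.

```clojure
(def max-robots-bytes (* 500 1024)) ; 500 KiB parse limit

(defn robots-outcome
  "Maps a robots.txt fetch result to a policy keyword.
  status is the HTTP status code, or nil for a network error or timeout."
  [status]
  (cond
    (nil? status)       :fail-open   ; unreachable: allow (MAY defer), never a permanent deny
    (= 200 status)      :parse       ; parse the body
    (<= 400 status 499) :allow-all   ; 4xx including 404: no rules
    :else               :fail-open)) ; 5xx; other codes treated the same way (assumption)

(defn truncate-body
  "Returns body cut to the first 500 KiB when it is longer than that."
  [^bytes body]
  (if (> (alength body) max-robots-bytes)
    (java.util.Arrays/copyOf body (int max-robots-bytes))
    body))
```

A caller that receives :fail-open may still choose to defer the host and retry later; what it must not do is record the host as permanently disallowed.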

9.1.2. Group parsing

A group is one or more consecutive User-agent lines followed by the rule lines (Allow, Disallow, Crawl-delay) that apply to them. A User-agent line appearing after a rule line starts a new group. Lines are field: value; # begins a comment; field names are case-insensitive; surrounding whitespace is trimmed; blank lines are ignored. Records with the same User-agent value merge.
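
As an illustration of these parsing rules, the sketch below folds the lines of a robots.txt body into a map keyed by lowercased User-agent value. The function names and the rule shape ({:field ... :value ...}) are hypothetical choices for this example; unknown fields such as Sitemap are simply skipped.

```clojure
(require '[clojure.string :as str])

(defn parse-line
  "Returns [lowercased-field value] for a 'field: value' line,
  or nil for blank, comment-only, or colon-less lines."
  [line]
  (let [no-comment    (str/trim (first (str/split line #"#" 2)))
        [field value] (str/split no-comment #":" 2)]
    (when (and value (not (str/blank? field)))
      [(str/lower-case (str/trim field)) (str/trim value)])))

(defn parse-groups
  "Parses robots.txt text into {lowercased-user-agent [rule ...]}.
  Groups that name the same User-agent value are merged."
  [text]
  (let [rule-fields #{"allow" "disallow" "crawl-delay"}
        step (fn [{:keys [agents in-rules groups] :as state} [field value]]
               (cond
                 (= field "user-agent")
                 (let [ua (str/lower-case value)]
                   ;; a User-agent line after a rule line starts a new group
                   {:agents   (if in-rules [ua] (conj agents ua))
                    :in-rules false
                    :groups   (update groups ua #(or % []))})

                 (and (seq agents) (rule-fields field))
                 {:agents   agents
                  :in-rules true
                  :groups   (reduce (fn [g ua] (update g ua conj {:field field :value value}))
                                    groups
                                    agents)}

                 :else state))]
    (:groups (reduce step
                     {:agents [] :in-rules false :groups {}}
                     (keep parse-line (str/split-lines text))))))
```

Two consecutive User-agent lines share every rule that follows them, which is why each rule is appended to all agents of the current group.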

9.1.3. Group selection (RFC 9309 §2.2.1, longest match)

Let T = "walsh-research" (lowercased token).

For each User-agent value U (lowercased), U matches T when any of:

U = "*" (wildcard)
U = T (exact match)
U is a prefix of T (walsh matches walsh-research)

Selection precedence:

The non-* group whose matching U is longest wins.
If no named group matches, use the * group.
If no * group exists, no rules apply (allow all).

A named group therefore overrides * in both directions: it can disallow where * allows, and allow where * disallows. Only the selected group's rules are consulted.

A tool that wraps a parser lacking longest-match selection MUST add a selection layer.

\* Group selection algorithm (RFC 9309 §2.2.1)
SelectGroup(groups, T) ==
  LET matches == {g \in groups :
        LET U == Lowercase(g.user_agent)
        IN  U = T \/ IsPrefix(U, T)}
      best == CHOOSE g \in matches :
        \A h \in matches : Len(g.user_agent) >= Len(h.user_agent)
  IN  IF matches /= {} THEN best
      ELSE IF \E g \in groups : g.user_agent = "*" THEN
           CHOOSE g \in groups : g.user_agent = "*"
      ELSE NoRules
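
The same selection can be sketched in Clojure. This is a demonstration under stated assumptions: groups is a map from User-agent value to that group's rules (a hypothetical shape), and the function returns the selected [user-agent rules] entry with the User-agent lowercased, or nil when no rules apply.

```clojure
(require '[clojure.string :as str])

(defn select-group
  "Returns the [user-agent rules] entry selected for token, or nil when no rules apply.
  A named group matches when its User-agent equals, or is a prefix of, the token."
  [groups token]
  (let [t     (str/lower-case token)
        by-ua (into {} (map (fn [[ua rules]] [(str/lower-case ua) rules])) groups)
        named (filter (fn [[ua _]]
                        (and (seq ua) (not= ua "*") (str/starts-with? t ua)))
                      by-ua)]
    (if (seq named)
      (apply max-key (fn [[ua _]] (count ua)) named) ; longest matching name wins
      (find by-ua "*"))))                            ; else the * group, else nil
```

For example, (select-group {"walsh" [:x] "walsh-research" [:y]} "Walsh-Research") selects the "walsh-research" entry because it is the longer match.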

9.1.4. Path matching within the selected group (RFC 9309 §2.2.2–2.2.3)

Match the request's path + query (default / when empty) against the group's Allow and Disallow rules:

A rule value "" matches nothing (an empty Disallow allows all).
* matches any sequence (including none); $ anchors to the end of path+query.
Otherwise a rule matches if path+query starts with the rule value (after expanding any *).
Compare percent-encoding consistently (decode unreserved octets on both sides).

Decision: among matching rules, the longest value wins (an expanded * counts as the characters it consumed). On a length tie, Allow wins. A winning Disallow -> denied; a winning Allow, or no match -> allowed.

A tool that does not implement Allow or * / $ MAY degrade only in the safe direction: it MUST NOT fetch a URL the full algorithm would deny (no under-blocking), but it MAY over-block a URL an Allow would have permitted. Under-blocking is a violation; conservative over-blocking is not.
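
A minimal sketch of the matching and decision steps follows. It rests on assumptions that should be read as such: each rule is translated to a regular expression, a rule's length is taken as the number of path characters it consumes with each * expanded as little as possible (one reading of "the characters it consumed"), and percent-encoding normalization is left out for brevity. The function names and the rule shape are hypothetical.

```clojure
(require '[clojure.string :as str])
(import 'java.util.regex.Pattern)

(defn rule->pattern
  "Compiles a rule value to a regex anchored at the start of the path:
  * becomes a lazy wildcard, a trailing $ becomes an end anchor."
  [value]
  (let [anchored? (str/ends-with? value "$")
        body      (if anchored? (subs value 0 (dec (count value))) value)
        parts     (str/split body #"\*" -1)]
    (re-pattern (str "^"
                     (str/join ".*?" (map #(Pattern/quote %) parts))
                     (when anchored? "$")))))

(defn match-length
  "Number of path characters the rule consumes, or nil when it does not match.
  An empty rule value matches nothing."
  [value path]
  (when (seq value)
    (some-> (re-find (rule->pattern value) path) count)))

(defn allowed?
  "rules: seq of {:field \"allow\"|\"disallow\" :value string}; path: path plus query."
  [rules path]
  (let [path    (if (str/blank? path) "/" path)
        matches (for [{:keys [field value]} rules
                      :when (#{"allow" "disallow"} field)
                      :let  [n (match-length value path)]
                      :when n]
                  [n (if (= field "allow") 1 0)])]
    ;; longest match wins; on a tie the Allow entry (flag 1) sorts last and wins
    (or (empty? matches)
        (= 1 (second (last (sort matches)))))))
```

Sorting [length flag] pairs keeps the tie-break in one place: among equal lengths, Allow sorts after Disallow and is therefore the winner.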

9.1.5. Crawl-delay (de-facto extension, MUST honor)

Crawl-delay is not in RFC 9309 but is widely deployed; a tool MUST honor it when present in the selected group as the minimum seconds between requests to that host (R4). Fractional allowed; ignore a non-numeric value.
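
A small sketch of parsing the value and folding it into the per-host interval. The function names are hypothetical, and rejecting negative or signed values as non-numeric is an assumption of this example.

```clojure
(require '[clojure.string :as str])

(defn parse-crawl-delay
  "Returns the delay in seconds as a double, or nil for a missing or non-numeric value."
  [value]
  (let [v (some-> value str/trim)]
    (when (and v (re-matches #"\d+(\.\d+)?" v))
      (Double/parseDouble v))))

(defn request-interval
  "Minimum spacing in seconds: the larger of the default interval and Crawl-delay."
  [min-interval crawl-delay-value]
  (max min-interval (or (parse-crawl-delay crawl-delay-value) 0.0)))
```

With the default interval of 1.0 s, a Crawl-delay of 0.2 leaves the spacing at 1.0 s, while a Crawl-delay of 10 raises it to 10 s.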

9.1.6. Opt-out a site can self-serve

User-agent: Walsh-Research
Disallow: /

9.2. R3 — Operator blocklist (MUST)

Some opt-outs arrive out of band (email to j@wal.sh). They live in one published JSON document that every Walsh-Research tool MUST consult, so an opt-out is honored uniformly across tools and languages. It is checked before robots.txt and is in addition to it.

Contract walsh-research-blocklist/v1:

{
  "$schema":  "https://wal.sh/.well-known/walsh-research/blocklist.schema.json",
  "contract": "walsh-research-blocklist/v1",
  "updated":  "2026-05-23T00:00:00Z",
  "operator": "Jason Walsh",
  "contact":  "j@wal.sh",
  "policy":   "https://wal.sh/bot/",
  "refresh":  "PT6H",
  "blocked": [
    { "domain": "example.com", "added": "2026-05-23", "reason": "email opt-out" }
  ]
}

The example above omits the inline schema object for brevity; in the live document the schema field holds the complete Draft 2020-12 schema, byte-identical to the standalone blocklist.schema.json.

Field	Type	Meaning
`contract`	string	`walsh-research-blocklist/vN`; reject an unknown major
`updated`	string	RFC 3339 timestamp the list last changed
`refresh`	string	ISO-8601 duration; cache TTL (`PT6H` = 6h)
`schema`	object	inline Draft 2020-12 schema; validate the document against it (R3c)
`blocked[]`	array	opt-out entries
`blocked[].domain`	string	registered domain (apex) to block
`blocked[].added`	string	date added (informational)
`blocked[].reason`	string	reason (informational)

Requirements:

A tool MUST skip any request whose host equals a listed domain or is a subdomain of it. Match case-insensitively with port stripped: host H is blocked by domain D iff H = D or H ends with "." + D. (example.com blocks example.com, www.example.com; not notexample.com or example.com.evil.test.)
(R3c, MUST) A tool MUST validate the document before adopting it, and MUST NOT adopt one that fails validation or whose contract major is unknown — retain the previously held list instead. The document is self-describing: a tool SHOULD validate against the inline schema field (single fetch); if it is absent, fall back to the standalone blocklist.schema.json (R10). Validating a document against its own embedded schema is a structural check (it catches malformed or wrong-contract data), not anti-tamper; transport integrity (HTTPS to the canonical origin) covers tampering.
A tool SHOULD cache the list for the refresh duration. Timing is dictated by the data, not hard-coded.
(R3b, MUST) On fetch failure, timeout, or 404, retain the last-known list: a transient outage MUST NOT silently un-block an opt-out. The first-ever fetch failing yields an empty list (fail open) — nothing is protected yet.
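
The host-matching rule in the first requirement can be sketched as follows. The names strip-port and blocked-host? are hypothetical, and hosts with a trailing dot are not normalized in this sketch.

```clojure
(require '[clojure.string :as str])

(defn strip-port
  "Removes a trailing :port from a host string."
  [host]
  (str/replace host #":\d+$" ""))

(defn blocked-host?
  "True when host equals a listed domain or is a subdomain of one.
  domains is a seq of strings taken from blocked[].domain."
  [host domains]
  (let [h (str/lower-case (strip-port host))]
    (boolean (some (fn [domain]
                     (let [d (str/lower-case domain)]
                       (or (= h d)
                           (str/ends-with? h (str "." d)))))
                   domains))))
```

Requiring the "." before the domain is what keeps notexample.com from matching example.com, and comparing against the end of the host is what keeps example.com.evil.test from matching.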

9.3. R4 — Rate limiting and Crawl-delay (MUST)

Requests MUST be serial within a process: no concurrent connections.
A tool MUST NOT exceed one request per second per host.
Minimum spacing before the next request to a host is max(MIN_REQUEST_INTERVAL, Crawl-delay) (R2).
The first request to a previously-unseen host MUST NOT be delayed.
The throttle key is the host.

;; Demo (Clojure). Blocks the caller until the host's slot is free.
;; Assumes now returns seconds, sleep takes seconds, and robots-crawl-delay
;; returns nil when the selected group carries no Crawl-delay.
(defn throttle! [url]
  (let [host      (host-of url)
        interval  (max min-request-interval (or (robots-crawl-delay host) 0))
        last-time (get @last-request-time host)]   ; nil => unseen => no wait
    (when last-time
      (let [wait (- (+ last-time interval) (now))]
        (when (pos? wait) (sleep wait))))
    (swap! last-request-time assoc host (now))))

(R4a, informational.) "Serial within a process" means a deployment that runs multiple instances MUST add cross-process coordination (a file lock, shared store, or token bucket) to keep the per-host ceiling. Single-instance scheduling is the recommended default.

9.4. R5 — Backoff and Retry-After (MUST)

On HTTP 429 or 503 a tool MUST back off and retry. Other 5xx MAY be retried with the same policy.

If Retry-After is present, wait at least that long, using it in place of the computed delay. Retry-After is either integer seconds or an HTTP-date (RFC 9110); for the date form compute the delay to that instant (clamp negatives to 0).
Otherwise use exponential backoff with full jitter: delay = random_between(0, min(BACKOFF_CAP, DEFAULT_BACKOFF_BASE * 2^attempt)), attempt starting at 0.
A tool MUST bound retries at DEFAULT_MAX_RETRIES, then give up for that target and record the failure.
A tool MUST NOT retry faster than the schedule, and MUST NOT treat a 429/503 as success.
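
The delay rules above can be sketched as two functions. This is illustrative only: the function names are hypothetical, the random source is injectable so the result can be checked deterministically, and the weekday name of an HTTP-date is dropped before parsing (an assumption, so that only the date and time decide the instant). An unparseable value yields nil, in which case a caller would fall back to the computed backoff.

```clojure
(require '[clojure.string :as str])
(import '[java.time Duration ZonedDateTime]
        '[java.time.format DateTimeFormatter DateTimeParseException])

(def backoff-base 1.0) ; DEFAULT_BACKOFF_BASE, seconds
(def backoff-cap 60)   ; BACKOFF_CAP, seconds

(defn backoff-delay
  "Full-jitter delay in seconds for a zero-based attempt.
  rand-fn returns a double in [0, 1); it defaults to clojure.core/rand."
  ([attempt] (backoff-delay attempt rand))
  ([attempt rand-fn]
   (* (rand-fn) (min backoff-cap (* backoff-base (Math/pow 2 attempt))))))

(defn retry-after-seconds
  "Parses a Retry-After value (integer seconds or HTTP-date) into a
  non-negative delay in seconds; now is a java.time.Instant.
  Returns nil when the value cannot be parsed."
  [value now]
  (let [v (str/trim value)]
    (if (re-matches #"\d+" v)
      (Long/parseLong v)
      (try
        (let [no-weekday (str/replace v #"^[A-Za-z]{3},\s*" "")
              at         (.toInstant (ZonedDateTime/parse
                                      no-weekday
                                      DateTimeFormatter/RFC_1123_DATE_TIME))]
          (max 0 (.getSeconds (Duration/between now at)))) ; clamp past dates to 0
        (catch DateTimeParseException _ nil)))))
```

A retry loop would call retry-after-seconds first and use backoff-delay only when it returns nil or the header is absent, stopping after DEFAULT_MAX_RETRIES attempts.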

10. Scope and data-handling limits — R6, R7, R8, R9

10.1. R6 — Stay in scope (MUST / MUST NOT)

MUST NOT follow links or crawl recursively; fetch only declared targets.
MUST NOT download sub-resources (images, scripts, stylesheets, fonts).
MUST NOT train models on, index for search, or republish fetched content. Extract metadata, not content.
MUST NOT retain fetched content beyond what the immediate analysis needs.

10.2. R7 — Cache-friendly and de-duplicating (SHOULD)

Conditional requests. Store the ETag response header and replay it as If-None-Match; store Last-Modified and replay as If-Modified-Since. On 304 Not Modified: treat as "no new data", do no further processing, keep the stored validators. On 200: process and refresh the validators.
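
A sketch of the validator round trip, assuming a response is represented as a map with :status and a :headers map whose names are already lowercased (both are assumptions of this example, as are the function names).

```clojure
(defn conditional-headers
  "Builds request headers from stored validators {:etag ... :last-modified ...}."
  [validators]
  (cond-> {}
    (:etag validators)          (assoc "If-None-Match" (:etag validators))
    (:last-modified validators) (assoc "If-Modified-Since" (:last-modified validators))))

(defn update-validators
  "Returns the validators to store after a response.
  304 keeps the stored ones; 200 replaces them; anything else leaves them unchanged."
  [stored {:keys [status headers]}]
  (if (= 200 status)
    {:etag          (get headers "etag")
     :last-modified (get headers "last-modified")}
    stored))
```

A source that has never been fetched has no validators, so conditional-headers returns an empty map and the first request is unconditional.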

Canonical-URL de-duplication. Dedup results by a canonical key:

Lowercase scheme and host.
Drop a default port (80 http, 443 https).
Remove the fragment (from #).
Empty path -> /; else strip a single trailing / (never reduce / to empty).
Preserve the query string verbatim (do not reorder or drop parameters).

Canonical string: scheme://host[:non-default-port]path[?query].
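
The canonicalization steps can be sketched with java.net.URI. The function name is hypothetical, and the sketch assumes an absolute http(s) URL with a host that java.net.URI can parse; inputs outside that shape are not handled.

```clojure
(require '[clojure.string :as str])
(import 'java.net.URI)

(defn canonical-url
  "Returns the dedup key scheme://host[:non-default-port]path[?query]."
  [url]
  (let [u        (URI. url)
        scheme   (str/lower-case (.getScheme u))
        host     (str/lower-case (.getHost u))
        port     (.getPort u)                      ; -1 when absent
        default? (or (= port -1)
                     (= [scheme port] ["http" 80])
                     (= [scheme port] ["https" 443]))
        raw-path (.getRawPath u)
        path     (cond
                   (str/blank? raw-path) "/"        ; empty path -> /
                   (and (> (count raw-path) 1)
                        (str/ends-with? raw-path "/"))
                   (subs raw-path 0 (dec (count raw-path))) ; strip one trailing /
                   :else raw-path)
        query    (.getRawQuery u)]                  ; kept verbatim, fragment dropped
    (str scheme "://" host
         (when-not default? (str ":" port))
         path
         (when query (str "?" query)))))
```

Using the raw path and raw query keeps percent-encoding and parameter order exactly as received, which is what "verbatim" requires.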

10.3. R8 — Frequency (SHOULD)

A tool SHOULD fetch each source infrequently — daily or less — and SHOULD NOT re-fetch a source more than once per hour absent a specific reason. Enforcement is an orchestration concern (a per-source last_run_at), not gate logic.

10.4. R9 — Prefer structured formats; never scrape HTML (SHOULD / MUST NOT)

A tool MUST NOT extract content by scraping arbitrary HTML with per-site CSS or XPath selectors.

Markdown-first by content negotiation (general rule). Whenever a tool would request a page that it expects to be HTML, it SHOULD first send Accept: text/markdown and use the response body when the server answers with a markdown content-type (the Content-Type contains markdown), falling back to HTML (or another representation) only when markdown is not offered. This is one extra request header and yields a clean, selector-free representation, so it is the default for any page fetch, not only feed-less sources. A growing set of docs platforms content-negotiate markdown; most sites ignore the header and return HTML, which is fine.

For a feed-less source a tool SHOULD obtain a structured representation, preferring, in order:

A real feed (RSS / Atom / JSON Feed) when one exists.
Content negotiation — send Accept: text/markdown; use the body only if the response Content-Type contains markdown.
llms.txt — fetch /llms.txt at the host root (https://llmstxt.org), often served as text/plain; accept a text/plain body there as markdown.

From any markdown body, extract items generically: inline links [text](url) with absolute http(s) url, de-duplicated, in document order; text is the title. A tool MUST NOT rely on per-site structure. A source offering none of the above is left out, not scraped. llms.txt is advisory: R2 and R3 still decide whether a tool may fetch.
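
A sketch of the generic extraction, with two assumptions called out: de-duplication is by URL (the first title seen is kept), and image syntax (![alt](url)) is skipped so that sub-resources are not collected. The function names are hypothetical.

```clojure
(require '[clojure.string :as str])

(defn markdown-response?
  "True when the Content-Type header value contains 'markdown'."
  [content-type]
  (boolean (and content-type
                (str/includes? (str/lower-case content-type) "markdown"))))

(defn markdown-links
  "Returns [{:title text :url url} ...] for inline links with absolute
  http(s) URLs, de-duplicated by URL, in document order."
  [markdown]
  (->> (re-seq #"(?<!!)\[([^\]]*)\]\((https?://[^\s)]+)\)" markdown)
       (map (fn [[_ text url]] {:title text :url url}))
       (reduce (fn [{:keys [seen out] :as acc} {:keys [url] :as item}]
                 (if (seen url)
                   acc
                   {:seen (conj seen url) :out (conj out item)}))
               {:seen #{} :out []})
       :out))
```

Relative links are ignored because the pattern only accepts URLs that start with http:// or https://, which keeps the extraction independent of any per-site structure.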

11. Caching and persistence — R10, R11, R12

11.1. R10 — Blocklist schema: inline-first, external fallback (MUST)

The blocklist is self-describing (R3c): in the common path a tool validates against the inline schema field and performs no separate schema fetch, so the schema shares the blocklist's refresh TTL and there is nothing extra to cache. The rules below govern the fallback path — a document that omits schema, where the tool fetches the standalone blocklist.schema.json:

A tool SHOULD cache the external schema for SCHEMA_TTL_SECONDS.
On an external-schema refresh failure with a prior schema cached, a tool SHOULD retain and use the stale schema (fail-open) and MUST log it. A transient upstream failure must not silently disable validation.
(MUST) If a document omits schema and no external schema can be obtained (none inline, none cached, fetch fails), a tool MUST NOT adopt the document — with no schema at all there is no way to distinguish a valid blocklist from arbitrary JSON. Deliberately asymmetric with R3's data fail-open: "cannot validate" is fail-closed for adoption.
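
The adoption decision across R3b, R3c, and R10 can be sketched as one function. Everything here is illustrative: the document is assumed to be parsed JSON with string keys, valid? stands in for whatever JSON Schema validator the tool uses (it is a caller-supplied function, not a real library call), and only contract major v1 is treated as known.

```clojure
(defn known-contract?
  "True when the document's contract is walsh-research-blocklist/v1 (optionally v1.N)."
  [doc]
  (boolean (some->> (get doc "contract")
                    (re-matches #"walsh-research-blocklist/v1(\.\d+)?"))))

(defn next-blocklist
  "Returns the blocklist document to use next.
  previous:        last adopted document (or an empty list on first run)
  candidate:       freshly fetched document, or nil on fetch failure
  external-schema: cached or fetched standalone schema, or nil
  valid?:          (fn [schema doc]) -> truthy when doc validates (hypothetical)"
  [previous candidate external-schema valid?]
  (let [schema (or (get candidate "schema") external-schema)] ; inline first
    (if (and candidate
             schema                       ; no schema at all: do not adopt
             (known-contract? candidate)
             (valid? schema candidate))
      candidate
      previous)))                         ; retain the last-known list
```

Every failure path returns previous, so a bad fetch or an unvalidatable document can never shrink the set of honored opt-outs.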

11.2. R11 — Persistent caches (SHOULD)

A tool SHOULD persist its robots.txt, blocklist, and schema caches across process invocations, so a tool invoked once per cron run does not re-fetch every time. Storage format and location are the implementer's choice (JSON, SQLite, EDN, language-native serialization). The contract is external — observable fetch behavior at the declared TTLs — not how the cache is stored.

11.3. R12 — Cache-miss behavior (clarification)

When a cache entry is stale, the default is block-on-refresh: the next request stalls until the refresh completes (deterministic, no async machinery). A tool MAY instead implement stale-while-revalidate with a hard cutoff at STALE_HARD_CUTOFF. Either is conformant.
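
The two conformant behaviors can be sketched as a single decision. The function name and the returned keywords are hypothetical, ages and TTLs are in seconds, and treating an entry exactly at the cutoff as still usable is an assumption of this example.

```clojure
(defn cache-action
  "Decides how to treat a cache entry of the given age.
  swr? selects stale-while-revalidate; otherwise stale entries block on refresh."
  [age ttl swr?]
  (cond
    (<= age ttl)                  :use                ; fresh
    (and swr? (<= age (* 2 ttl))) :use-and-revalidate ; stale, within STALE_HARD_CUTOFF
    :else                         :refresh-blocking)) ; stale: refresh before use
```

With the 24 h robots.txt TTL, an entry that is 30 h old is served while revalidating under stale-while-revalidate, but one that is 49 h old blocks on refresh in either mode.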

12. Conformance test vectors

These input -> expected pairs are normative and implementation-free; a conformant tool MUST reproduce every expected output. They double as a portable offline suite.

12.1. robots.txt group selection (token `Walsh-Research`)

robots.txt (groups)	selected group	why
`*`: Disallow /a ‖ `Walsh-Research`: Disallow /b	Walsh-Research	named beats `*`
`*`: Disallow /a	`*`	no named group
`walsh`: Disallow /x ‖ `walsh-research`: Disallow /y	walsh-research	longest prefix match
`Googlebot`: Disallow /	(none)	no named, no `*` -> allow all
`WALSH-RESEARCH`: Disallow /z	WALSH-RESEARCH	case-insensitive

12.2. robots.txt path matching (within the selected group)

rules	path	result	why
Disallow: /research/bots/dogfood-disallow	/research/bots/dogfood-disallow.html	DENY	prefix match
Disallow: /research/bots/dogfood-disallow	/research/bots	ALLOW	rule is not a prefix of the path
Disallow: /	/anything	DENY	root disallow
Disallow:	/anything	ALLOW	empty Disallow matches nothing
Disallow: /a + Allow: /a/b	/a/b/c	ALLOW	longer Allow wins
Disallow: /a + Allow: /a	/a/x	ALLOW	equal length -> Allow wins
Disallow: /*.pdf$	/docs/report.pdf	DENY	`*` + end-anchor
Disallow: /*.pdf$	/docs/report.pdf?x=1	ALLOW	`$` anchors end; query breaks it

12.3. Blocklist domain matching (listed domain `example.com`)

host	blocked?
example.com	yes
www.example.com	yes
a.b.example.com	yes
EXAMPLE.COM	yes
notexample.com	no
example.com.evil.test	no
example.org	no

12.4. URL canonicalization (for dedup)

input	canonical
`HTTP://Example.com/Path/`	`http://example.com/Path`
`https://x.test:443/a#frag`	`https://x.test/a`
`http://x.test:80/`	`http://x.test/`
`https://x.test`	`https://x.test/`
`https://x.test/a/?q=1#f`	`https://x.test/a?q=1`
`https://x.test/a?b=2&a=1`	`https://x.test/a?b=2&a=1`

12.5. Retry-After parsing (now = `Sat, 23 May 2026 00:00:00 GMT`)

Retry-After	delay
`120`	120 s
`0`	0 s
`Sat, 23 May 2026 00:00:30 GMT`	30 s
`Sat, 23 May 2026 00:00:00 GMT`	0 s
`Fri, 22 May 2026 00:00:00 GMT` (past)	0 s

12.6. ISO-8601 refresh / TTL durations

duration	ms
`PT6H`	21600000
`PT30M`	1800000
`P1D`	86400000
`PT1H30M`	5400000
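
On the JVM these durations can be converted with java.time.Duration, as in the sketch below (duration-ms is a hypothetical name). Note the limitation: Duration/parse accepts day and time components only, so year or month designators such as P1M would throw.

```clojure
(import 'java.time.Duration)

(defn duration-ms
  "Converts an ISO-8601 duration with day/time components to milliseconds."
  [iso-duration]
  (.toMillis (Duration/parse iso-duration)))
```

A day is treated as exactly 24 hours, which matches the P1D row above.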

13. Conformance fixtures

Live URLs a tool can fetch to self-verify the gates end-to-end. They are enumerated machine-readably in test-fixtures.json (contract walsh-research-test-fixtures/v1):

{
  "contract": "walsh-research-test-fixtures/v1",
  "updated":  "2026-05-23T00:00:00Z",
  "fixtures": {
    "robots_allowed_sentinel": "https://wal.sh/research/bots/dogfood-allow",
    "robots_disallowed":       "https://wal.sh/research/bots/dogfood-disallow",
    "walsh_only_allowed":      "https://wal.sh/research/bots/dogfood-walsh-only",
    "external_positive":       "https://pypi.org/project/jsonschema/",
    "operator_blocked_host":   "example.com",
    "flaky_429_endpoint":      null
  },
  "notes": {
    "external_positive": "PyPI project page (robots-allowed). Do NOT use /pypi/<name>/json: pypi.org Disallows /pypi/*/json, so a robots-respecting bot must refuse it.",
    "flaky_429_endpoint": "no live R5 fixture; tools SHOULD use a local mock"
  }
}

Fixture	Expected for Walsh-Research
`robots_allowed_sentinel`	fetched (200); allowed by robots
`robots_disallowed`	refused at R2, every representation (`.html`, `.md`, none)
`walsh_only_allowed`	fetched; `*` bots are `Disallow`-ed here, our named group is not
`external_positive`	fetched (200); a robots-allowed third-party page, sandbox-reachable
`operator_blocked_host`	any URL on this host refused at R3 (live blocklist entry)

Verify these live (same site): the fixtures index test-fixtures.json, the canaries dogfood-allow / dogfood-disallow / dogfood-walsh-only, the self-describing blocklist.json and its blocklist.schema.json, and robots.txt.

The external positive is pypi.org/project/jsonschema/ because PyPI is reachable from inside the sandboxes LLM harnesses run in (the package index is allowlisted almost universally); a self-test target must work there or the spec cannot be self-verified from a single-prompt bootstrap. The obvious pypi.org/pypi/<name>/json API path is robots-disallowed (Disallow: /pypi/*/json), so a conformant bot must refuse it; the project page is the correct allowed positive. No live 429 fixture is provided; R5 is verified against a local mock.
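
A harness has to skip fixtures whose value is null (such as flaky_429_endpoint) and should refuse a fixtures document with an unexpected contract string. A minimal sketch, assuming the document has already been fetched and is passed in as text; `live_fixtures` is a hypothetical helper name.

```python
import json
from typing import Dict

FIXTURES_CONTRACT = "walsh-research-test-fixtures/v1"


def live_fixtures(raw_json: str) -> Dict[str, str]:
    """Return the fixtures that have a live target, keyed by fixture name."""
    doc = json.loads(raw_json)
    if doc.get("contract") != FIXTURES_CONTRACT:
        raise ValueError(f"unexpected fixtures contract: {doc.get('contract')!r}")
    # A null value means "no live fixture"; those cases use a local mock instead.
    return {name: target for name, target in doc["fixtures"].items() if target is not None}
```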

14. Attestation

A tool MAY publish a self-attestation so publishers and auditors can compare implementations. It is not consulted by any gate — robots.txt and the blocklist already decide routing — it is ecosystem hygiene only.

Contract walsh-research-attestation/v1:

{
  "contract":     "walsh-research-attestation/v1",
  "tool":         "<implementer-chosen name>",
  "tool_version": "<semver or freeform>",
  "spec":         "walsh-research-compliance/v1.3",
  "user_agent":   "Mozilla/5.0 (compatible; Walsh-Research/1.2; +https://wal.sh/bot/)",
  "requirements": { "R1": true, "R2": true, "R3": true, "...": true }
}

The publishing location is the implementer's choice (a URL on the operator's site, a release artifact). The spec defines the document shape, not where it lives. Signing (e.g. Ed25519 over the JSON canonical form, keys at a published JWKS) is a future option, deferred until a consumer needs it.
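
Emitting the attestation is a matter of filling the contract's six fields. The sketch below assumes the tool keeps a mapping of requirement IDs to booleans; the tool name and version in the usage line are placeholders.

```python
import json
from typing import Dict, Mapping

ATTESTATION_CONTRACT = "walsh-research-attestation/v1"
SPEC = "walsh-research-compliance/v1.3"
USER_AGENT = "Mozilla/5.0 (compatible; Walsh-Research/1.2; +https://wal.sh/bot/)"


def build_attestation(tool: str, tool_version: str, requirements: Mapping[str, bool]) -> Dict[str, object]:
    """Build a self-attestation document in the walsh-research-attestation/v1 shape."""
    return {
        "contract": ATTESTATION_CONTRACT,
        "tool": tool,
        "tool_version": tool_version,
        "spec": SPEC,
        "user_agent": USER_AGENT,
        "requirements": dict(requirements),
    }


def render_attestation(attestation: Mapping[str, object]) -> str:
    """Serialize with stable key order so published attestations diff cleanly."""
    return json.dumps(attestation, indent=2, sort_keys=True)
```

For example, `render_attestation(build_attestation("example-tool", "0.1.0", {"R1": True, "R2": True}))` yields a JSON document ready to publish wherever the implementer chooses.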

15. Conformance checklist

A tool is conformant when every MUST holds.

#	Requirement	Level
R1	exact User-Agent; token case-insensitive; no impersonation	MUST
R2	robots.txt fetched, cached (`ROBOTS_TTL_SECONDS`), obeyed; RFC 9309	MUST
R2a	named `Walsh-Research` group overrides `*`, longest-match selection	MUST
R2b	path: longest-match, Allow wins ties, `*` / `$` wildcards	MUST
R2c	may over-block (skip Allow/wildcards), MUST NOT under-block	MUST
R2d	Crawl-delay honored	MUST
R3	operator blocklist consulted before robots	MUST
R3a	apex + subdomain match, case-insensitive, port-stripped	MUST
R3b	retain last list on fetch failure / 404	MUST
R3c	validate against schema; reject invalid / unknown contract major	MUST
R4	serial; <= 1 req/s/host; spacing = max(1s, Crawl-delay); unseen host: no wait	MUST
R5	429/503 backoff + full jitter; cap; <= 5 retries; Retry-After	MUST
R6	no recursion / sub-resources / training / republish; metadata-only	MUST
R7	conditional fetch (ETag / Last-Modified) + canonical-URL dedup	SHOULD
R8	infrequent (daily or less)	SHOULD
R9	feed > content-neg markdown > llms.txt; never scrape HTML	SHOULD/MUST NOT
R10	schema TTL; stale-on-failure; first-fetch-fail MUST NOT adopt	MUST
R11	persist caches across invocations (format = implementer's choice)	SHOULD
R12	cache-miss: block-on-refresh default (SWR with 2×TTL cutoff allowed)	clarification
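
The R4 row compresses the throttle rule into one line: spacing is max(1 s, Crawl-delay) per host, and a host not seen before needs no wait. A sketch of that calculation, assuming the caller tracks the last request time per host with a monotonic clock; `MIN_SPACING_SECONDS` is an illustrative name, not one of the normative constants.

```python
from typing import Mapping, Optional

MIN_SPACING_SECONDS = 1.0  # R4: at most one request per second per host


def wait_seconds(
    host: str,
    now: float,
    last_request: Mapping[str, float],
    crawl_delay: Optional[float] = None,
) -> float:
    """Seconds to sleep before the next request to `host` (0.0 if none)."""
    previous = last_request.get(host)
    if previous is None:
        return 0.0  # unseen host: no wait
    spacing = max(MIN_SPACING_SECONDS, crawl_delay or 0.0)
    return max(0.0, previous + spacing - now)
```

Both `now` and the stored timestamps would typically come from `time.monotonic()`, so wall-clock adjustments cannot shorten the spacing.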

15.1. Reference-implementation notes

Two complementary references, neither the contract:

jwalsh/tech-crawler (Clojure): satisfies every MUST except R2b. Its path matcher is a conservative subset: it honors Disallow prefixes but treats Allow and the * and $ wildcards as no-ops, so it may over-block (permitted by R2c). Its longest-match group selection (R2a) is correct.
Python conformance harness: full RFC 9309 path matching (Allow + wildcards) via the platform robots parser. That parser selects groups by first match rather than longest match, so the harness adds its own selection layer to satisfy R2a. Verified 21/21 against a localhost mock plus a live smoke test. Archived in aygp-dr/walsh-research-compliance/archive/.

Together they cover both halves of R2; a fully-conformant tool does both.

16. Self-test canaries (dogfood)

We hold our own tools to this spec against the live site. Three sibling pages under /research/bots/ exercise the R2 gate in all directions at one path depth:

Canary	`*` sees	Walsh-Research sees	Bot must
dogfood-disallow	allowed	`Disallow`	refuse
dogfood-allow	allowed	allowed	fetch
dogfood-walsh-only	`Disallow`	`Allow` (named)	fetch

User-agent: *
Disallow: /research/bots/dogfood-walsh-only

User-agent: Walsh-Research
Disallow: /research/bots/dogfood-disallow
Allow:    /research/bots/dogfood-allow
Allow:    /research/bots/dogfood-walsh-only
Crawl-delay: 2

dogfood-disallow proves we honor Disallow across every content-negotiated representation (.html, .md, extensionless): the decision is made from the path before any fetch.
dogfood-allow proves we do not over-block an allowed sibling.
dogfood-walsh-only proves the other direction of R2a: a generic * bot is disallowed, but our named group selects past * and is permitted, so we fetch. It doubles as a honeypot: since * is disallowed, the only legitimate fetchers are humans (browsers ignore robots.txt) and Walsh-Research, so any other agent in the access logs is a robots-violating bot.

These pages are intentionally undocumented elsewhere — they exist only to be crawled or refused by a conformance run. A conformant tool SHOULD assert all three outcomes against the live robots.txt as an attestation.
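
The three outcomes can be checked offline against the robots.txt excerpt above. This sketch uses Python's standard `urllib.robotparser`, which applies rules in file order rather than RFC 9309 longest-match; for this particular file the two strategies agree, so it works as a smoke test but not as a conformance-grade matcher. The bot token, not the full User-Agent string, is passed as the agent.

```python
from typing import Dict
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /research/bots/dogfood-walsh-only

User-agent: Walsh-Research
Disallow: /research/bots/dogfood-disallow
Allow:    /research/bots/dogfood-allow
Allow:    /research/bots/dogfood-walsh-only
Crawl-delay: 2
"""

CANARIES = ("dogfood-disallow", "dogfood-allow", "dogfood-walsh-only")


def canary_outcomes(agent: str) -> Dict[str, bool]:
    """Map each canary page to True (may fetch) or False (must refuse)."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    base = "https://wal.sh/research/bots/"
    return {name: parser.can_fetch(agent, base + name) for name in CANARIES}
```

A generic agent gets the mirror-image result on `dogfood-walsh-only`, which is what makes that page usable as a honeypot.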

17. Opt-out workflow (operator side)

A site self-serves via robots.txt (R2) at any time — honored within the robots cache TTL.
Or emails j@wal.sh; the operator appends a {domain, added, reason} entry to blocklist.json and deploys — honored within the refresh TTL.

A tool implementer's only job is to consume both sources correctly.

18. Build recipe

A suggested order to build a Walsh-Research/x.y tool from this document alone:

Primitives. HTTP client, XML+JSON parser, monotonic clock + sleep, SHA-256. (See 20 for per-language mappings.)
R1. Hard-code USER_AGENT; send it on every request.
R2 parser. Groups -> selection -> path matching; pass the robots.txt test vectors (12.1, 12.2); cache per host (ROBOTS_TTL_SECONDS).
R3 + R10. Fetch schema (cache SCHEMA_TTL_SECONDS, stale-on-failure), validate the blocklist, cache it (refresh), domain matching, retain-on-failure; pass the blocklist vectors.
Gate. Wire may_fetch in the normative order: blocklist -> robots -> throttle.
R4 + R5. Throttle and backoff; pass the Retry-After / duration vectors.
R7 + R11. Conditional headers, canonical-URL dedup, persistent caches.
R9. Feeds; markdown content negotiation and /llms.txt fallbacks; generic link extraction. Never per-site selectors.
Attest. Run against the live conformance fixtures (13) and the self-test canaries (16); emit an attestation (14) recording the spec version targeted.

A tool that reproduces every test-vector table in 12, passes the live fixtures and all three dogfood canaries, and runs the gate in the normative order is conformant with walsh-research-compliance/v1.3.
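
The Gate step wires the checks in a fixed order: blocklist, then robots, then throttle. A sketch of that wiring with the three checks injected as callables, so each can be built and tested separately; the `Decision` type and parameter names are illustrative.

```python
from typing import Callable, NamedTuple


class Decision(NamedTuple):
    allowed: bool
    gate: str  # the requirement that decided: "R3", "R2", or "R4"


def may_fetch(
    url: str,
    is_blocked: Callable[[str], bool],
    robots_allows: Callable[[str], bool],
    throttle: Callable[[str], None],
) -> Decision:
    """Run the pre-request gates in the normative order."""
    if is_blocked(url):            # R3: operator blocklist comes first
        return Decision(False, "R3")
    if not robots_allows(url):     # R2: robots.txt
        return Decision(False, "R2")
    throttle(url)                  # R4: sleep until the host's spacing has elapsed
    return Decision(True, "R4")
```

Because a refusal returns before `throttle` is called, a blocked or disallowed URL never consumes a rate-limit slot.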

19. Deferred (out of scope for v1.2)

Content-Signal (draft-romm-aipref-contentsignals; ai-train, search, ai-input, advertised in wal.sh/robots.txt) is orthogonal to the four primitives and binds only a pipeline that performs the named use. The current bot is a metadata extractor that does not feed an LLM, so ai-input is not binding. A future bot version that adds LLM summarization would make ai-input binding on its intake; defining which pipeline operations count is a v2 item.
Attestation signing and a keys endpoint — deferred until a consumer needs verifiable attestations.
Cross-process throttle backend (Redis / SQLite token bucket) — deferred; single-instance scheduling is the v1.2 default (R4a).
Redirect policy (C6) — 3xx handling is undefined. Following a redirect can leave robots scope (/research/aoc/src redirects to /research/aoc/src/ which may have different robots rules). Proposed for v1.4: re-run the Location header value through the full gate pipeline (R3→R2→R4) before fetching the redirected URL. Current behavior (transparent redirect via the HTTP library) is acceptable for v1.2/v1.3 but should be tightened.
Transport error retry (C5) — R5 covers 429/503 but is silent on connection/TLS/timeout failures. Proposed for v1.4: apply the same backoff policy to transport errors. Surfaced by 6+ implementations in the compliance harness.
R13×R2 interaction (C1) — R13 appends query parameters; R2 path matching includes query. Resolution adopted by all 29 implementations: gates evaluate the untagged URL; R13 params are appended only at HTTP transmission. Proposed for v1.4: codify this as normative.

20. Porting notes

Nothing here depends on a language. Each requirement is behavior over four primitives every language has; only the primitives change.

Primitive	Clojure	Python	Rust	Guile	TypeScript	Go
HTTP client	clj-http	httpx / requests	reqwest	web client	fetch / undici	net/http
XML parse	clojure.data.xml	xml.etree	quick-xml	sxml	fast-xml-parser	encoding/xml
JSON parse	cheshire	json	serde_json	guile-json	JSON.parse	encoding/json
Clock+sleep	currentTimeMillis/sleep	time / sleep	std::time	usleep	Date.now/setTimeout	time
SHA-256	MessageDigest	hashlib	sha2	gcrypt	crypto.subtle	crypto/sha256

21. References

RFC 9309 — Robots Exclusion Protocol (robots.txt).
RFC 8615 — Well-Known URIs (/.well-known/).
RFC 9110 — HTTP Semantics (User-Agent §10.1.5; Retry-After §10.2.3).
RFC 2119 / RFC 8174 — requirement keywords.
draft-romm-aipref-contentsignals / contentsignals.org — Content-Signal.
llmstxt.org — the llms.txt convention (Howard, 2024).

22. Changelog

v1 (2026-05-23) — initial: R1 UA, R2 robots/RFC 9309 + Crawl-delay, R3 blocklist + schema, R4 rate limit, R5 backoff, R6 scope, R7 conditional fetch + dedup, R8 frequency; rev 2 added R9 (structured formats); rev 3 added the dogfood canary pair.
v1.1 (2026-05-23) — make the spec a standalone, single-prompt-bootstrappable build target. Folds in v1 rev 1–3. Tightens R2 to full RFC 9309 path matching (Allow precedence plus the * and $ wildcards, with a safe over-block allowance, R2b/R2c) and longest-match group selection (R2a). Promotes schema validation to MUST (R3c). Adds R10 (schema cache + stale-on-failure), R11 (persistent caches), R12 (cache-miss behavior). Pins exact algorithms for R3 domain matching, R5 Retry-After, R7 canonical URL, and durations, plus a single normative-constants table (6). Adds conformance test vectors (12), live conformance fixtures (13, test-fixtures.json) with the dogfood-walsh-only bidirectional canary and a sandbox-reachable external positive, an attestation contract (14), a build recipe (18), and the versioning and identity model (4). Code examples are Clojure (the reference language) shown for demonstration only; conformance is defined by behavior, contracts, and test vectors, and is language-agnostic.
v1.2 (2026-05-24) — bump bot identity from Walsh-Research/1.1 to Walsh-Research/1.2; update USER_AGENT constant, attestation example, and all normative references accordingly. Add R13 (implementation tagging). No other behavioral requirements changed.
v1.3 (2026-05-24) — add R13 (implementation tagging, SHOULD). Add bot compliance lab with 19 per-bot canary pages at /research/bots/lab/disallow-{token}/ for empirical robots.txt compliance verification. Seven implementations across six languages (TypeScript, Clojure, Python, Elisp, Guile, JavaScript) updated to v1.2 bot identity and v1.3 spec references.

22.1. R13: Implementation tagging (SHOULD)

When a conformant tool runs its test suite against the live conformance fixtures (13) or other compliance-relevant URLs (robots.txt, blocklist.json, dogfood-* pages), it SHOULD append query parameters identifying the implementation:

https://wal.sh/robots.txt?impl=walsh/research&sha=afd2c01&spec=walsh-research-compliance/v1.3

Parameter	Value	Purpose
`impl`	`{org}/{repo}`	Which implementation is running the test
`sha`	short git SHA	Which commit of that implementation
`spec`	spec version string	Which spec version the tool targets

This enables the spec operator to cross-reference access logs against known implementations without requiring out-of-band coordination. The query parameters MUST NOT affect the semantics of the request — they are observability metadata only.

A tool MAY auto-detect its identity from GITHUB_REPOSITORY / GITHUB_SHA environment variables (CI) or from git remote get-url origin / git rev-parse --short HEAD (local).

This requirement is a SHOULD because it aids operator observability but is not necessary for correct protocol behavior. A tool that omits these parameters is still conformant.

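(require '[clojure.string :as str])

;; Spec version this tool targets; assumed to be defined once per tool and
;; used as the fallback when a caller omits :spec-version.
(def spec-version "walsh-research-compliance/v1.3")

(defn tag-compliance-url
  "Append implementation identity to compliance fixture URLs.
  URLs that do not contain \"wal.sh\" are returned unchanged."
  [url {:keys [repo sha] tagged-spec :spec-version}]
  (if (and repo (str/includes? url "wal.sh"))
    (str url
         ;; Extend an existing query string, or start a new one.
         (if (str/includes? url "?") "&" "?")
         "impl=" (java.net.URLEncoder/encode repo "UTF-8")
         (when sha (str "&sha=" sha))
         ;; Bound as tagged-spec so it does not shadow the global default.
         "&spec=" (or tagged-spec spec-version))
    url))

;; Usage in the pre-request pipeline:
(tag-compliance-url
  "https://wal.sh/robots.txt"
  {:repo "walsh/research" :sha "afd2c01" :spec-version spec-version})
;; => "https://wal.sh/robots.txt?impl=walsh%2Fresearch&sha=afd2c01&spec=walsh-research-compliance/v1.3"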
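
The identity auto-detection described above (CI environment variables first, then local git) might look like the following Python sketch. The helper names are hypothetical, the short SHA is taken as the first seven characters of GITHUB_SHA, and the `{org}/{repo}` value is assumed to be the last two path segments of the origin remote.

```python
import os
import subprocess
from typing import Dict, Mapping, Optional


def _git(*args: str) -> Optional[str]:
    """Run a git command and return trimmed stdout, or None on any failure."""
    try:
        done = subprocess.run(
            ["git", *args], capture_output=True, text=True, check=True, timeout=5
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return done.stdout.strip() or None


def repo_from_remote(remote: str) -> Optional[str]:
    """Extract `org/repo` from an HTTPS or SSH remote URL."""
    trimmed = remote.strip().rstrip("/")
    if trimmed.endswith(".git"):
        trimmed = trimmed[:-4]
    parts = [p for p in trimmed.replace(":", "/").split("/") if p]
    return "/".join(parts[-2:]) if len(parts) >= 2 else None


def detect_identity(env: Optional[Mapping[str, str]] = None) -> Optional[Dict[str, str]]:
    """Return {"impl": ..., "sha": ...} from CI variables or local git, else None."""
    env = os.environ if env is None else env
    repo, sha = env.get("GITHUB_REPOSITORY"), env.get("GITHUB_SHA")
    if repo and sha:
        return {"impl": repo, "sha": sha[:7]}
    remote = _git("remote", "get-url", "origin")
    short = _git("rev-parse", "--short", "HEAD")
    repo = repo_from_remote(remote) if remote else None
    if repo and short:
        return {"impl": repo, "sha": short}
    return None
```

Returning `None` when nothing can be detected fits the SHOULD level of R13: the tool simply sends untagged requests.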