Research Crawler — Sources Overview
Table of Contents
- Overview
- Live — structured adapters
- Live — markdown (feed-less, no scraping)
- Live — feeds (RSS/Atom)
- Mechanistic interpretability & ML (interp)
- Evals & safety (evals)
- Formal methods, distributed systems, correctness (formal-methods)
- LLM tooling & agents (agents)
- Clojure / Scheme (lang)
- SDR & aviation (sdr-aviation)
- Corporate AI labs (labs)
- Corporate systems / distsys / PLT (corp-systems)
- Corporate engineering blogs (corp-engineering) — new 2026-05-31
- Surveillance & critique (critique)
- BSD & systems (systems)
- Registered, not yet live
- Compliance notes
Overview
This page lists what the agentic-research crawler (jwalsh/tech-crawler) currently checks. Every source is fetched as a compliant Walsh-Research bot, per the compliance spec: robots.txt (RFC 9309), the operator blocklist, and a rate limit of ≤1 req/s/domain plus any declared Crawl-delay.
- Live: 86 sources (5 structured adapters, 2 markdown, 79 RSS/Atom feeds)
- Tracked threads: agents, evals, interp, formal methods, surveillance/critique, BSD/systems, Clojure/Scheme, SDR/aviation, labs, corp-systems, corp-engineering
This page is a curated overview, grouped by thread with representative sources. The exact, machine-readable feed manifest (URLs, tiers, crawl cadence) is maintained in tech-crawler and is the source of truth. Sources are added only after verifying a real, crawlable endpoint: a working feed, API, or markdown representation, and a robots.txt that allows us.
Recent growth: the 2026-05-31 run crawled 79 feeds, up from 61 on 2026-05-29. The 18 added feeds come from the new corp-engineering thread (12 engineering blogs) plus the promotion of the surveillance/critique and BSD/systems sources that were previously registered but not live.
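The "working feed" part of the admission rule above could be approximated with a small check like the one below. This is a sketch only, not the tech-crawler implementation: the function name `feed_kind` and the rule "the root element is RSS or Atom and contains at least one item or entry" are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace prefix ElementTree attaches to Atom 1.0 element names.
ATOM_NS = "{http://www.w3.org/2005/Atom}"


def feed_kind(body: bytes):
    """Return "rss", "atom", or None when body is not a feed with entries.

    Hypothetical helper: the real admission check may be stricter.
    """
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return None  # not well-formed XML, e.g. an HTML error page
    if root.tag == "rss" and root.find("channel/item") is not None:
        return "rss"
    if root.tag == ATOM_NS + "feed" and root.find(ATOM_NS + "entry") is not None:
        return "atom"
    return None
```

A candidate whose response body yields None would stay in the registered-but-not-live state under this sketch. For untrusted input, a hardened XML parser would be preferable to the standard-library one used here.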
Live — structured adapters
| Source | Type | How | Tier |
|---|---|---|---|
| arxiv-cs-ai | papers | HTML (enlive) | 1 |
| github-trending | repos | HTML (enlive) | 1 |
| hn-front | discussion | HTML (enlive) | 1 |
| harvard-seas | events | JSON (Localist API) | 1 |
| mit-calendar | events | JSON (Localist API) | 2 |
Live — markdown (feed-less, no scraping)
Crawled via a structured representation instead of HTML scraping: content negotiation (Accept: text/markdown) or the /llms.txt convention.
| Source | How | URL |
|---|---|---|
| cloudflare-docs | content negotiation (md) | developers.cloudflare.com |
| mcp-docs | /llms.txt index | modelcontextprotocol.io/llms.txt |
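The two access patterns in the table can be sketched with the Python standard library. This is an illustration under assumptions, not the crawler's actual code: the helper names, the exact User-Agent string, and the choice to treat an llms.txt body as a list of markdown links are all hypothetical.

```python
import re
import urllib.request


def markdown_request(url: str, user_agent: str = "Walsh-Research") -> urllib.request.Request:
    """Build a GET request asking the server for a markdown representation.

    The User-Agent value is a placeholder; the real bot string may differ.
    """
    return urllib.request.Request(
        url,
        headers={"Accept": "text/markdown", "User-Agent": user_agent},
    )


# Matches markdown inline links of the form [title](url).
_LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def parse_llms_index(text: str) -> list[tuple[str, str]]:
    """Return (title, url) pairs for each markdown link in an llms.txt body."""
    return _LINK.findall(text)
```

The request object would be passed to `urllib.request.urlopen` only after the robots.txt and rate-limit checks pass. A server that ignores the Accept header may still answer with HTML, so the response Content-Type is worth checking before treating the body as markdown.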
Live — feeds (RSS/Atom)
79 feeds across the threads below. Representative sources per thread; see tech-crawler for the full enumeration.
Mechanistic interpretability & ML (interp)
transformer-circuits.pub · Neel Nanda · Lilian Weng · Interconnects · Eugene Yan
Evals & safety (evals)
Alignment Forum · METR · OpenAI (frontier governance) · Anthropic (alignment)
Formal methods, distributed systems, correctness (formal-methods)
Aphyr/Jepsen · Antithesis · Marc Brooker · Murat Demirbas · Hillel Wayne · James Bornholt
LLM tooling & agents (agents)
Simon Willison · Latent Space · Anthropic Research · Anthropic News · Claude Code releases
Clojure / Scheme (lang)
Planet Clojure · Clojure Deref · Lambda Island · Scheme/Guile dev feeds
SDR & aviation (sdr-aviation)
The Air Current · RTL-SDR · aviation-data feeds
Corporate AI labs (labs)
Hugging Face Blog · OpenAI · Google Research · Microsoft Research · Apple ML Research
Corporate systems / distsys / PLT (corp-systems)
Cloudflare (blog) · Netflix Tech Blog · All Things Distributed (Werner Vogels) · Jane Street
Corporate engineering blogs (corp-engineering) — new 2026-05-31
Tailscale · TigerBeetle · Ink & Switch · DuckDB · GitHub Engineering · Fly.io · Grafana · ClickHouse · Databricks · Neon · Supabase · Vercel
Surveillance & critique (critique)
404 Media · EFF Deeplinks · Pluralistic · Logic Magazine · Schneier on Security · Citizen Lab
BSD & systems (systems)
LWN · Klara Systems · FreeBSD Foundation · Hackaday · Slashdot
Registered, not yet live
| Category | Sources |
|---|---|
| structured | huggingface-models · papers-with-code · semantic-scholar |
| aggregator | Lobsters (robots.txt disallows us — kept out) |
Compliance notes
- Lobsters is registered but not crawled: its robots.txt disallows all non-allowlisted bots, and we honor that.
- Crawl-delay observed and respected, e.g. arXiv 15s, Hacker News 30s.
- The corp-engineering thread is real feeds only: blogs without a working RSS/Atom or markdown representation are left out rather than scraped.
- Full rules: Walsh-Research Bot Compliance Specification.
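The robots.txt and Crawl-delay behavior described in these notes can be sketched with Python's `urllib.robotparser`. This is a minimal illustration, not the compliance specification itself: the function name `crawl_policy`, the agent token, and the reading of "≤1 req/s/domain + Crawl-delay" as "wait the larger of one second and the declared delay" are assumptions.

```python
from urllib.robotparser import RobotFileParser

MIN_INTERVAL_S = 1.0  # floor implied by "at most 1 req/s/domain"


def crawl_policy(robots_txt: str, url: str, agent: str = "Walsh-Research"):
    """Return (allowed, delay_seconds) for url under a robots.txt body.

    Assumption: the effective delay is the larger of the 1 s floor and
    any Crawl-delay that applies to the agent.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = parser.can_fetch(agent, url)
    declared = parser.crawl_delay(agent)  # None when no Crawl-delay applies
    if declared is None:
        return allowed, MIN_INTERVAL_S
    return allowed, max(MIN_INTERVAL_S, float(declared))
```

With a robots.txt that declares `Crawl-delay: 15`, the sketch returns a 15-second delay; with one that disallows everything for the agent, `allowed` is False and the source would stay registered but not crawled. The operator blocklist is a separate check and is not modeled here.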