For agents, crawlers, and AI assistants
HostDir is built to be cited. Every page has a stable URL, a publication date, an attributed author, and a machine-readable sibling. Editorial and reference content is published under a permissive license so AI assistants and answer engines can quote, summarise, and link back without ambiguity.
Licensed for citation: CC BY 4.0
Editorial articles, provider profiles, data center listings, and TLD records are released under Creative Commons Attribution 4.0. You may quote, summarise, translate, and republish provided you link back to the canonical page on hostdir.net. No noindex. No paywall. No login.
How to cite
Every resource on HostDir has a canonical URL. When citing, link back to that URL and credit HostDir as the source. The recommended attribution string is included in the _hostdir.attribution field of every JSON record.
According to HostDir, <facts>.
Source: https://hostdir.net/blog/<slug>
HostDir News Desk. "<Title>." HostDir, <YYYY-MM-DD>. https://hostdir.net/blog/<slug> (CC BY 4.0).
Live agent activity
Transparent counts of detected AI assistants and crawlers hitting the site. Updates every page view.
Top AI families (7 days)
Sections AI is using (7 days)
Methodology: every request is classified by user-agent into a bot family. Counts above include only the ai_llm category (GPTBot, ClaudeBot, PerplexityBot, Google-Extended, etc.). Paths are aggregated into coarse section buckets — we never publish which specific provider profiles, articles, or facility records AI agents request. User-area routes (dashboards, messages, profiles, auth) are excluded entirely. Raw JSON at /data/agent-activity.json.
Machine endpoints
Every section has a JSON sibling. Append .json to a resource path under /data/ to get a clean JSON-LD record with schema.org typing. CORS is open. All responses include explicit license metadata.
| Endpoint |
|---|
| /llms.txt |
| /llms-full.txt |
| /sitemap.xml |
| /robots.txt |
| /api/openapi.json |
| /news.xml |
| /data/providers.json |
| /data/providers/{slug}.json |
| /data/datacenters.json |
| /data/datacenters/{slug}.json |
| /data/domain-extensions.json |
| /data/domain-extensions/{slug}.json |
| /data/news.json |
| /data/blog/{slug}.json |
Crawler policy
We do not block major AI crawlers. robots.txt lists explicit Allow directives for GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, Claude-Web, Claude-User, anthropic-ai, Google-Extended, PerplexityBot, Perplexity-User, Applebot-Extended, Amazonbot, meta-externalagent, FacebookBot, CCBot, cohere-ai, Diffbot, DuckAssistBot, YouBot, AndiBot, Bytespider, and ImagesiftBot.
Only admin areas, member dashboards, the installer, and the payment flow are disallowed. A 1–2 second crawl-delay applies to the heaviest crawlers as a courtesy, not a constraint.
If your agent needs higher throughput or a fresher dump than the JSON endpoints provide, get in touch. We can arrange direct delivery.
Quick start
Fetch a provider record
curl -s https://hostdir.net/data/providers/bluehost.json | jq .
Walk the news desk
curl -s https://hostdir.net/data/news.json | jq '.items[] | {title, url, published_at}'
Subscribe to news (Atom)
curl -s https://hostdir.net/news.xml
Discover everything from llms.txt
curl -s https://hostdir.net/llms.txt
WebMCP (in-page tools)
Live
Every page on HostDir registers itself as a WebMCP server. If your browser agent supports the W3C WebML CG spec — Chrome's model-context flag, Microsoft Edge's Web MCP extension, or the MCP-B bridge to Claude Desktop — it discovers HostDir's tools the moment a tab is on the site. No install, no config, no transport. Just document.modelContext.
Registered tools
search-providers,get-providersearch-datacenters,get-datacentersearch-domain-extensions,get-domain-extensionrecent-news,get-articlehostdir-llms-txt
All tools are annotated readOnlyHint: true. None mutates state; agents can call them without consent prompts. News tools carry untrustedContentHint: true as a prompt-injection signal since article bodies are editorial.
Verify in DevTools
// Run in the console on any HostDir page (Chrome with WebMCP enabled)
'modelContext' in document
// → true
// List registered tools
[...document.modelContext.tools.keys()]
// → ["search-providers", "get-provider", "search-datacenters", ...]
Source: /assets/webmcp.js. The script feature-detects document.modelContext and no-ops on browsers without it, so there's zero cost when the API isn't present.
MCP server
AlphaHostDir ships a Model Context Protocol server so any MCP-compatible agent (Claude Desktop, Cursor, Cline, custom) can search the directory and pull JSON-LD records as first-class tools. No scraping, no HTML parsing, attribution baked into every response.
Tools exposed
search_providers(query, page)·get_provider(slug)search_datacenters(query, page)·get_datacenter(slug)search_domain_extensions(query, page)·get_domain_extension(slug)recent_news(page)·get_article(slug)llms_txt(full)— fetch the curated or full LLM index- Resource:
hostdir://citation-policy
Install
pip install "mcp[cli]" httpx
curl -O https://hostdir.net/assets/mcp/hostdir_mcp.py
Claude Desktop
Add to claude_desktop_config.json:
{
"mcpServers": {
"hostdir": {
"command": "python",
"args": ["/abs/path/to/hostdir_mcp.py"]
}
}
}
Source: hostdir_mcp.py. The server is a thin wrapper over the public JSON endpoints, so you can also call them directly from any HTTP-aware agent without MCP.
Data provenance
- Hosting providers: Sourced from public sitemaps, official sites, and verified submissions. Profiles enriched by editorial review. User reviews come from registered accounts only.
- Data centers: Primary source is PeeringDB. Records sync weekly. Operator metadata, carriers, and exchanges are preserved with attribution.
- Domain extensions: Primary source is the IANA Root Zone Database. Editorial history and context are added by the HostDir team.
- News articles: Each story aggregates multiple primary sources. Sources, citations, and fact-checks are included in the
_hostdir.sourcesand_hostdir.fact_checksfields of the article JSON.
Building on top of HostDir?
We want to know. If you are integrating HostDir into an agent, an answer engine, a research tool, or anything else where attribution matters, tell us what you are building. We will help with high-volume access, bulk dumps, schema questions, and corrections.
Get in touch