Machine-readable surface

For agents, crawlers, and AI assistants

HostDir is built to be cited. Every page has a stable URL, a publication date, an attributed author, and a machine-readable sibling. Editorial and reference content is published under a permissive license so AI assistants and answer engines can quote, summarise, and link back without ambiguity.

Built to be cited

Editorial articles, provider profiles, data center listings and TLD records are open to read, open to crawl, and open to cite. You may quote and summarise provided you link back to the canonical page on hostdir.net. No noindex. No paywall. No login.

4,191

Hosting providers

5,158

Data centers

1,583

Domain extensions

556

Published articles

How to cite

Every resource on HostDir has a canonical URL. When citing, link back to that URL and credit HostDir as the source. The recommended attribution string is included in the _hostdir.attribution field of every JSON record.

Inline citation

According to HostDir, <facts>.
Source: https://hostdir.net/blog/<slug>

Footnote / bibliography

HostDir News Desk."<Title>." HostDir, <YYYY-MM-DD>. https://hostdir.net/blog/<slug>

Live agent activity

Transparent counts of detected AI assistants and crawlers hitting the site. Updates every page view.

Raw JSON

Last 24 hours

826

8 AI families

Last 7 days

28,587

10 AI families

Last 30 days

272,757

17 AI families

AI requests, last 30 days

Peak: 23,538 in a single day

share of traffic (24h)

Top AI families (7 days)

OpenAI GPTBot 14,440

Meta AI 6,339

Velen AI 5,010

ChatGPT User 997

ByteDance / TikTok 596

Perplexity AI 465

Sections AI is using (7 days)

Other 12,209

Hosting providers 9,537

Data centers 3,091

Domain extensions 1,581

News desk 1,203

Country pages 455

Methodology: every request is classified by user-agent into a bot family. Counts above include only the ai_llm category (GPTBot, ClaudeBot, PerplexityBot, Google-Extended, etc.). Paths are aggregated into coarse section buckets — we never publish which specific provider profiles, articles, or facility records AI agents request. Raw JSON at /data/agent-activity.json.

Machine endpoints

Every section has a JSON sibling. Append .json to a resource path under /data/ to get a clean JSON-LD record with schema.org typing. CORS is open. All responses include explicit license metadata.

Endpoint	Description
/llms.txt	Curated LLM-friendly index of the site
/llms-full.txt	Generated bundle with top providers, data centers, TLDs, recent news
/entitymap.json	EntityMap v1.0 — structured entity index with semantic relations and sourced evidence. HTML view.
/okf	Open Knowledge Format (v0.1) bundle — markdown + YAML catalog of our datasets (providers, datacenters, TLDs, tools) with schemas, relationships and access points for AI agents.
/data/graph	Knowledge graph manifest. Every entity has a stable, dereferenceable URI at /id/{type}/{slug} that content-negotiates to JSON-LD or Turtle, with schema.org typing, sameAs links to PeeringDB / Hurricane Electric / RIPEstat / IANA, and typed connections. Traverse a neighbourhood at /data/graph/{type}/{slug}.
/sitemap.xml	Full URL index
/robots.txt	Explicit allows for GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and others
/api/openapi.json	OpenAPI 3.1 description of the data endpoints
/news.xml	Atom feed of news desk articles
/data/providers.json	Paginated provider index
/data/providers/{slug}.json	Provider record as JSON-LD Organization
/data/datacenters.json	Paginated data center index
/data/datacenters/{slug}.json	Data center as JSON-LD Place with address and geo
/data/domain-extensions.json	Paginated TLD index
/data/domain-extensions/{slug}.json	TLD record with registry and sponsoring organisation
/data/news.json	Paginated news article index
/data/blog/{slug}.json	News article as JSON-LD NewsArticle with sources and fact-checks

Crawler policy

We do not block major AI crawlers. robots.txt lists explicit Allow directives for GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, Claude-Web, Claude-User, anthropic-ai, Google-Extended, PerplexityBot, Perplexity-User, Applebot-Extended, Amazonbot, meta-externalagent, FacebookBot, CCBot, cohere-ai, Diffbot, DuckAssistBot, YouBot, AndiBot, Bytespider, and ImagesiftBot.

If your agent needs higher throughput or a fresher dump than the JSON endpoints provide, get in touch. We can arrange direct delivery.

Quick start

Fetch a provider record

curl -s https://hostdir.net/data/providers/bluehost.json | jq .

Walk the news desk

curl -s https://hostdir.net/data/news.json | jq '.items[] | {title, url, published_at}'

Subscribe to news (Atom)

curl -s https://hostdir.net/news.xml

Discover everything from llms.txt

curl -s https://hostdir.net/llms.txt

WebMCP (in-page tools)

Live

Every page on HostDir registers itself as a WebMCP server. If your browser agent supports the W3C WebML CG spec — Chrome's model-context flag, Microsoft Edge's Web MCP extension, or the MCP-B bridge to Claude Desktop — it discovers HostDir's tools the moment a tab is on the site. No install, no config, no transport. Just document.modelContext.

Registered tools

search-providers, get-provider
search-datacenters, get-datacenter
search-domain-extensions, get-domain-extension
recent-news, get-article
hostdir-llms-txt

All tools are annotated readOnlyHint: true. None mutates state; agents can call them without consent prompts. News tools carry untrustedContentHint: true as a prompt-injection signal since article bodies are editorial.

Verify in DevTools

// Run in the console on any HostDir page (Chrome with WebMCP enabled)
'modelContext' in document
// → true

// List registered tools
[...document.modelContext.tools.keys()]
// → ["search-providers","get-provider","search-datacenters", ...]

Source: /assets/webmcp.js. The script feature-detects document.modelContext and no-ops on browsers without it, so there's zero cost when the API isn't present.

MCP server

Alpha

HostDir ships a Model Context Protocol server so any MCP-compatible agent (Claude Desktop, Cursor, Cline, custom) can search the directory and pull JSON-LD records as first-class tools. No scraping, no HTML parsing, attribution baked into every response.

Tools exposed

search_providers(query, page) · get_provider(slug)
search_datacenters(query, page) · get_datacenter(slug)
search_domain_extensions(query, page) · get_domain_extension(slug)
recent_news(page) · get_article(slug)
llms_txt(full) — fetch the curated or full LLM index
Resource: hostdir://citation-policy

Install

pip install"mcp[cli]" httpx
curl -O https://hostdir.net/assets/mcp/hostdir_mcp.py

Claude Desktop

Add to claude_desktop_config.json:

{
"mcpServers": {
"hostdir": {
"command":"python",
"args": ["/abs/path/to/hostdir_mcp.py"]
    }
  }
}

Source: hostdir_mcp.py. The server is a thin wrapper over the public JSON endpoints, so you can also call them directly from any HTTP-aware agent without MCP.

Data provenance

Hosting providers: Sourced from public sitemaps, official sites, and verified submissions. Profiles enriched by editorial review. User reviews come from registered accounts only.
Data centers: Primary source is PeeringDB. Records sync weekly. Operator metadata, carriers, and exchanges are preserved with attribution.
Domain extensions: Primary source is the IANA Root Zone Database. Editorial history and context are added by the HostDir team.
News articles: Each story aggregates multiple primary sources. Sources, citations, and fact-checks are included in the _hostdir.sources and _hostdir.fact_checks fields of the article JSON.

Building on top of HostDir?

We want to know. If you are integrating HostDir into an agent, an answer engine, a research tool, or anything else where attribution matters, tell us what you are building. We will help with high-volume access, bulk dumps, schema questions, and corrections.

Get in touch