The Website source crawls a domain you control and ingests its pages into the AI's training index. Use it when the content you want the AI to know about is reachable at a public URL (a help center, a product docs site, a set of marketing pages) and you don't have a direct integration for the system that serves it.
Documentation Index
Fetch the complete documentation index at: https://docs.open.cx/llms.txt
Use this file to discover all available pages before exploring further.
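If you want to script that discovery step, here is a minimal sketch in TypeScript. It assumes a Node 18+ runtime (for the global `fetch`) and that llms.txt uses standard Markdown links; neither assumption comes from this page.

```ts
// Minimal sketch: fetch the documentation index and list the pages it links to.
// Assumes Node 18+ (global fetch) and Markdown-style links in llms.txt.

async function listDocPages(): Promise<string[]> {
  const res = await fetch("https://docs.open.cx/llms.txt");
  if (!res.ok) throw new Error(`Failed to fetch index: ${res.status}`);
  const text = await res.text();

  // Pull every URL that appears inside a Markdown link target.
  return [...text.matchAll(/\((https?:\/\/[^)\s]+)\)/g)].map((m) => m[1]);
}

listDocPages().then((pages) =>
  console.log(`${pages.length} pages found`, pages.slice(0, 5))
);
```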
When to use the crawler vs a direct integration
| Situation | Use |
|---|---|
| You publish your help center through Zendesk, Intercom, or Front | The dedicated Zendesk / Intercom / Front source |
| Your content lives in Confluence, Notion, GitBook, or Freshdesk | The dedicated source for that tool |
| Your content is reachable at a public URL and nothing above applies | The crawler |
| The site is gated behind auth | Not today — the crawler only fetches public URLs |
What the crawler does
- Kicks off a crawl against the URL you provide.
- Follows links within the domain, up to `page_limit` pages (default 100, max 5000).
- Extracts each page's main content as Markdown, skipping navigation chrome and binary assets.
- Tracks a content hash per page, so re-crawls re-index only the pages whose content changed.
- Re-runs on a schedule you pick (`crawl_interval_hours`, default 168 hours / 7 days).
- Exposes every discovered page so you can exclude individual URLs, re-include ones you excluded, or force-resync a single page; see the sketch after this list.
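The Crawl API page linked under Related Documentation covers programmatic control. As a rough illustration of how these knobs fit together, here is a hedged TypeScript sketch: the base URL, endpoint paths, auth scheme, and `OPEN_API_KEY` variable are all assumptions for illustration, while `page_limit` and `crawl_interval_hours` are the parameters described above.

```ts
// Illustrative sketch only: the endpoint paths and payload shapes below are
// assumptions, not the documented Crawl API. page_limit and crawl_interval_hours
// are the real parameters described above (defaults: 100 pages, 168 hours).

const BASE = "https://api.open.cx"; // assumed base URL
const headers = {
  Authorization: `Bearer ${process.env.OPEN_API_KEY}`, // hypothetical auth scheme
  "Content-Type": "application/json",
};

// Kick off a crawl of a public docs site.
async function createWebsiteDatasource(url: string) {
  const res = await fetch(`${BASE}/v1/datasources`, { // assumed path
    method: "POST",
    headers,
    body: JSON.stringify({
      type: "website",
      url,                       // must be publicly reachable; no auth-gated sites
      page_limit: 500,           // default 100, max 5000
      crawl_interval_hours: 168, // default: re-crawl weekly
    }),
  });
  if (!res.ok) throw new Error(`Create failed: ${res.status}`);
  return res.json();
}

// Exclude a discovered page from indexing (assumed path).
async function excludePage(datasourceId: string, pageId: string) {
  await fetch(`${BASE}/v1/datasources/${datasourceId}/pages/${pageId}/exclude`, {
    method: "POST",
    headers,
  });
}
```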
Related Documentation
- Connect a website: URL, limits, include/exclude paths, crawl interval.
- Troubleshooting: stuck crawls, locale scoping, pages not indexing.
- Crawl API: programmatic control of datasources, crawls, and pages.
- Connect a knowledge source: decision matrix for all sources.