The Crawl API allows you to programmatically crawl and index websites into your knowledge base. This enables your AI agents to access and reference content from your website, documentation, or any other web-based resources when responding to customer inquiries.

Overview

Website crawling enables:
  • Automated Content Indexing - Automatically extract and index content from websites
  • Knowledge Base Integration - Crawled content is added directly to your knowledge base
  • Real-time Status Tracking - Monitor crawl progress and completion status
  • Flexible Configuration - Control include/exclude paths, page limits, and crawl intervals
  • Page Management - Exclude, include, delete, or resync individual pages
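To make the configuration options above concrete, a datasource creation payload might look roughly like the following sketch. The field names (`include_paths`, `max_pages`, and so on) are illustrative assumptions, not the documented schema:

```python
import json

# Hypothetical datasource configuration illustrating the options above.
# Field names are assumptions for illustration, not the actual API schema.
datasource = {
    "url": "https://docs.example.com",
    "include_paths": ["/docs/*"],            # only crawl documentation pages
    "exclude_paths": ["/docs/changelog/*"],  # skip high-churn content
    "max_pages": 500,                        # page limit for the crawl
    "recrawl_interval_hours": 24,            # scheduled recrawl cadence
}
print(json.dumps(datasource, indent=2))
```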

How It Works

  1. Create a Datasource - Provide a website URL and configuration options
  2. Crawl Starts Automatically - By default, a crawl begins immediately after creation
  3. Monitor Progress - Check crawl status and track page processing
  4. Manage Pages - Review crawled pages, exclude irrelevant ones, or resync outdated content
  5. Scheduled Recrawls - Datasources automatically recrawl on a configurable interval
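The create-then-monitor flow above typically means polling the crawl job until it reaches a terminal status. A minimal sketch of that pattern, with the HTTP call abstracted behind a `get_status` callable (a stand-in for whatever request your client actually makes):

```python
import time

# Terminal states a crawl job can end in, per the status list below.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}

def wait_for_crawl(get_status, poll_interval=5.0, timeout=600.0):
    """Poll a crawl job until it reaches a terminal status.

    `get_status` is any callable returning the job's current status
    string -- e.g. a thin wrapper around your HTTP client. It is a
    placeholder, not part of the actual API.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_interval)
    raise TimeoutError("crawl did not finish within the timeout")

# Demonstration with a simulated status sequence instead of real HTTP calls:
statuses = iter(["pending", "scraping", "scraping", "completed"])
result = wait_for_crawl(lambda: next(statuses), poll_interval=0.01)
print(result)  # completed
```

In a real client, `get_status` would issue the crawl-status request and extract the status field from the response.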

Crawl Job Statuses

  • pending - Crawl job created, waiting to start
  • scraping - Crawl is actively running and extracting content
  • completed - Crawl finished successfully, content has been indexed
  • failed - Crawl encountered an error and could not complete
  • cancelled - Crawl was manually cancelled before completion
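Client code often needs to decide what to do next for each of these statuses. One illustrative convention, using the status strings from the list above (the action labels themselves are assumptions, not part of the API):

```python
def next_action(status):
    """Suggest a client-side follow-up for each crawl job status.

    The status strings come from the Crawl API documentation; the
    action labels are just an illustrative convention.
    """
    actions = {
        "pending": "wait",
        "scraping": "wait",
        "completed": "done",
        "failed": "retry or inspect error",
        "cancelled": "restart if still needed",
    }
    return actions.get(status, "unknown status")

print(next_action("failed"))  # retry or inspect error
```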

Page Sync Statuses

  • synced - Page content is indexed in the knowledge base
  • pending - Page is waiting to be synced
  • error - Page failed to sync
  • excluded - Page is excluded from syncing

Note: Crawling large websites can take significant time and resources. Use include/exclude paths to focus on relevant content, and set appropriate page limits.
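When reviewing crawled pages, it can help to group them by sync status so that, for example, `error` pages can be resynced and `excluded` ones skipped. A sketch, assuming each page is a dict with `url` and `status` keys (an illustrative shape, not the documented response schema):

```python
from collections import defaultdict

def group_pages_by_status(pages):
    """Bucket crawled pages by their sync status.

    Each page is assumed to be a dict with 'url' and 'status' keys;
    the real API response shape may differ.
    """
    buckets = defaultdict(list)
    for page in pages:
        buckets[page["status"]].append(page["url"])
    return dict(buckets)

pages = [
    {"url": "/docs/intro", "status": "synced"},
    {"url": "/docs/setup", "status": "error"},
    {"url": "/internal/admin", "status": "excluded"},
]
grouped = group_pages_by_status(pages)
print(grouped["error"])  # ['/docs/setup']
```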

Available Endpoints

Datasource Management

Crawl Operations

Page Management