Bulkgrid is easier to adopt once the core concepts are clear. Customers usually care about five things:
  • what content is being ingested
  • how that content is grouped
  • how processing is tracked
  • what output is produced
  • how those outputs are consumed later

Core objects

Sources

The origin of content Bulkgrid processes.

Collections

Retrieval and access boundaries for grouped content.

Runs

The top-level record for asynchronous work.

Results

Per-item outputs produced by a run.

Sources

A source is where the content Bulkgrid processes comes from. In practice, a source is usually one of these:
  • a public website or site section
  • a known list of URLs
  • a starting URL for deep crawl
  • a document discovered during processing
Customers usually think about sources in terms of scope and trust:
  • which domains should be included
  • which paths should be excluded
  • which source types are allowed in a given workflow
  • whether the source is stable enough for production retrieval
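The scope questions above (which domains to include, which paths to exclude) can be sketched as a small URL filter. This is an illustration only: `SourceScope`, its fields, and the example domain are hypothetical names chosen for this sketch and are not part of Bulkgrid's API.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class SourceScope:
    """Hypothetical scope definition; not a Bulkgrid type."""
    include_domains: frozenset[str]
    exclude_path_prefixes: tuple[str, ...] = ()

    def allows(self, url: str) -> bool:
        """Return True if the URL is on an included domain and not under an excluded path."""
        parts = urlparse(url)
        host = (parts.hostname or "").lower()
        if host not in self.include_domains:
            return False
        path = parts.path or "/"
        return not any(path.startswith(prefix) for prefix in self.exclude_path_prefixes)


# Example scope: one documentation domain, with two path prefixes excluded.
scope = SourceScope(
    include_domains=frozenset({"docs.example.com"}),
    exclude_path_prefixes=("/archive/", "/drafts/"),
)
```

Keeping the scope as explicit data makes it easy to review before a source is used for production retrieval.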

Collections

A collection is the boundary used to group content for retrieval and access control. Collections matter because most teams do not want one undifferentiated search corpus; they want to separate knowledge by product, workflow, audience, or trust level.
Typical collection patterns:
  • public documentation
  • internal operations knowledge
  • support content
  • product-specific content domains
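As a rough sketch of how collections act as a retrieval boundary, the snippet below restricts a caller to the collections that match its audience. The `Collection` record, the `audience` field, and the collection names are assumptions made for illustration; Bulkgrid's actual access model may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Collection:
    """Hypothetical collection record; field names are illustrative."""
    name: str
    audience: str  # who may retrieve from it, e.g. "public" or "internal"


COLLECTIONS = [
    Collection("public-docs", audience="public"),
    Collection("support-content", audience="public"),
    Collection("internal-ops", audience="internal"),
]


def searchable(collections: list[Collection], caller_audiences: set[str]) -> list[str]:
    """Return the names of collections the caller is allowed to retrieve from."""
    return [c.name for c in collections if c.audience in caller_audiences]
```

A public-facing assistant would query only the collections returned for `{"public"}`, while an internal tool could pass both audiences.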

Runs

A run is the top-level record for asynchronous work. Runs are created for workflows such as:
  • extraction
  • crawl
  • deep crawl
  • run-based API operations
Each run tracks operational state such as:
  • status
  • timestamps
  • URL scope
  • progress counters
  • error fields
  • retry state
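One way to picture the operational state listed above is as a single record per run. The sketch below is a local model only: the field names, the status values, and the `progress` calculation are assumptions for illustration, not Bulkgrid's documented schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class RunStatus(str, Enum):
    # Assumed status values; the real set may differ.
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class Run:
    """Hypothetical run record mirroring the state a run tracks."""
    run_id: str
    workflow: str                      # e.g. "extraction", "crawl", "deep crawl"
    url_scope: list[str]
    status: RunStatus = RunStatus.QUEUED
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    finished_at: Optional[datetime] = None
    total_items: int = 0               # progress counters
    completed_items: int = 0
    failed_items: int = 0
    error: Optional[str] = None        # run-level error field
    retry_count: int = 0               # retry state

    @property
    def is_terminal(self) -> bool:
        """True once the run can no longer change state."""
        return self.status in (RunStatus.COMPLETED, RunStatus.FAILED)

    @property
    def progress(self) -> float:
        """Fraction of items already handled, whether they succeeded or failed."""
        if self.total_items == 0:
            return 0.0
        return (self.completed_items + self.failed_items) / self.total_items
```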

Results

Results are the per-item outputs of a run. A single run can produce many results. A result usually represents one processed page, document, or item-level output. Results can include:
  • URL and title
  • status code
  • extraction output
  • generated content references
  • screenshot-related data
  • error information for that item
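Because one run can produce many results, and each result carries its own status and error information, consumers typically separate usable items from failed ones. The sketch below assumes a hypothetical `Result` shape and treats a 2xx status code with no error as success; both are illustrative choices rather than Bulkgrid's defined behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    """Hypothetical per-item result; field names are illustrative."""
    url: str
    title: Optional[str] = None
    status_code: Optional[int] = None
    extraction: Optional[dict] = None
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        """Assumed success rule: no item-level error and a 2xx status code."""
        return (
            self.error is None
            and self.status_code is not None
            and 200 <= self.status_code < 300
        )


def split_results(results: list[Result]) -> tuple[list[Result], list[Result]]:
    """Partition results into (succeeded, failed) while preserving order."""
    succeeded = [r for r in results if r.ok]
    failed = [r for r in results if not r.ok]
    return succeeded, failed
```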

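How the concepts fit together

A source defines where content comes from, a collection groups that content for retrieval and access control, a run tracks the asynchronous work performed on it, and results are the per-item outputs that run produces.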

Practical rule

Customers should think about the model in this order:
  1. define the source boundary
  2. decide which collection the content belongs to
  3. create the run
  4. monitor the run and review its results
  5. consume only the result outputs your application actually needs
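The last two steps can be sketched as a polling loop followed by a field projection. The code takes a caller-supplied `fetch_status` function instead of naming any Bulkgrid endpoint, and the terminal status strings are assumptions; substitute whatever your client actually returns.

```python
import time
from typing import Callable

# Assumed terminal status values; adjust to match the real API.
TERMINAL = {"completed", "failed"}


def wait_for_run(
    fetch_status: Callable[[], str],
    poll_interval: float = 2.0,
    timeout: float = 300.0,
) -> str:
    """Poll until the run reaches a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("run did not reach a terminal status in time")
        time.sleep(poll_interval)


def pick_fields(result: dict, wanted: tuple[str, ...]) -> dict:
    """Keep only the result fields the application actually needs."""
    return {key: result[key] for key in wanted if key in result}
```

Projecting each result down to the fields you use keeps downstream storage small and avoids coupling your application to outputs it never reads.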