DataCollection

Technical documentation for this endpoint can be found at https://api.scaigrid.ai/swagger/#/DataCollection.

Data collection is the process of gathering data and information for use in (chat)completions. Data can be gathered from various sources such as OneDrive, file shares, cloud services, APIs, databases, or custom-built connectors. Each data-collection process is defined by a series of steps and transformations that prepare the data for use in ScaiGrid.

Indexer

The main purpose of an indexer is to determine new, modified, and deleted objects and to prepare the data before it is transferred to ScaiGrid. Each indexer is assigned a specific storage (vector database) in ScaiGrid.

A default indexer implementation consists of the following steps:

  1. Indexation
    1. Dataset querying
      Retrieves the list of objects (the dataset) from a given datasource. Optionally this can include filters, such as specific document extensions or subsets of data from an API or database.
    2. Indexing
      Determines new, modified, and deleted objects in the dataset using a left-right compare algorithm (see the sketch after this list). The indexer stores these results so that modifications and deletions can be detected in a future run.
    3. Metadata extraction
      For each object found through indexation, metadata is extracted. This step extracts metadata from the object information contained in the datasource, not from the contents itself. Examples are a file extension, specific keywords in the filename, the endpoint location on an external API, or the table in which a record is stored.
  2. Pre-processing
    1. Object data pulling
      Retrieves the actual content from the datasource and creates (data)chunks for processing.
    2. Object metadata extraction
      Extracts metadata from the contents (each chunk) of the object.
    3. Data scrubbing
      Removes unnecessary information from the object through various iterations of data scrubbing techniques, depending on the kinds of data retrieved. This is also known as the clean-up step; an example is the removal of tags from an HTML document.
    4. Tokenization [Optional]
      When possible, tokenization of the data is performed client-side. If tokens are not provided in the endpoint requests, ScaiGrid will take care of this.
  3. Transferring
    1. Collection registration
    2. Object registration
    3. Chunk transferring
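
To make the indexing step (1.2) more concrete, the sketch below shows one way a left-right compare of the previously stored index against the freshly queried dataset could classify objects as new, modified, or deleted. The IndexedObject shape, the fingerprint field, and the function name are assumptions made for this example and are not part of the ScaiGrid API.

```python
from dataclasses import dataclass

@dataclass
class IndexedObject:
    object_id: str    # stable identifier of the object within the datasource
    fingerprint: str  # e.g. a content hash or last-modified timestamp

def left_right_compare(previous: list[IndexedObject], current: list[IndexedObject]):
    """Compare the stored index (left) with the freshly queried dataset (right)."""
    left = {o.object_id: o for o in previous}
    right = {o.object_id: o for o in current}

    added = [o for oid, o in right.items() if oid not in left]
    deleted = [o for oid, o in left.items() if oid not in right]
    modified = [o for oid, o in right.items()
                if oid in left and o.fingerprint != left[oid].fingerprint]

    # The current dataset becomes the stored "left" side for the next run,
    # so future modifications and deletions can be detected.
    return added, modified, deleted
```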

If you want to store data for a different purpose, the implementation of the above steps will most likely change: you might need different metadata, different datasource filters, or even different data scrubbing techniques. Because an indexer is linked to a specific storage solution, this results in a separate indexer registration in ScaiGrid, including its own storage.

Batch

A batch represents a set of objects that are logically grouped together. Batches are processed in the order they are announced, first-in-first-out (FIFO). In its simplest form, an indexer announces a collection for each indexation job it performs, containing the objects listed for addition, modification, or deletion.
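
As an illustration only, a batch announced after a single indexation run could look like the structure below. The field names are assumptions made for this sketch; the actual request schema is defined in the Swagger documentation referenced above.

```python
# Hypothetical shape of a batch announced after one indexation run.
# Field names are illustrative; see the Swagger documentation for the
# actual DataCollection schema.
batch = {
    "indexer_id": "fileshare-hr-documents",   # assumed indexer identifier
    "announced_at": "2024-05-01T08:00:00Z",
    "objects": [
        {"object_id": "handbook.docx", "action": "add"},
        {"object_id": "policies.pdf", "action": "modify"},
        {"object_id": "old-org-chart.pptx", "action": "delete"},
    ],
}
```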

Object

An object is a structured representation of data in the broadest sense. Objects can be concrete things, such as a document or a webpage, but can also be more abstract. The indexer determines the definition of the object and the form it is presented in, meaning an object can be anything, in any form, that fits the needs of the implementation and the purpose for which it is stored for use in (chat)completions.

Examples of objects are:

  • A document
  • A webpage or a complete website
  • A single (database) record or a complete (database) table
  • A response from an API or a collection of responses combined
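
To illustrate how different sources can be normalised into the same abstraction, the sketch below represents a document and a database record as objects with the same shape. The fields shown are assumptions for this example, not the ScaiGrid object schema.

```python
# A document pulled from a file share, represented as an object.
document_object = {
    "object_id": "shares/hr/handbook.docx",
    "source": "fileshare",
    "metadata": {"extension": ".docx", "department": "hr"},
}

# A single database record, represented as an object with the same shape.
record_object = {
    "object_id": "crm.customers/10482",
    "source": "database",
    "metadata": {"table": "customers", "primary_key": 10482},
}
```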

Chunk

A chunk, also known as a data chunk, contains the smallest logical representation of a part of an object, such as a paragraph of a document. By splitting the object into smaller parts it is possible to quote and refer to specific parts of an object instead of referring to the complete object itself. This gives more context and better results in completions. Combined with chunk-specific metadata, the context can be enriched for even better results.
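
As a simple illustration, the snippet below splits a plain-text object into paragraph-level chunks and attaches positional metadata so completions can quote or refer to a specific part. The chunk structure shown is an assumption for this example, not the ScaiGrid chunk schema.

```python
def split_into_chunks(object_id: str, text: str) -> list[dict]:
    """Split a plain-text object into paragraph chunks with positional metadata."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [
        {
            "object_id": object_id,
            "chunk_index": i,            # position of the chunk within the object
            "content": paragraph,
            "metadata": {"kind": "paragraph"},
        }
        for i, paragraph in enumerate(paragraphs)
    ]
```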