DataCollection

Documentation for this endpoint can be found at https://api.scaigrid.ai/swagger/#/DataCollection.

Data-collection is the process of gathering data and information for use in (chat)completions. Data can be gathered from various sources such as OneDrive, file shares, cloud services, APIs, databases or custom-built connectors. Each data-collection process is defined by a series of steps that convert the data for use in ScaiGrid.

Indexer

The main purpose of an indexer is to detect new, modified and deleted objects and to prepare the data before it is transferred to ScaiGrid. Each indexer is assigned its own storage in ScaiGrid in the form of a vector database.

A default indexer implementation consists of the following steps:

  1. Indexation
    1. Datasource querying
      Retrieves the list of objects from the datasource. Optionally, a filter can be applied, such as specific document extensions or subsets of data from an API or database.
    2. Indexing
      Uses the list of objects retrieved from the datasource to determine new, modified and deleted objects using a left-right compare algorithm. The indexer will store these results to determine a modification or deletion in a future run.
    3. Metadata extraction
      Metadata is extracted for each object found through indexation. This step extracts metadata from the object information held by the datasource, not from the contents of the object. Examples are a file extension, specific keywords in the filename, the endpoint location on an external API or the table in which the object is stored.
  2. Pre-processing
    1. Object data pulling
      Retrieves the actual content from the datasource and creates (data)chunks for processing.
    2. Object metadata extraction
      Extracts metadata from the contents of the object, that is, from each chunk created in the previous step.
    3. Data scrubbing
      Removes unnecessary information from the object through one or more iterations of data scrubbing techniques, depending on the kind of data retrieved. This is also known as the clean-up step. An example is the removal of tags from an HTML document.
    4. Tokenization [Optional]
      When possible, tokenization of the data is performed client-side. If tokens are not provided in the endpoint requests, ScaiGrid will take care of tokenization.
  3. Transferring
    1. Collection registration
    2. Object registration
    3. Chunk transferring
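
The indexing step above can be sketched as a comparison between the object list stored from the previous run (left) and the list retrieved in the current run (right). This is only an illustration under assumptions: the names `compare_runs`, `previous_run` and `current_run` are hypothetical, and the use of a version or hash string per object is an assumed way to detect modifications, not something ScaiGrid prescribes.

```python
def compare_runs(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two runs, each mapping an object id to a version marker (e.g. a hash).

    Hypothetical helper: the marker could be a content hash, an ETag or a
    last-modified timestamp, depending on what the datasource offers.
    """
    previous_ids = set(previous)
    current_ids = set(current)
    return {
        # Present now, absent in the stored results of the previous run.
        "new": sorted(current_ids - previous_ids),
        # Present in the previous run, absent now.
        "deleted": sorted(previous_ids - current_ids),
        # Present in both runs, but the version marker differs.
        "modified": sorted(
            object_id
            for object_id in previous_ids & current_ids
            if previous[object_id] != current[object_id]
        ),
    }


# Stored results of the previous run (left) and the fresh datasource listing (right).
previous_run = {"a.docx": "v1", "b.pdf": "v1", "c.txt": "v1"}
current_run = {"a.docx": "v1", "b.pdf": "v2", "d.md": "v1"}

changes = compare_runs(previous_run, current_run)
print(changes)
```

After the comparison, the indexer would persist `current_run` so that the next run can use it as its left-hand side.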

If you want to store data for a different purpose, the implementation of the above steps will most likely change. You might need different metadata, different datasource filters or even different data scrubbing techniques. Because an indexer is linked to a specific storage solution, this results in a separate indexer registration in ScaiGrid, including its own storage.
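
To illustrate one data scrubbing technique, the removal of tags from an HTML document, here is a minimal sketch that uses only Python's standard `html.parser` module. The class `TextExtractor`, the function `scrub_html` and the chosen set of block-level tags are assumptions made for this example; a real indexer would pick scrubbing techniques that fit its own data.

```python
from html.parser import HTMLParser

# Assumed set of tags that should act as word boundaries in the output.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
# Tags whose contents are never useful as text.
SKIPPED_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collects visible text and drops tags, scripts and stylesheets."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._parts.append(" ")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._parts.append(" ")

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def text(self) -> str:
        # Collapse runs of whitespace left behind by the removed markup.
        return " ".join("".join(self._parts).split())


def scrub_html(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Title</h1><p>Hello &amp; <b>welcome</b>.</p>"
    "<script>alert(1)</script></body></html>"
)
print(scrub_html(sample))
```

Other kinds of data, such as database records or API responses, would need different scrubbing steps, which is one reason an indexer for a different purpose tends to be a separate implementation.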

Collection

A collection represents a set of objects that are logically grouped together. Collections are processed in the order in which they are announced: first-in-first-out (FIFO). In its simplest form, an indexer announces one collection per indexation run, containing the objects that are listed for addition, modification or deletion.

Object

An object is a structured representation of data in the broadest sense. Objects can be concrete things, such as a document or a webpage, but can also be more abstract. The indexer determines the definition of the object and the form in which it is presented. An object can therefore be anything, in any form, that fits the needs of the implementation and the purpose for which the object is stored for use in (chat)completions.

Examples of an object are:

  • Document
  • Webpage or a complete website
  • A single (database) record or a complete (database) table
  • Response from an API or a collection of responses combined

Chunk

A chunk, also known as a data-chunk, contains the smallest logical representation of a part of an object, such as a paragraph of a document. By splitting the object into smaller parts it is possible to quote and refer to specific parts of an object instead of referring to the complete object itself. This gives more context and better results in completions. Combined with specific metadata for the chunk the context can be enriched for even better results.
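
As a rough illustration of splitting an object into chunks, the sketch below breaks a plain-text document into one chunk per paragraph and attaches some chunk-level metadata. The `Chunk` structure, the function `chunk_by_paragraph` and the metadata keys are assumptions for this example; the actual chunk size and metadata depend on the indexer implementation.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical chunk: a part of an object plus metadata describing that part."""
    object_id: str
    index: int
    text: str
    metadata: dict[str, int] = field(default_factory=dict)


def chunk_by_paragraph(object_id: str, text: str) -> list[Chunk]:
    # Paragraphs are assumed to be separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [
        Chunk(
            object_id=object_id,
            index=i,
            text=paragraph,
            # The metadata lets a completion refer to a specific part of the object.
            metadata={"paragraph": i + 1, "characters": len(paragraph)},
        )
        for i, paragraph in enumerate(paragraphs)
    ]


document = "First paragraph.\n\nSecond paragraph,\nstill second.\n\n\nThird."
chunks = chunk_by_paragraph("handbook.txt", document)
for chunk in chunks:
    print(chunk.index, chunk.metadata, repr(chunk.text))
```

Because every chunk keeps its `object_id` and its position, a completion can quote the second paragraph of `handbook.txt` instead of pointing at the whole document.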