DataCollection
Technical documentation for this endpoint can be found at https://api.scaigrid.ai/swagger/#/DataCollection.
Data collection is the process of gathering data and information for use in (chat)completions. Data can be gathered from various sources such as OneDrive, file shares, cloud services, APIs, databases or custom-built connectors. Each data-collection process is defined by a series of steps and conversions that prepare the data for use in ScaiGrid.
Indexer
The main purpose of an indexer is to determine which objects are new, modified or deleted, and to prepare the data before it is transferred to ScaiGrid. Each indexer is assigned a specific datastore (vector database) in ScaiGrid.
A default indexer implementation consists of the following steps:
- Indexation
  - Dataset querying
    Retrieves the list of objects (the dataset) from a given datasource. Optionally, the query can include filters, such as specific document extensions or subsets of data from an API or database.
  - Indexing
    Determines new, modified and deleted objects in the dataset using a left-right compare algorithm. The indexer stores these results so that it can detect modifications and deletions in a future run.
  - Metadata extraction
    For each object found through indexation, metadata is extracted. This step extracts metadata from the object information contained in the dataset, not from the contents themselves. This could be a file extension, specific keywords in the filename, the endpoint location on an external API or the table in which the object is stored.
- Pre-processing
  - Object data pulling
    Retrieves the actual content from the dataset.
  - Data scrubbing
    Removes unnecessary information from the object through various iterations of data scrubbing techniques, depending on the kind of data retrieved. This is also known as the clean-up step. An example is the removal of tags from an HTML document.
  - Chunking
    Creates (data) chunks for processing from the actual content.
  - Object metadata extraction
    Extracts metadata from the contents (each chunk) of the object.
  - Tokenization [Optional]
    When possible, tokenization of the data is performed client-side. If tokens are not provided in the endpoint request, ScaiGrid takes care of the tokenization.
- Transferring
  - Batch registration
  - Object registration
  - Chunk transferring
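The indexing step can be illustrated with a minimal sketch of a left-right compare. It assumes that each object in the dataset has a stable key (for example a path or record ID) and a fingerprint that changes when the object changes (for example a content hash or a modification timestamp). Neither the function name `compare_runs` nor this state format comes from ScaiGrid; they are hypothetical and only show the idea.

```python
def compare_runs(previous: dict[str, str], current: dict[str, str]):
    """Compare the stored state of the previous run (left) with the
    freshly queried dataset (right).

    Both arguments map an object key to a fingerprint. The fingerprint
    format is an assumption; any value that changes with the object works.
    Returns three sorted lists: new, modified and deleted keys.
    """
    # Keys only on the right side are new objects.
    new = sorted(current.keys() - previous.keys())
    # Keys only on the left side no longer exist in the datasource.
    deleted = sorted(previous.keys() - current.keys())
    # Keys on both sides are modified when their fingerprints differ.
    modified = sorted(
        key
        for key in current.keys() & previous.keys()
        if current[key] != previous[key]
    )
    return new, modified, deleted


# Hypothetical state of two consecutive runs.
previous_run = {"a.docx": "v1", "b.pdf": "v1", "c.txt": "v1"}
current_run = {"a.docx": "v1", "b.pdf": "v2", "d.md": "v1"}

new, modified, deleted = compare_runs(previous_run, current_run)
```

After the comparison, the indexer would persist `current_run` so that it becomes the left side of the next run.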
If you want to store data for a different purpose, the implementation of the steps above will most likely change. You might need different metadata, different datasource filters or even different data scrubbing techniques. Because an indexer is linked to a specific storage solution, this results in a separate indexer registration in ScaiGrid, including its own storage.
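As an illustration of one such data scrubbing technique, the removal of tags from an HTML document could look like the sketch below. It uses only Python's standard `html.parser` module. The names `scrub_html` and `_TextExtractor` are hypothetical and not part of ScaiGrid, and a real indexer would likely apply several scrubbing passes depending on the data.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text of an HTML document and drops all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep visible text only; script and style bodies are skipped.
        if not self._skip_depth:
            self._parts.append(data)

    def text(self):
        # Join the parts with spaces and collapse all whitespace runs.
        # Note: this also puts a space where a tag splits a single word.
        return " ".join(" ".join(self._parts).split())


def scrub_html(html: str) -> str:
    """Return the visible text of an HTML fragment without tags."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = "<p>Hello <b>world</b> &amp; more</p><script>var x = 1;</script>"
cleaned = scrub_html(sample)
```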
Batch
A batch represents a set of objects that are logically grouped together. Batches are processed in the order in which they are announced: first in, first out (FIFO). In its simplest form, an indexer announces one batch for each indexation job it performs, containing the objects that are listed for addition, modification or deletion.
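The FIFO behaviour can be sketched with a simple queue. This is only a local model of the ordering described above, not the ScaiGrid API; the `Batch` fields and the `BatchQueue` class are hypothetical names chosen for the example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Batch:
    """One announced batch: the result of a single indexation job."""
    batch_id: str
    added: list[str] = field(default_factory=list)
    modified: list[str] = field(default_factory=list)
    deleted: list[str] = field(default_factory=list)


class BatchQueue:
    """Processes batches strictly in the order they were announced."""

    def __init__(self):
        self._queue = deque()

    def announce(self, batch: Batch) -> None:
        # New batches always join the back of the queue.
        self._queue.append(batch)

    def process_next(self):
        # The oldest announced batch is always processed first.
        return self._queue.popleft() if self._queue else None


queue = BatchQueue()
queue.announce(Batch("job-1", added=["a.docx"]))
queue.announce(Batch("job-2", modified=["a.docx"], deleted=["c.txt"]))

first = queue.process_next()
second = queue.process_next()
```

Processing in announcement order matters here: if `job-2` were handled before `job-1`, the modification of `a.docx` would arrive before the object had been added.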
Object
An object is a structured representation of data in the broadest sense. Objects can be concrete things, such as a document or a webpage, but can also be more abstract. The indexer determines the definition of the object and the form in which it is presented. An object can therefore be anything, in any form, that fits the needs of the implementation and the purpose for which it is stored for use in (chat)completions.
Examples of an object are:
- Document
- Webpage or a complete website
- A single (database) record or a complete (database) table
- Response from an API or a collection of responses combined
Chunk
A chunk, also known as a data chunk, contains the smallest logical representation of a part of an object, such as a paragraph of a document. Splitting the object into smaller parts makes it possible to quote and refer to specific parts of an object instead of the complete object. This gives more context and better results in completions. Combined with chunk-specific metadata, the context can be enriched for even better results.
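A minimal sketch of paragraph-based chunking is shown below. It assumes that the object content is already plain text in which paragraphs are separated by blank lines. The function name `chunk_paragraphs` and the metadata fields `object_id` and `position` are hypothetical; the metadata an actual indexer attaches depends on its purpose.

```python
import re


def chunk_paragraphs(text: str, object_id: str) -> list[dict]:
    """Split plain text into one chunk per paragraph.

    Each chunk carries a reference to its object and its position,
    so a completion can point back to a specific part of the object.
    """
    # Paragraphs are assumed to be separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [
        {
            "object_id": object_id,
            "position": index,
            # Collapse line breaks and repeated spaces inside a paragraph.
            "text": " ".join(paragraph.split()),
        }
        for index, paragraph in enumerate(paragraphs)
    ]


document = "First paragraph.\n\nSecond\nparagraph.\n\n\n"
chunks = chunk_paragraphs(document, object_id="doc-1")
```

Splitting on paragraphs is only one possible strategy; fixed-size or sentence-based chunks follow the same pattern with a different split rule.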