
Understanding Datasets

Datasets are the core organizational unit in Gaia. A dataset is a named collection of indexed data from one or more Cohesity-managed sources. Every query you send to Gaia runs against one or more datasets.


What is a Dataset?

Think of a dataset as a searchable index over a slice of your enterprise data. When you create a dataset, you tell Gaia:

  • Which data sources to include (backup snapshots, file shares, M365 mailboxes, etc.)
  • A name to reference it by in API calls

Gaia then indexes the content — extracting text, chunking it, generating embeddings, and storing everything in its vector store. Once indexing is complete, the dataset is ready for queries.

Text Only
┌──────────────────────┐        ┌────────────────────────┐
│     Data Sources     │        │        Dataset         │
│                      │        │  "financial-reports"   │
│  • Backup snapshots  │───────▶│                        │
│  • File shares       │        │  Status: ready         │
│  • M365 mailboxes    │        │  Objects: 1,247        │
│  • Uploaded files    │        │  Last indexed: 2h ago  │
└──────────────────────┘        └───────────┬────────────┘
                                            │
                                ┌───────────▼────────────┐
                                │      Vector Store      │
                                │      (Embeddings)      │
                                └────────────────────────┘
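The extract, chunk, embed flow above is internal to Gaia, but the chunking step is easy to illustrate. A minimal sketch with arbitrary sizes; `chunk_text` is a hypothetical helper, not part of the SDK:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks, mimicking (in miniature) the
    chunking step that precedes embedding generation."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


document = "A" * 500  # stand-in for text extracted from a source document
chunks = chunk_text(document)
print(len(chunks))  # 4 chunks, starting at offsets 0, 150, 300, 450
```

Overlap between adjacent chunks helps preserve context that would otherwise be cut at a chunk boundary.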

Dataset Lifecycle

Every dataset goes through the following stages:

Text Only
Create  ──▶  Configure Sources  ──▶  Index  ──▶  Query
                                       ▲            │
                                       │            │
                                       └── Re-index ┘

1. Create

A new, empty dataset is created with a name and optional description.

2. Configure Sources

Data sources are attached to the dataset — these define what data Gaia should index.

3. Index

Gaia processes the data sources: extracts text from documents, splits content into chunks, generates vector embeddings, and populates the vector store. This can take minutes to hours depending on data volume.

4. Query

Once indexing completes, the dataset enters the ready state and can be queried via /ask, /ask/stream, or /ask/exhaustive.

Re-index

When source data changes (new backups, updated files), you can trigger re-indexing to keep the dataset current.


Dataset States

| State | Description |
|-------|-------------|
| ready | Indexing is complete. The dataset is queryable. |
| indexing | Data is currently being indexed. Queries return results only over data indexed so far. |
| error | Indexing failed. Check the error details and data source configuration. |
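A client can use these states to decide whether a query is worth sending. A small helper, hypothetical but using the status strings from the table above:

```python
QUERYABLE_STATES = {"ready", "indexing"}


def can_query(status: str) -> bool:
    """True if a dataset in this state accepts queries.

    Datasets in the 'indexing' state are queryable, but return results
    only over the data indexed so far.
    """
    return status in QUERYABLE_STATES


print(can_query("ready"))     # True
print(can_query("indexing"))  # True, though results may be incomplete
print(can_query("error"))     # False
```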

Querying during indexing

You can query a dataset while it's in the indexing state. Gaia will search whatever has been indexed so far. Results may be incomplete until indexing finishes.


Types of Data Sources

Gaia can index data from any source that Cohesity manages:

| Source Type | Examples |
|-------------|----------|
| Backup Snapshots | VM backups, NAS backups, database backups — query data from any point-in-time snapshot |
| File Shares | SMB/NFS shares protected by Cohesity |
| Microsoft 365 | Exchange mailboxes, OneDrive files, SharePoint sites |
| Databases | SQL Server, Oracle, and other databases backed up by Cohesity |
| Uploaded Files | Ad-hoc documents uploaded directly via the Gaia upload API |

Data stays in Cohesity

Gaia doesn't copy your data to an external service. Indexing and vector storage happen within the Cohesity platform, preserving your existing security perimeter.


API Operations

List All Datasets

Retrieve a list of all available datasets.

Python
async with GaiaClient.from_env() as gaia:
    datasets = await gaia.list_datasets()

    for ds in datasets:
        print(f"{ds.name:30s}  status={ds.status}  objects={ds.object_count}")

Response fields on each dataset:

| Field | Type | Description |
|-------|------|-------------|
| name | str | Unique dataset name |
| status | str | Current state: ready, indexing, or error |
| description | str? | Optional description |
| objectCount | int? | Number of indexed objects |
| createdAt | str? | ISO 8601 creation timestamp |
| updatedAt | str? | ISO 8601 last-updated timestamp |
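Note that the field names above use the API's camelCase, while the SDK examples on this page read snake_case attributes (e.g. ds.object_count). A simplified sketch of that mapping; the real SDK model may differ:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Dataset:
    """Simplified stand-in for the SDK's dataset model (hypothetical)."""
    name: str
    status: str
    description: Optional[str] = None
    object_count: Optional[int] = None
    created_at: Optional[str] = None
    updated_at: Optional[str] = None

    @classmethod
    def from_api(cls, payload: dict) -> "Dataset":
        # Map the API's camelCase keys onto snake_case attributes.
        return cls(
            name=payload["name"],
            status=payload["status"],
            description=payload.get("description"),
            object_count=payload.get("objectCount"),
            created_at=payload.get("createdAt"),
            updated_at=payload.get("updatedAt"),
        )


ds = Dataset.from_api({"name": "financial-reports", "status": "ready", "objectCount": 1247})
print(ds.object_count)  # 1247
```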

Get Dataset Details

Retrieve detailed information about a specific dataset, including its data sources and indexing statistics.

Python
async with GaiaClient.from_env() as gaia:
    details = await gaia.get_dataset("financial-reports")

    print(f"Name:    {details.name}")
    print(f"Status:  {details.status}")
    print(f"Objects: {details.object_count}")

    if details.data_sources:
        print("\nData Sources:")
        for src in details.data_sources:
            print(f"  • {src.get('type', 'unknown')}: {src.get('name', 'N/A')}")

    if details.indexing_stats:
        stats = details.indexing_stats
        print("\nIndexing Stats:")
        print(f"  Total docs:   {stats.get('totalDocuments', 'N/A')}")
        print(f"  Indexed docs: {stats.get('indexedDocuments', 'N/A')}")
        print(f"  Failed docs:  {stats.get('failedDocuments', 'N/A')}")

Create a Dataset

Create a new dataset programmatically.

Python
async with GaiaClient.from_env() as gaia:
    result = await gaia.create_dataset(
        name="incident-logs-2025",
        description="Server and application logs from 2025 backups",
    )
    print(f"Created dataset: {result}")

Dataset creation is just step one

Creating a dataset via the API gives you an empty dataset. Data sources are typically configured through the Cohesity UI, though some configurations can also be set via the API. After configuring sources, you need to trigger indexing.

Trigger Indexing

Start or restart the indexing process for a dataset.

Python
async with GaiaClient.from_env() as gaia:
    result = await gaia.trigger_indexing("incident-logs-2025")
    print(f"Indexing triggered: {result}")

Delete a Dataset

Remove a dataset and its indexed data.

Python
async with GaiaClient.from_env() as gaia:
    result = await gaia.delete_dataset("old-dataset")
    print(f"Deleted: {result}")

Deletion is permanent

Deleting a dataset removes all indexed data (embeddings, chunks, metadata) from the vector store. The original source data in Cohesity is not affected.


Monitoring Indexing Progress

For large datasets, indexing can take a significant amount of time. Here's how to monitor progress:

Python
import asyncio
from gaia_sdk import GaiaClient


async def wait_for_indexing(dataset_name: str, poll_interval: int = 30):
    """Poll a dataset until indexing is complete."""
    async with GaiaClient.from_env() as gaia:
        while True:
            details = await gaia.get_dataset(dataset_name)

            if details.status == "ready":
                print(f"\n✓ Dataset '{dataset_name}' is ready!")
                if details.indexing_stats:
                    stats = details.indexing_stats
                    print(f"  Total documents:   {stats.get('totalDocuments', 'N/A')}")
                    print(f"  Indexed documents: {stats.get('indexedDocuments', 'N/A')}")
                return details

            if details.status == "error":
                print(f"\n✗ Dataset '{dataset_name}' encountered an error.")
                return details

            # Still indexing
            progress = ""
            if details.indexing_stats:
                stats = details.indexing_stats
                total = stats.get("totalDocuments", 0)
                indexed = stats.get("indexedDocuments", 0)
                if total > 0:
                    pct = (indexed / total) * 100
                    progress = f" ({indexed}/{total} = {pct:.0f}%)"

            print(f"  ⟳ Indexing{progress}... checking again in {poll_interval}s")
            await asyncio.sleep(poll_interval)


asyncio.run(wait_for_indexing("incident-logs-2025"))

Best Practices

Naming Conventions

Use clear, descriptive, lowercase names with hyphens:

| Good | Bad |
|------|-----|
| financial-reports-2025 | FinReports |
| it-runbooks-prod | dataset1 |
| hr-policies-global | HR stuff |
| m365-ceo-mailbox | mailbox |
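The convention can be checked mechanically before creating a dataset. A sketch; the pattern below encodes this page's convention, not a rule Gaia itself enforces:

```python
import re

# Lowercase words separated by single hyphens, as in the "Good" column above.
NAME_PATTERN = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def is_valid_name(name: str) -> bool:
    """Check a dataset name against the lowercase-hyphen convention."""
    return bool(NAME_PATTERN.match(name))


print(is_valid_name("financial-reports-2025"))  # True
print(is_valid_name("HR stuff"))                # False (uppercase, space)
```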

Dataset Organization

One dataset per domain or use case

Organize datasets by business domain or use case rather than by technical source. A dataset called incident-investigation that pulls from server logs, ticketing systems, and runbooks is more useful than three separate datasets you'd need to query individually.

  • Group related data. Combine data sources that are commonly queried together into a single dataset.
  • Separate by access control. If different user groups should access different data, use separate datasets and control access at the session/application level.
  • Create temporal datasets. For compliance or forensics, create datasets scoped to specific time ranges (e.g., logs-q1-2025).

Refresh Schedules

  • Static data (historical archives, compliance snapshots) — Index once, no refresh needed.
  • Slowly changing data (policies, runbooks, documentation) — Re-index weekly or on change.
  • Frequently changing data (logs, emails, tickets) — Re-index daily or trigger re-indexing when new backups complete.
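These guidelines can be encoded as a simple lookup for a re-indexing scheduler. The category names and intervals below are illustrative choices, not SDK constants:

```python
from datetime import timedelta
from typing import Optional

# Illustrative mapping of the categories above to re-index intervals.
REFRESH_INTERVALS = {
    "static": None,                 # index once, never refresh
    "slow": timedelta(weeks=1),     # policies, runbooks, documentation
    "frequent": timedelta(days=1),  # logs, emails, tickets
}


def next_refresh(category: str) -> Optional[timedelta]:
    """Return how long to wait before re-indexing, or None for never."""
    return REFRESH_INTERVALS[category]


print(next_refresh("frequent"))  # 1 day, 0:00:00
```

A scheduler could combine this with trigger_indexing to keep each dataset on its own cadence.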

Querying Multiple Datasets

You can query across multiple datasets in a single /ask call by passing multiple names:

Python
async with GaiaClient.from_env() as gaia:
    response = await gaia.ask(
        dataset_names=["it-runbooks", "incident-logs-2025"],
        query="How do we handle a database failover?",
    )

Gaia merges results from all specified datasets and uses the combined context to generate the answer. This is powerful for cross-domain queries like "What was the root cause of the March 5 outage?" that might span infrastructure logs, runbooks, and incident tickets.

Performance consideration

Querying multiple datasets is slightly slower than querying a single dataset because Gaia searches each dataset's vector store independently and then merges the results. For latency-sensitive applications, prefer a single well-organized dataset.
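To see why the merge adds latency, here is a hypothetical sketch of that step: each dataset's hit list is produced independently, then combined into a single ranked list. Gaia's actual ranking is internal; the score-based merge below is only an illustration.

```python
import heapq


def merge_results(per_dataset: dict[str, list[dict]], k: int = 3) -> list[dict]:
    """Merge per-dataset search hits into a single top-k list by score.

    Each hit is a dict with at least a 'score' key (higher = more relevant).
    """
    all_hits = [hit for hits in per_dataset.values() for hit in hits]
    return heapq.nlargest(k, all_hits, key=lambda h: h["score"])


hits = merge_results({
    "it-runbooks": [{"doc": "failover-runbook", "score": 0.92}],
    "incident-logs-2025": [
        {"doc": "db-outage-log", "score": 0.88},
        {"doc": "unrelated-log", "score": 0.31},
    ],
}, k=2)
print([h["doc"] for h in hits])  # ['failover-runbook', 'db-outage-log']
```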


Next Steps

Now that you understand datasets, you're ready to build: