Understanding Datasets¶
Datasets are the core organizational unit in Gaia. A dataset is a named collection of indexed data from one or more Cohesity-managed sources. Every query you send to Gaia runs against one or more datasets.
What is a Dataset?¶
Think of a dataset as a searchable index over a slice of your enterprise data. When you create a dataset, you tell Gaia:
- Which data sources to include (backup snapshots, file shares, M365 mailboxes, etc.)
- A name to reference it by in API calls
Gaia then indexes the content — extracting text, chunking it, generating embeddings, and storing everything in its vector store. Once indexing is complete, the dataset is ready for queries.
```
┌─────────────────────┐
│    Data Sources     │       ┌───────────────────────┐
│                     │       │        Dataset        │
│ • Backup snapshots  │──────▶│  "financial-reports"  │
│ • File shares       │       │                       │
│ • M365 mailboxes    │       │  Status: ready        │
│ • Uploaded files    │       │  Objects: 1,247       │
│                     │       │  Last indexed: 2h ago │
└─────────────────────┘       └───────────────────────┘
                                          │
                                          ▼
                               ┌────────────────┐
                               │  Vector Store  │
                               │  (Embeddings)  │
                               └────────────────┘
```
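The extract-chunk-embed pipeline runs inside Gaia and isn't exposed by the API, but the chunking step is worth visualizing. Here's a minimal sketch of fixed-size chunking with overlap, a common approach in RAG indexers — the sizes and the `chunk_text` helper are illustrative, not Gaia's actual parameters:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of roughly chunk_size characters.

    Overlap keeps sentences that straddle a boundary searchable from both
    neighboring chunks. Values here are illustrative, not Gaia's internals.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

pieces = chunk_text("a" * 1200, chunk_size=500, overlap=50)
print(len(pieces))  # 3 chunks cover 1200 characters with 50-char overlaps
```

Each chunk is then embedded into a vector and written to the vector store, which is what makes semantic search possible at query time.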
Dataset Lifecycle¶
Every dataset goes through the following stages:
1. Create¶
A new, empty dataset is created with a name and optional description.
2. Configure Sources¶
Data sources are attached to the dataset — these define what data Gaia should index.
3. Index¶
Gaia processes the data sources: extracts text from documents, splits content into chunks, generates vector embeddings, and populates the vector store. This can take minutes to hours depending on data volume.
4. Query¶
Once indexing completes, the dataset enters the ready state and can be queried via /ask, /ask/stream, or /ask/exhaustive.
Re-index¶
When source data changes (new backups, updated files), you can trigger re-indexing to keep the dataset current.
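The five stages above imply a simple state progression. The sketch below is inferred from this page, not an official API contract — in particular, "created" stands in for a dataset whose sources are configured but not yet indexed:

```python
# Transitions implied by the lifecycle above (illustration only, not an
# official API contract).
TRANSITIONS = {
    "created": {"indexing"},         # step 3: trigger indexing
    "indexing": {"ready", "error"},  # indexing finishes or fails
    "ready": {"indexing"},           # step 5: re-index on source changes
    "error": {"indexing"},           # fix the configuration, then retry
}

def can_transition(current: str, target: str) -> bool:
    """Return True if the lifecycle allows moving from current to target."""
    return target in TRANSITIONS.get(current, set())

print(can_transition("indexing", "ready"))  # True
print(can_transition("ready", "error"))     # False: errors arise during indexing
```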
Dataset States¶
| State | Description |
|---|---|
| ready | Indexing is complete. The dataset is queryable. |
| indexing | Data is currently being indexed. Partial queries may work. |
| error | Indexing failed. Check the error details and data source configuration. |
Querying during indexing
You can query a dataset while it's in the indexing state. Gaia will search whatever has been indexed so far. Results may be incomplete until indexing finishes.
Types of Data Sources¶
Gaia can index data from any source that Cohesity manages:
| Source Type | Examples |
|---|---|
| Backup Snapshots | VM backups, NAS backups, database backups — query data from any point-in-time snapshot |
| File Shares | SMB/NFS shares protected by Cohesity |
| Microsoft 365 | Exchange mailboxes, OneDrive files, SharePoint sites |
| Databases | SQL Server, Oracle, and other databases backed up by Cohesity |
| Uploaded Files | Ad-hoc documents uploaded directly via the Gaia upload API |
Data stays in Cohesity
Gaia doesn't copy your data to an external service. Indexing and vector storage happen within the Cohesity platform, preserving your existing security perimeter.
API Operations¶
List All Datasets¶
Retrieve a list of all available datasets.
```python
async with GaiaClient.from_env() as gaia:
    datasets = await gaia.list_datasets()
    for ds in datasets:
        print(f"{ds.name:30s} status={ds.status} objects={ds.object_count}")
```
Response fields on each dataset:
| Field | Type | Description |
|---|---|---|
| name | str | Unique dataset name |
| status | str | Current state: ready, indexing, or error |
| description | str? | Optional description |
| objectCount | int? | Number of indexed objects |
| createdAt | str? | ISO 8601 creation timestamp |
| updatedAt | str? | ISO 8601 last-updated timestamp |
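As a quick sanity check on those fields, here's a sketch that filters a raw listing (represented as plain dicts using the field names above; the sample records are made up) down to the datasets that are safe to query:

```python
# Sample records shaped like the response fields above (values are made up).
datasets = [
    {"name": "financial-reports", "status": "ready", "objectCount": 1247},
    {"name": "incident-logs-2025", "status": "indexing", "objectCount": 312},
    {"name": "old-archive", "status": "error", "objectCount": None},
]

# Only datasets in the ready state are guaranteed to return complete results.
queryable = [ds["name"] for ds in datasets if ds["status"] == "ready"]
print(queryable)  # ['financial-reports']
```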
Get Dataset Details¶
Retrieve detailed information about a specific dataset, including its data sources and indexing statistics.
```python
async with GaiaClient.from_env() as gaia:
    details = await gaia.get_dataset("financial-reports")
    print(f"Name: {details.name}")
    print(f"Status: {details.status}")
    print(f"Objects: {details.object_count}")

    if details.data_sources:
        print("\nData Sources:")
        for src in details.data_sources:
            print(f"  • {src.get('type', 'unknown')}: {src.get('name', 'N/A')}")

    if details.indexing_stats:
        stats = details.indexing_stats
        print("\nIndexing Stats:")
        print(f"  Total docs:   {stats.get('totalDocuments', 'N/A')}")
        print(f"  Indexed docs: {stats.get('indexedDocuments', 'N/A')}")
        print(f"  Failed docs:  {stats.get('failedDocuments', 'N/A')}")
```
Create a Dataset¶
Create a new dataset programmatically.
```python
async with GaiaClient.from_env() as gaia:
    result = await gaia.create_dataset(
        name="incident-logs-2025",
        description="Server and application logs from 2025 backups",
    )
    print(f"Created dataset: {result}")
```
Dataset creation is just step one
Creating a dataset via the API gives you an empty dataset. Data sources are typically configured through the Cohesity UI, though some configurations can also be set via the API. After configuring sources, you need to trigger indexing.
Trigger Indexing¶
Start or restart the indexing process for a dataset.
```python
async with GaiaClient.from_env() as gaia:
    result = await gaia.trigger_indexing("incident-logs-2025")
    print(f"Indexing triggered: {result}")
```
Delete a Dataset¶
Remove a dataset and its indexed data.
```python
async with GaiaClient.from_env() as gaia:
    result = await gaia.delete_dataset("old-dataset")
    print(f"Deleted: {result}")
```
Deletion is permanent
Deleting a dataset removes all indexed data (embeddings, chunks, metadata) from the vector store. The original source data in Cohesity is not affected.
Monitoring Indexing Progress¶
For large datasets, indexing can take a significant amount of time. Here's how to monitor progress:
```python
import asyncio

from gaia_sdk import GaiaClient


async def wait_for_indexing(dataset_name: str, poll_interval: int = 30):
    """Poll a dataset until indexing is complete."""
    async with GaiaClient.from_env() as gaia:
        while True:
            details = await gaia.get_dataset(dataset_name)

            if details.status == "ready":
                print(f"\n✓ Dataset '{dataset_name}' is ready!")
                if details.indexing_stats:
                    stats = details.indexing_stats
                    print(f"  Total documents:   {stats.get('totalDocuments', 'N/A')}")
                    print(f"  Indexed documents: {stats.get('indexedDocuments', 'N/A')}")
                return details

            if details.status == "error":
                print(f"\n✗ Dataset '{dataset_name}' encountered an error.")
                return details

            # Still indexing -- report progress if stats are available
            progress = ""
            if details.indexing_stats:
                stats = details.indexing_stats
                total = stats.get("totalDocuments", 0)
                indexed = stats.get("indexedDocuments", 0)
                if total > 0:
                    pct = (indexed / total) * 100
                    progress = f" ({indexed}/{total} = {pct:.0f}%)"

            print(f"  ⟳ Indexing{progress}... checking again in {poll_interval}s")
            await asyncio.sleep(poll_interval)


asyncio.run(wait_for_indexing("incident-logs-2025"))
```
Best Practices¶
Naming Conventions¶
Use clear, descriptive, lowercase names with hyphens:
| Good | Bad |
|---|---|
| financial-reports-2025 | FinReports |
| it-runbooks-prod | dataset1 |
| hr-policies-global | HR stuff |
| m365-ceo-mailbox | mailbox |
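The convention can be checked mechanically. Here's a small helper encoding the rule above (lowercase alphanumeric words joined by single hyphens) — note that this regex mirrors the style guide on this page, not a name restriction enforced by the API:

```python
import re

# Lowercase alphanumeric words separated by single hyphens, per the naming
# convention above. A style check, not a server-side validation rule.
NAME_RE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")

def is_good_dataset_name(name: str) -> bool:
    """Return True if the name follows the lowercase-hyphenated convention."""
    return bool(NAME_RE.match(name))

for name in ["financial-reports-2025", "FinReports", "HR stuff", "m365-ceo-mailbox"]:
    print(f"{name!r}: {is_good_dataset_name(name)}")
```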
Dataset Organization¶
One dataset per domain or use case
Organize datasets by business domain or use case rather than by technical source. A dataset called incident-investigation that pulls from server logs, ticketing systems, and runbooks is more useful than three separate datasets you'd need to query individually.
- Group related data. Combine data sources that are commonly queried together into a single dataset.
- Separate by access control. If different user groups should access different data, use separate datasets and control access at the session/application level.
- Create temporal datasets. For compliance or forensics, create datasets scoped to specific time ranges (e.g., logs-q1-2025).
Refresh Schedules¶
- Static data (historical archives, compliance snapshots) — Index once, no refresh needed.
- Slowly changing data (policies, runbooks, documentation) — Re-index weekly or on change.
- Frequently changing data (logs, emails, tickets) — Re-index daily or trigger re-indexing when new backups complete.
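The cadences above can be encoded as a simple lookup when you automate re-indexing. The categories and intervals below are suggestions from this guide, not anything enforced by Gaia:

```python
# Map the volatility categories above to an illustrative re-index cadence
# in seconds. These intervals are suggestions, not enforced by Gaia.
REFRESH_INTERVALS = {
    "static": None,         # index once, never refresh
    "slow": 7 * 24 * 3600,  # weekly: policies, runbooks, documentation
    "fast": 24 * 3600,      # daily: logs, emails, tickets
}

def refresh_due(category: str, seconds_since_last_index: float) -> bool:
    """Return True if the dataset should be re-indexed now."""
    interval = REFRESH_INTERVALS[category]
    return interval is not None and seconds_since_last_index >= interval

print(refresh_due("fast", 90_000))  # True: last indexed more than a day ago
print(refresh_due("static", 1e9))   # False: static data never refreshes
```

A scheduler can pair this check with the trigger_indexing call shown earlier.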
Querying Multiple Datasets¶
You can query across multiple datasets in a single /ask call by passing multiple names:
```python
async with GaiaClient.from_env() as gaia:
    response = await gaia.ask(
        dataset_names=["it-runbooks", "incident-logs-2025"],
        query="How do we handle a database failover?",
    )
```
Gaia merges results from all specified datasets and uses the combined context to generate the answer. This is powerful for cross-domain queries like "What was the root cause of the March 5 outage?" that might span infrastructure logs, runbooks, and incident tickets.
Performance consideration
Querying multiple datasets is slightly slower than querying a single dataset because Gaia searches each dataset's vector store independently and then merges the results. For latency-sensitive applications, prefer a single well-organized dataset.
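Gaia's merge strategy is internal, but the idea can be sketched: pool the retrieved hits from each dataset and keep a global top-k by relevance score. The `merge_results` helper and the sample scores below are illustrative only:

```python
# One plausible merge approach (Gaia's actual strategy is internal):
# pool (score, chunk) hits from every dataset, keep the global top-k.
def merge_results(per_dataset: dict[str, list[tuple[float, str]]], k: int = 3) -> list[str]:
    all_hits = [hit for hits in per_dataset.values() for hit in hits]
    all_hits.sort(key=lambda hit: hit[0], reverse=True)  # highest relevance first
    return [chunk for _, chunk in all_hits[:k]]

# Hypothetical per-dataset retrieval results (scores and chunks are made up).
results = {
    "it-runbooks": [(0.91, "failover runbook"), (0.40, "backup policy")],
    "incident-logs-2025": [(0.87, "db failover log"), (0.55, "disk alert")],
}
print(merge_results(results))  # ['failover runbook', 'db failover log', 'disk alert']
```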
Next Steps¶
Now that you understand datasets, you're ready to build:
- Datasets & Indexing — Deep dive into creating, configuring, and indexing datasets.
- Querying & RAG — Learn how to query your datasets with the /ask endpoint.
- Backend with FastAPI — Wrap these operations in a proper backend.