Datasets & Indexing¶
Datasets are the foundational building block of every Gaia application. A dataset is a named collection of data sources — file shares, NAS volumes, object stores, or uploaded documents — whose contents are chunked, embedded, and stored in a vector index so Gaia can retrieve them at query time.
This page walks through the full lifecycle: create → configure → index → monitor → query.
Key Concepts¶
| Concept | Description |
|---|---|
| Dataset | A named container that groups one or more data sources under a single queryable index. |
| Data source | A reference to an external location (e.g., a Cohesity View, NFS export, S3 bucket) whose files are ingested. |
| Indexing | The background process that reads source files, splits them into chunks, generates vector embeddings, and writes them to the search index. |
| Indexing window | An optional schedule that limits when indexing runs (e.g., off-peak hours only). |
| Incremental update | Re-indexing that processes only new or changed files since the last run. |
Listing Datasets¶
Before creating anything, check what already exists.
Filter by prefix
`list_datasets(prefix="finance-")` returns only datasets whose names start with the given string — useful when you partition datasets by team or domain.
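The prefix filter simply matches the start of each dataset name. As a sketch of the semantics only (the sample names here are hypothetical, and the real filtering happens server-side):

```python
def filter_by_prefix(names: list[str], prefix: str) -> list[str]:
    """Client-side equivalent of list_datasets(prefix=...): keep names starting with prefix."""
    return [n for n in names if n.startswith(prefix)]

# Example: datasets partitioned by team
names = ["finance-q1", "finance-q2", "engineering-docs", "hr-policies"]
print(filter_by_prefix(names, "finance-"))  # ['finance-q1', 'finance-q2']
```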
Creating a Dataset¶
A new dataset needs at minimum a name and one or more data sources.
```python
async with GaiaClient(api_key="YOUR_API_KEY") as gaia:
    result = await gaia.create_dataset(
        name="engineering-docs",
        description="Internal engineering wiki and design documents",
        dataSources=[
            {
                "sourceType": "CohesityView",
                "viewName": "eng-wiki",
                "includePaths": ["/design-docs", "/runbooks"],
                "excludePaths": ["/archive"],
            }
        ],
    )
    print(result)
```
```bash
curl -s -X POST \
  -H "apiKey: $GAIA_API_KEY" \
  -H "Content-Type: application/json" \
  "https://helios.cohesity.com/v2/mcm/gaia/datasets" \
  -d '{
    "name": "engineering-docs",
    "description": "Internal engineering wiki and design documents",
    "dataSources": [
      {
        "sourceType": "CohesityView",
        "viewName": "eng-wiki",
        "includePaths": ["/design-docs", "/runbooks"],
        "excludePaths": ["/archive"]
      }
    ]
  }'
```
Data Source Options¶
| Field | Type | Description |
|---|---|---|
| `sourceType` | string | `"CohesityView"`, `"NFS"`, `"SMB"`, `"S3"`, etc. |
| `viewName` | string | Name of the Cohesity View (for the `CohesityView` type). |
| `includePaths` | string[] | Paths within the source to include (default: all). |
| `excludePaths` | string[] | Paths to exclude from indexing. |
| `fileFilters` | object | Fine-grained filters by extension, size, or modified date. |
Configuring Indexing¶
Indexing configuration controls what gets indexed and when.
File Filters¶
Restrict which files are ingested based on extension or size:
```python
await gaia.create_dataset(
    name="legal-contracts",
    dataSources=[
        {
            "sourceType": "CohesityView",
            "viewName": "legal-share",
            "fileFilters": {
                "includeExtensions": [".pdf", ".docx"],
                "excludeExtensions": [".tmp", ".bak"],
                "maxFileSizeMB": 100,
            },
        }
    ],
)
```
Indexing Window¶
An indexing window limits processing to specific hours so production workloads remain unaffected:
```json
{
  "indexingWindow": {
    "enabled": true,
    "startHour": 22,
    "endHour": 6,
    "timezone": "America/Los_Angeles"
  }
}
```
Clock skew
Ensure the cluster's NTP configuration is accurate — a clock drift of more than a few minutes can cause indexing to start or stop outside the expected window.
Triggering Indexing¶
After creating a dataset, you must trigger indexing to start the initial ingestion. Subsequent re-index runs pick up only new and modified files (incremental updates).
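The SDK exposes this as `trigger_indexing` (used in the complete example further down); over REST it is a POST to the dataset's `index` endpoint, as in the full re-index example later on this page. A minimal sketch that builds, but does not send, such a request with the standard library — the endpoint path is assumed from the re-index example and may differ in your deployment:

```python
import json
import urllib.request

BASE_URL = "https://helios.cohesity.com/v2/mcm/gaia"  # base URL used throughout this page

def build_index_request(dataset: str, api_key: str, full_reindex: bool = False):
    """Construct (but do not send) a POST request that triggers indexing for `dataset`."""
    body = json.dumps({"fullReindex": full_reindex}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/dataset/{dataset}/index",
        data=body,
        headers={"apiKey": api_key, "Content-Type": "application/json"},
        method="POST",
    )

req = build_index_request("engineering-docs", "YOUR_API_KEY")
print(req.get_method(), req.full_url)
```

Send the request with `urllib.request.urlopen(req)` (or your preferred HTTP client) once the API key is real.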
Monitoring Indexing Progress¶
Poll `GET /dataset/{name}/details` and inspect the `indexingStats` object to track progress.
```python
async with GaiaClient(api_key="YOUR_API_KEY") as gaia:
    details = await gaia.get_dataset("engineering-docs")

    print(f"Status : {details.status}")
    print(f"Objects: {details.object_count}")

    if details.indexing_stats:
        stats = details.indexing_stats
        print(f"  Total files   : {stats.get('totalFiles', 'N/A')}")
        print(f"  Indexed files : {stats.get('indexedFiles', 'N/A')}")
        print(f"  Failed files  : {stats.get('failedFiles', 'N/A')}")
        print(f"  Index size    : {stats.get('indexSizeBytes', 'N/A')} bytes")
```
indexingStats Reference¶
| Field | Type | Description |
|---|---|---|
| `totalFiles` | int | Total number of files discovered in the data source. |
| `indexedFiles` | int | Files successfully indexed so far. |
| `failedFiles` | int | Files that failed during processing. |
| `indexSizeBytes` | int | Size of the vector index on disk. |
| `lastIndexedAt` | string | ISO-8601 timestamp of the last completed indexing run. |
| `status` | string | `"Running"`, `"Completed"`, `"Failed"`, etc. |
Polling for Indexing Completion¶
For automation workflows you'll want to poll until indexing finishes:
```python
import asyncio

from gaia_sdk import GaiaClient

async def wait_for_indexing(gaia: GaiaClient, name: str, poll_seconds: int = 10):
    """Poll until dataset indexing reaches a terminal state."""
    while True:
        details = await gaia.get_dataset(name)
        stats = details.indexing_stats or {}
        status = stats.get("status", "Unknown")
        indexed = stats.get("indexedFiles", 0)
        total = stats.get("totalFiles", 0)
        print(f"[{name}] {status} — {indexed}/{total} files")
        if status in ("Completed", "Failed"):
            return details
        await asyncio.sleep(poll_seconds)
```
Re-indexing and Incremental Updates¶
Gaia supports incremental indexing by default. When you trigger indexing on an already-indexed dataset:
- New files are chunked, embedded, and added to the index.
- Modified files (detected via checksum or last-modified timestamp) are re-processed, and stale chunks are replaced.
- Deleted files have their chunks removed from the index.
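Conceptually, the change detection behaves like a diff over per-file checksums. The sketch below illustrates that logic only — it is not Gaia's actual implementation — using plain dicts mapping path to checksum:

```python
def diff_snapshot(previous: dict, current: dict):
    """Classify files as added / modified / deleted by comparing checksum maps."""
    added = [p for p in current if p not in previous]
    modified = [p for p in current if p in previous and current[p] != previous[p]]
    deleted = [p for p in previous if p not in current]
    return added, modified, deleted

prev = {"/a.pdf": "c1", "/b.docx": "c2", "/c.txt": "c3"}
curr = {"/a.pdf": "c1", "/b.docx": "c9", "/d.md": "c4"}
print(diff_snapshot(prev, curr))  # (['/d.md'], ['/b.docx'], ['/c.txt'])
```

Added and modified files get re-chunked and re-embedded; deleted files have their chunks dropped from the index.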
To force a full re-index (e.g., after changing chunking settings), pass the appropriate flag:
```bash
curl -s -X POST \
  -H "apiKey: $GAIA_API_KEY" \
  -H "Content-Type: application/json" \
  "https://helios.cohesity.com/v2/mcm/gaia/dataset/engineering-docs/index" \
  -d '{"fullReindex": true}'
```
Full re-index duration
A full re-index processes every file regardless of change status. For large datasets this can take significantly longer — schedule it during your indexing window when possible.
Dataset Discovery¶
Once a dataset is indexed, the discovery endpoint provides a hierarchical view of all ingested content — useful for building file browsers or verifying that the right documents were picked up.
GET /dataset/{id}/discovery¶
```python
async with GaiaClient(api_key="YOUR_API_KEY") as gaia:
    tree = await gaia.get_discovery("engineering-docs")

    def print_tree(node, indent=0):
        prefix = "  " * indent
        print(f"{prefix}{node.get('name', '/')} ({node.get('type', 'dir')})")
        for child in node.get("children", []):
            print_tree(child, indent + 1)

    print_tree(tree)
```
The response is a recursive tree structure:
```json
{
  "name": "/",
  "type": "directory",
  "children": [
    {
      "name": "design-docs",
      "type": "directory",
      "children": [
        { "name": "architecture.pdf", "type": "file", "sizeBytes": 204800 },
        { "name": "api-spec.md", "type": "file", "sizeBytes": 15360 }
      ]
    }
  ]
}
```
Complete Example: Create → Index → Poll → Query¶
Putting it all together — a single script that creates a dataset, triggers indexing, waits for completion, and runs a test query.
```python
import asyncio

from gaia_sdk import GaiaClient

DATASET_NAME = "quarterly-reports"

async def main():
    async with GaiaClient(api_key="YOUR_API_KEY") as gaia:
        # 1. Create the dataset
        print("Creating dataset …")
        await gaia.create_dataset(
            name=DATASET_NAME,
            description="FY25 quarterly earnings reports",
            dataSources=[
                {
                    "sourceType": "CohesityView",
                    "viewName": "finance-share",
                    "includePaths": ["/reports/FY25"],
                    "fileFilters": {
                        "includeExtensions": [".pdf", ".xlsx"],
                    },
                }
            ],
        )

        # 2. Trigger indexing
        print("Triggering indexing …")
        await gaia.trigger_indexing(DATASET_NAME)

        # 3. Poll until complete
        while True:
            details = await gaia.get_dataset(DATASET_NAME)
            stats = details.indexing_stats or {}
            status = stats.get("status", "Unknown")
            indexed = stats.get("indexedFiles", 0)
            total = stats.get("totalFiles", 0)
            print(f"  Indexing: {status} — {indexed}/{total} files")
            if status in ("Completed", "Failed"):
                break
            await asyncio.sleep(10)

        if status == "Failed":
            print("Indexing failed. Check the cluster logs.")
            return

        # 4. Run a test query
        print("\nRunning test query …")
        response = await gaia.ask(
            dataset_names=[DATASET_NAME],
            query="What was the Q3 revenue?",
        )
        print(f"Answer: {response.response_string}")
        if response.documents:
            print(f"\nSource documents ({len(response.documents)}):")
            for doc in response.documents:
                print(f"  • {doc.filename} (score: {doc.score:.3f})")

asyncio.run(main())
```
Best Practices¶
Naming conventions
Use lowercase, hyphenated names like `engineering-docs` or `finance-q3-2025`. Consistent naming makes filtering with `list_datasets(prefix=...)` straightforward.
Small, focused datasets
Prefer many small datasets over one monolithic one. Smaller datasets index faster, return more relevant results, and are easier to manage.
Include/Exclude paths
Always scope your data sources with `includePaths` and `excludePaths` to avoid ingesting temporary files, build artifacts, or irrelevant directories.
Monitor after every indexing run
Always check `failedFiles` after indexing completes. A non-zero count may indicate unsupported formats, permission issues, or corrupt files.
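That post-run check can be sketched as a small helper over an indexingStats dict as described in the reference table above (the warning wording here is illustrative):

```python
def check_indexing_health(stats: dict) -> list[str]:
    """Return human-readable warnings derived from an indexingStats payload."""
    warnings = []
    failed = stats.get("failedFiles", 0)
    total = stats.get("totalFiles", 0)
    if failed:
        warnings.append(f"{failed} of {total} files failed to index")
    if stats.get("status") == "Failed":
        warnings.append("indexing run ended in Failed state")
    return warnings

stats = {"totalFiles": 120, "indexedFiles": 117, "failedFiles": 3, "status": "Completed"}
print(check_indexing_health(stats))  # ['3 of 120 files failed to index']
```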
Platform Limits & Considerations¶
Official Gaia Limits
These limits are enforced by the Gaia platform. Plan your dataset strategy around them.
| Limit | Value |
|---|---|
| Max datasets per user account | 50 |
| Max objects per dataset | 1,000 |
| Max extracted text per dataset | 20 GB (status becomes "Warning" when exceeded) |
| Max file size | 100 MB |
| Dataset scope | Single cluster only (cannot span multiple clusters) |
| Snapshot types | Backup snapshots only (replication and archival not supported) |
Do Not Delete Indexed Snapshots
If a backup snapshot is indexed and utilized by a dataset, do not delete the snapshot from the Cohesity cluster. Deleting it will prevent Gaia from answering queries for that dataset.
Supported File Types¶
Gaia can index the following file formats:
| Category | Extensions |
|---|---|
| Microsoft Office | .doc, .docx, .xls, .xlsx, .ppt, .pptx |
| PDF & Document | .pdf, .odf, .rtf |
| Text & Web | .txt, .html, .xml |
Indexing Modes¶
When creating a dataset, you choose an indexing mode that determines how and when data is processed:
Continuous Indexing¶
Gaia automatically checks for new snapshots every hour. When a new snapshot is available, indexing begins automatically.
- The dataset always reflects the latest data
- If a file is removed in a newer snapshot, it's no longer available in the dataset
- Topic Explorer topics refresh every 7 days
- Requires Cohesity cluster version 7.2.2_u1 or later
```python
await gaia.create_dataset(
    name="live-reports",
    indexing_mode="continuous",
    dataSources=[{"sourceType": "CohesityView", "viewName": "reports"}],
)
```
One-Time Indexing¶
Index a specific point-in-time snapshot:
- Most recent snapshot — Indexes the latest available backup
- Custom date range — Indexes data within a specific time window (currently available for Microsoft 365 object types only)
On-Demand Indexing¶
For datasets created with Continuous Indexing, you can trigger an immediate index refresh instead of waiting for the hourly snapshot check, using the same indexing trigger described in Triggering Indexing (for example, `await gaia.trigger_indexing(name)`).
Inclusion & Exclusion Rules¶
When creating datasets, you can filter what gets indexed using file types and directory paths.
File Type Filters¶
Include or exclude specific file extensions. By default all supported types are included.
Directory Path Rules¶
Use regex patterns to include or exclude specific directory paths:
| Pattern | Effect |
|---|---|
| `/MyFiles/Payroll/.*` | Index all files in `/MyFiles/Payroll/` and subdirectories |
| `/myFiles/Payroll.*` | Index directories starting with "Payroll" under `/myFiles/` |
| `.*logs.*` | Match any path containing "logs" |
Rule Precedence
Exclusion takes precedence over inclusion. If a file matches both an include and exclude rule, it is excluded.
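The precedence rule can be sketched with Python's `re` module; this illustrates the semantics only, not the platform's actual matcher:

```python
import re

def is_indexed(path: str, includes: list[str], excludes: list[str]) -> bool:
    """Exclusion wins: a path matching any exclude pattern is skipped,
    even if it also matches an include pattern."""
    if any(re.fullmatch(p, path) for p in excludes):
        return False
    if not includes:  # no include rules: everything not excluded is in scope
        return True
    return any(re.fullmatch(p, path) for p in includes)

print(is_indexed("/MyFiles/Payroll/2025.xlsx", [r"/MyFiles/Payroll/.*"], []))               # True
print(is_indexed("/MyFiles/Payroll/logs/x.txt", [r"/MyFiles/Payroll/.*"], [r".*logs.*"]))   # False
```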
M365 Mailbox Restriction
For Microsoft 365 Mailbox objects, you cannot use file type or directory filters.
What's Next¶
With datasets created and indexed, you're ready to start querying.
- Querying & RAG — learn how Gaia's retrieval-augmented generation pipeline works.
- Topic Explorer — explore themes and suggested questions from your indexed data.
- Document Upload — add ad-hoc files to a dataset via the upload API.
- Dataset Discovery — explore the indexed content hierarchy in detail.