
Datasets & Indexing

Datasets are the foundational building block of every Gaia application. A dataset is a named collection of data sources — file shares, NAS volumes, object stores, or uploaded documents — whose contents are chunked, embedded, and stored in a vector index so Gaia can retrieve them at query time.

This page walks through the full lifecycle: create → configure → index → monitor → query.


Key Concepts

| Concept | Description |
| --- | --- |
| Dataset | A named container that groups one or more data sources under a single queryable index. |
| Data source | A reference to an external location (e.g., a Cohesity View, NFS export, S3 bucket) whose files are ingested. |
| Indexing | The background process that reads source files, splits them into chunks, generates vector embeddings, and writes them to the search index. |
| Indexing window | An optional schedule that limits when indexing runs (e.g., off-peak hours only). |
| Incremental update | Re-indexing that processes only new or changed files since the last run. |

Listing Datasets

Before creating anything, check what already exists.

Python
import asyncio
from gaia_sdk import GaiaClient

async def main():
    async with GaiaClient(api_key="YOUR_API_KEY") as gaia:
        datasets = await gaia.list_datasets()
        for ds in datasets:
            print(f"{ds.name}  status={ds.status}  objects={ds.object_count}")

asyncio.run(main())
Bash
curl -s \
  -H "apiKey: $GAIA_API_KEY" \
  "https://helios.cohesity.com/v2/mcm/gaia/datasets" | python -m json.tool

Filter by prefix

list_datasets(prefix="finance-") returns only datasets whose names start with the given string — useful when you partition datasets by team or domain.
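If you need the same behavior client-side (for example, on an SDK version without server-side prefix support), it reduces to a simple string filter. A minimal sketch, with plain strings standing in for the dataset objects returned by the SDK:

```python
def filter_by_prefix(names, prefix):
    """Keep only names that start with the given prefix."""
    return [n for n in names if n.startswith(prefix)]

datasets = ["finance-q1", "finance-q2", "engineering-docs", "legal-contracts"]
print(filter_by_prefix(datasets, "finance-"))  # ['finance-q1', 'finance-q2']
```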


Creating a Dataset

A new dataset needs at minimum a name and one or more data sources.

Python
async with GaiaClient(api_key="YOUR_API_KEY") as gaia:
    result = await gaia.create_dataset(
        name="engineering-docs",
        description="Internal engineering wiki and design documents",
        dataSources=[
            {
                "sourceType": "CohesityView",
                "viewName": "eng-wiki",
                "includePaths": ["/design-docs", "/runbooks"],
                "excludePaths": ["/archive"],
            }
        ],
    )
    print(result)
Bash
curl -s -X POST \
  -H "apiKey: $GAIA_API_KEY" \
  -H "Content-Type: application/json" \
  "https://helios.cohesity.com/v2/mcm/gaia/datasets" \
  -d '{
    "name": "engineering-docs",
    "description": "Internal engineering wiki and design documents",
    "dataSources": [
      {
        "sourceType": "CohesityView",
        "viewName": "eng-wiki",
        "includePaths": ["/design-docs", "/runbooks"],
        "excludePaths": ["/archive"]
      }
    ]
  }'

Data Source Options

| Field | Type | Description |
| --- | --- | --- |
| sourceType | string | "CohesityView", "NFS", "SMB", "S3", etc. |
| viewName | string | Name of the Cohesity View (for the CohesityView type). |
| includePaths | string[] | Paths within the source to include (default: all). |
| excludePaths | string[] | Paths to exclude from indexing. |
| fileFilters | object | Fine-grained filters by extension, size, or modified date. |

Configuring Indexing

Indexing configuration controls what gets indexed and when.

File Filters

Restrict which files are ingested based on extension or size:

Python
await gaia.create_dataset(
    name="legal-contracts",
    dataSources=[
        {
            "sourceType": "CohesityView",
            "viewName": "legal-share",
            "fileFilters": {
                "includeExtensions": [".pdf", ".docx"],
                "excludeExtensions": [".tmp", ".bak"],
                "maxFileSizeMB": 100,
            },
        }
    ],
)

Indexing Window

An indexing window limits processing to specific hours so production workloads remain unaffected:

JSON
{
  "indexingWindow": {
    "enabled": true,
    "startHour": 22,
    "endHour": 6,
    "timezone": "America/Los_Angeles"
  }
}
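Note that the example window spans midnight (22:00 → 06:00). The wrap-around check can be sketched as follows — field names come from the JSON above, but the evaluation logic shown here is an assumption, not Gaia's documented implementation:

```python
def in_indexing_window(hour: int, start_hour: int, end_hour: int) -> bool:
    """Return True if `hour` (0-23) falls inside the window.
    A window whose end is earlier than its start wraps past midnight."""
    if start_hour <= end_hour:
        return start_hour <= hour < end_hour
    return hour >= start_hour or hour < end_hour

# The 22:00-06:00 window from the example:
print(in_indexing_window(23, 22, 6))  # True
print(in_indexing_window(3, 22, 6))   # True
print(in_indexing_window(12, 22, 6))  # False
```

Remember to evaluate the current hour in the window's configured timezone, not the caller's local time.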

Clock skew

Ensure the cluster's NTP configuration is accurate — a clock drift of more than a few minutes can cause indexing to start or stop outside the expected window.


Triggering Indexing

After creating a dataset you must trigger indexing to start the initial ingestion. Subsequent re-index runs pick up only new and modified files (incremental updates).

Python
async with GaiaClient(api_key="YOUR_API_KEY") as gaia:
    await gaia.trigger_indexing("engineering-docs")
    print("Indexing triggered.")
Bash
curl -s -X POST \
  -H "apiKey: $GAIA_API_KEY" \
  "https://helios.cohesity.com/v2/mcm/gaia/dataset/engineering-docs/index"

Monitoring Indexing Progress

Poll GET /dataset/{name}/details and inspect the indexingStats object to track progress.

Python
async with GaiaClient(api_key="YOUR_API_KEY") as gaia:
    details = await gaia.get_dataset("engineering-docs")

    print(f"Status : {details.status}")
    print(f"Objects: {details.object_count}")

    if details.indexing_stats:
        stats = details.indexing_stats
        print(f"  Total files    : {stats.get('totalFiles', 'N/A')}")
        print(f"  Indexed files  : {stats.get('indexedFiles', 'N/A')}")
        print(f"  Failed files   : {stats.get('failedFiles', 'N/A')}")
        print(f"  Index size     : {stats.get('indexSizeBytes', 'N/A')} bytes")
Bash
curl -s \
  -H "apiKey: $GAIA_API_KEY" \
  "https://helios.cohesity.com/v2/mcm/gaia/dataset/engineering-docs/details" \
  | python -m json.tool

indexingStats Reference

| Field | Type | Description |
| --- | --- | --- |
| totalFiles | int | Total number of files discovered in the data source. |
| indexedFiles | int | Files successfully indexed so far. |
| failedFiles | int | Files that failed during processing. |
| indexSizeBytes | int | Size of the vector index on disk. |
| lastIndexedAt | string | ISO-8601 timestamp of the last completed indexing run. |
| status | string | "Running", "Completed", "Failed", etc. |

Polling for Indexing Completion

For automation workflows you'll want to poll until indexing finishes:

Python
import asyncio
from gaia_sdk import GaiaClient

async def wait_for_indexing(gaia: GaiaClient, name: str, poll_seconds: int = 10):
    """Poll until dataset indexing reaches a terminal state."""
    while True:
        details = await gaia.get_dataset(name)
        status = (details.indexing_stats or {}).get("status", "Unknown")
        indexed = (details.indexing_stats or {}).get("indexedFiles", 0)
        total   = (details.indexing_stats or {}).get("totalFiles", 0)

        print(f"[{name}] {status} ({indexed}/{total} files)")

        if status in ("Completed", "Failed"):
            return details

        await asyncio.sleep(poll_seconds)

Re-indexing and Incremental Updates

Gaia supports incremental indexing by default. When you trigger indexing on an already-indexed dataset:

  1. New files are chunked, embedded, and added to the index.
  2. Modified files (detected via checksum or last-modified timestamp) are re-processed, and stale chunks are replaced.
  3. Deleted files have their chunks removed from the index.
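The three cases amount to a diff between the file manifest of the previous run and the current one. A simplified illustration using checksums (Gaia's actual change detection, per the steps above, uses checksums or last-modified timestamps internally):

```python
def classify_changes(previous: dict, current: dict):
    """Compare {path: checksum} manifests from two indexing runs.
    Returns (new, modified, deleted) as sets of paths."""
    new = set(current) - set(previous)
    deleted = set(previous) - set(current)
    modified = {p for p in set(previous) & set(current)
                if previous[p] != current[p]}
    return new, modified, deleted

prev = {"/a.pdf": "c1", "/b.docx": "c2", "/c.txt": "c3"}
curr = {"/a.pdf": "c1", "/b.docx": "c9", "/d.md": "c4"}
print(classify_changes(prev, curr))
# ({'/d.md'}, {'/b.docx'}, {'/c.txt'})
```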

To force a full re-index (e.g., after changing chunking settings), pass the appropriate flag:

Bash
curl -s -X POST \
  -H "apiKey: $GAIA_API_KEY" \
  -H "Content-Type: application/json" \
  "https://helios.cohesity.com/v2/mcm/gaia/dataset/engineering-docs/index" \
  -d '{"fullReindex": true}'

Full re-index duration

A full re-index processes every file regardless of change status. For large datasets this can take significantly longer — schedule it during your indexing window when possible.


Dataset Discovery

Once a dataset is indexed, the discovery endpoint provides a hierarchical view of all ingested content — useful for building file browsers or verifying that the right documents were picked up.

GET /dataset/{name}/discovery

Python
async with GaiaClient(api_key="YOUR_API_KEY") as gaia:
    tree = await gaia.get_discovery("engineering-docs")

    def print_tree(node, indent=0):
        prefix = "  " * indent
        print(f"{prefix}{node.get('name', '/')}  ({node.get('type', 'dir')})")
        for child in node.get("children", []):
            print_tree(child, indent + 1)

    print_tree(tree)
Bash
curl -s \
  -H "apiKey: $GAIA_API_KEY" \
  "https://helios.cohesity.com/v2/mcm/gaia/dataset/engineering-docs/discovery" \
  | python -m json.tool

The response is a recursive tree structure:

JSON
{
  "name": "/",
  "type": "directory",
  "children": [
    {
      "name": "design-docs",
      "type": "directory",
      "children": [
        { "name": "architecture.pdf", "type": "file", "sizeBytes": 204800 },
        { "name": "api-spec.md", "type": "file", "sizeBytes": 15360 }
      ]
    }
  ]
}
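Because the response is uniformly recursive, it is also easy to derive summaries from it — for example, flattening the tree into full file paths with sizes. A sketch against the sample response above:

```python
def walk_files(node, parent=""):
    """Yield (path, sizeBytes) for every file node in a discovery tree."""
    path = f"{parent}/{node['name']}".replace("//", "/")
    if node.get("type") == "file":
        yield path, node.get("sizeBytes", 0)
    for child in node.get("children", []):
        yield from walk_files(child, path)

tree = {
    "name": "/", "type": "directory",
    "children": [{
        "name": "design-docs", "type": "directory",
        "children": [
            {"name": "architecture.pdf", "type": "file", "sizeBytes": 204800},
            {"name": "api-spec.md", "type": "file", "sizeBytes": 15360},
        ],
    }],
}
files = dict(walk_files(tree))
print(files)
print("total bytes:", sum(files.values()))  # 220160
```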

Complete Example: Create → Index → Poll → Query

Putting it all together — a single script that creates a dataset, triggers indexing, waits for completion, and runs a test query.

Python
import asyncio
from gaia_sdk import GaiaClient

DATASET_NAME = "quarterly-reports"

async def main():
    async with GaiaClient(api_key="YOUR_API_KEY") as gaia:

        # 1. Create the dataset
        print("Creating dataset …")
        await gaia.create_dataset(
            name=DATASET_NAME,
            description="FY25 quarterly earnings reports",
            dataSources=[
                {
                    "sourceType": "CohesityView",
                    "viewName": "finance-share",
                    "includePaths": ["/reports/FY25"],
                    "fileFilters": {
                        "includeExtensions": [".pdf", ".xlsx"],
                    },
                }
            ],
        )

        # 2. Trigger indexing
        print("Triggering indexing …")
        await gaia.trigger_indexing(DATASET_NAME)

        # 3. Poll until complete
        while True:
            details = await gaia.get_dataset(DATASET_NAME)
            stats = details.indexing_stats or {}
            status = stats.get("status", "Unknown")
            indexed = stats.get("indexedFiles", 0)
            total = stats.get("totalFiles", 0)

            print(f"  Indexing: {status} ({indexed}/{total} files)")
            if status in ("Completed", "Failed"):
                break
            await asyncio.sleep(10)

        if status == "Failed":
            print("Indexing failed. Check the cluster logs.")
            return

        # 4. Run a test query
        print("\nRunning test query …")
        response = await gaia.ask(
            dataset_names=[DATASET_NAME],
            query="What was the Q3 revenue?",
        )
        print(f"Answer: {response.response_string}")

        if response.documents:
            print(f"\nSource documents ({len(response.documents)}):")
            for doc in response.documents:
                print(f"  • {doc.filename} (score: {doc.score:.3f})")

asyncio.run(main())

Best Practices

Naming conventions

Use lowercase, hyphenated names like engineering-docs or finance-q3-2025. Consistent naming makes filtering with list_datasets(prefix=...) straightforward.
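A convention like this is easy to enforce before calling create_dataset. The regex below is our own encoding of the lowercase-hyphenated pattern, not a rule published by Gaia — the platform's actual naming constraints may differ:

```python
import re

# Lowercase alphanumeric segments separated by single hyphens.
NAME_RE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")

def is_valid_name(name: str) -> bool:
    """True for lowercase, hyphenated names like 'engineering-docs'."""
    return bool(NAME_RE.fullmatch(name))

print(is_valid_name("engineering-docs"))  # True
print(is_valid_name("finance-q3-2025"))   # True
print(is_valid_name("Finance_Q3"))        # False
```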

Small, focused datasets

Prefer many small datasets over one monolithic one. Smaller datasets index faster, return more relevant results, and are easier to manage.

Include/Exclude paths

Always scope your data sources with includePaths and excludePaths to avoid ingesting temporary files, build artifacts, or irrelevant directories.

Monitor after every indexing run

Always check failedFiles after indexing completes. A non-zero count may indicate unsupported formats, permission issues, or corrupt files.


Platform Limits & Considerations

Official Gaia Limits

These limits are enforced by the Gaia platform. Plan your dataset strategy around them.

| Limit | Value |
| --- | --- |
| Max datasets per user account | 50 |
| Max objects per dataset | 1,000 |
| Max extracted text per dataset | 20 GB (status becomes "Warning" when exceeded) |
| Max file size | 100 MB |
| Dataset scope | Single cluster only (cannot span multiple clusters) |
| Snapshot types | Backup snapshots only (replication and archival not supported) |
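Because these are hard caps, a quick pre-flight check before ingestion can save a failed or partially indexed run. A sketch using the documented limit values — the {path: size} manifest shape is hypothetical, standing in for whatever file listing you have:

```python
MAX_OBJECTS_PER_DATASET = 1_000
MAX_FILE_SIZE_MB = 100

def preflight(files: dict) -> list:
    """Check a {path: size_in_bytes} manifest against Gaia's documented
    limits. Returns human-readable violations (empty if all checks pass)."""
    problems = []
    if len(files) > MAX_OBJECTS_PER_DATASET:
        problems.append(
            f"{len(files)} objects exceeds the {MAX_OBJECTS_PER_DATASET} limit")
    for path, size in files.items():
        if size > MAX_FILE_SIZE_MB * 1024 * 1024:
            problems.append(f"{path} exceeds {MAX_FILE_SIZE_MB} MB")
    return problems

print(preflight({"/ok.pdf": 5_000_000, "/huge.pdf": 200_000_000}))
# ['/huge.pdf exceeds 100 MB']
```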

Do Not Delete Indexed Snapshots

If a backup snapshot is indexed and utilized by a dataset, do not delete the snapshot from the Cohesity cluster. Deleting it will prevent Gaia from answering queries for that dataset.


Supported File Types

Gaia can index the following file formats:

| Category | Extensions |
| --- | --- |
| Microsoft Office | .doc, .docx, .xls, .xlsx, .ppt, .pptx |
| PDF & Document | .pdf, .odf, .rtf |
| Text & Web | .txt, .html, .xml |

Indexing Modes

When creating a dataset, you choose an indexing mode that determines how and when data is processed:

Continuous Indexing

Gaia automatically checks for new snapshots every hour. When a new snapshot is available, indexing begins automatically.

  • The dataset always reflects the latest data
  • If a file is removed in a newer snapshot, it's no longer available in the dataset
  • Topic Explorer topics refresh every 7 days
  • Requires Cohesity cluster version 7.2.2_u1 or later
Python
await gaia.create_dataset(
    name="live-reports",
    indexing_mode="continuous",
    dataSources=[{"sourceType": "CohesityView", "viewName": "reports"}],
)

One-Time Indexing

Index a specific point-in-time snapshot:

  • Most recent snapshot — Indexes the latest available backup
  • Custom date range — Indexes data within a specific time window (currently available for Microsoft 365 object types only)

On-Demand Indexing

For datasets created with Continuous Indexing, you can trigger an immediate index refresh:

Bash
curl -s -X POST \
  -H "apiKey: $GAIA_API_KEY" \
  "https://helios.cohesity.com/v2/mcm/gaia/dataset/live-reports/index"

Alternatively, in the Gaia UI: navigate to Datasets, hover over the dataset, click the vertical ellipsis, and select Run Now.


Inclusion & Exclusion Rules

When creating datasets, you can filter what gets indexed using file types and directory paths.

File Type Filters

Include or exclude specific file extensions. By default all supported types are included.

Directory Path Rules

Use regex patterns to include or exclude specific directory paths:

| Pattern | Effect |
| --- | --- |
| /MyFiles/Payroll/.* | Index all files in /MyFiles/Payroll/ and subdirectories |
| /myFiles/Payroll.* | Index directories starting with "Payroll" under /myFiles/ |
| .*logs.* | Match any path containing "logs" |

Rule Precedence

Exclusion takes precedence over inclusion. If a file matches both an include and exclude rule, it is excluded.
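The precedence rule can be sketched as a matcher that evaluates exclusions first. The patterns are taken from the table above; the matcher itself is an illustration of the documented "exclusion wins" behavior, not Gaia's internal code:

```python
import re

def should_index(path, include_patterns, exclude_patterns):
    """Exclusion takes precedence: a path matching any exclude rule is
    skipped even if it also matches an include rule."""
    if any(re.fullmatch(p, path) for p in exclude_patterns):
        return False
    return any(re.fullmatch(p, path) for p in include_patterns)

includes = [r"/MyFiles/Payroll/.*"]
excludes = [r".*logs.*"]
print(should_index("/MyFiles/Payroll/2025.xlsx", includes, excludes))    # True
print(should_index("/MyFiles/Payroll/logs/run.txt", includes, excludes)) # False
```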

M365 Mailbox Restriction

For Microsoft 365 Mailbox objects, you cannot use file type or directory filters.


What's Next

With datasets created and indexed, you're ready to start querying.

  • Querying & RAG — learn how Gaia's retrieval-augmented generation pipeline works.
  • Topic Explorer — explore themes and suggested questions from your indexed data.
  • Document Upload — add ad-hoc files to a dataset via the upload API.
  • Dataset Discovery — explore the indexed content hierarchy in detail.