Datasets & Indexing¶
Datasets are the foundational building block of every Gaia application. A dataset is a named collection of data sources — file shares, NAS volumes, object stores, or uploaded documents — whose contents are chunked, embedded, and stored in a vector index so Gaia can retrieve them at query time.
This page walks through the full lifecycle: create → configure → index → monitor → query.
Key Concepts¶
| Concept | Description |
|---|---|
| Dataset | A named container that groups one or more data sources under a single queryable index. |
| Data source | A reference to an external location (e.g., a Cohesity View, NFS export, S3 bucket) whose files are ingested. |
| Indexing | The background process that reads source files, splits them into chunks, generates vector embeddings, and writes them to the search index. |
| Indexing window | An optional schedule that limits when indexing runs (e.g., off-peak hours only). |
| Incremental update | Re-indexing that processes only new or changed files since the last run. |
Listing Datasets¶
Before creating anything, check what already exists.
Filter by prefix
`list_datasets(prefix="finance-")` returns only datasets whose names start with the given string — useful when you partition datasets by team or domain.
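The prefix filter simply matches the start of each dataset name. As a sketch of the semantics only (the sample names here are hypothetical, and the real filtering happens server-side):

```python
def filter_by_prefix(names: list[str], prefix: str) -> list[str]:
    """Client-side equivalent of list_datasets(prefix=...): keep names starting with prefix."""
    return [n for n in names if n.startswith(prefix)]

# Example: datasets partitioned by team
names = ["finance-q1", "finance-q2", "engineering-docs", "hr-policies"]
print(filter_by_prefix(names, "finance-"))  # ['finance-q1', 'finance-q2']
```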
Creating a Dataset¶
A new dataset needs at minimum a name and one or more data sources.
```python
async with GaiaClient(api_key="YOUR_API_KEY") as gaia:
    result = await gaia.create_dataset(
        name="engineering-docs",
        description="Internal engineering wiki and design documents",
        dataSources=[
            {
                "sourceType": "CohesityView",
                "viewName": "eng-wiki",
                "includePaths": ["/design-docs", "/runbooks"],
                "excludePaths": ["/archive"],
            }
        ],
    )
    print(result)
```
```bash
curl -s -X POST \
  -H "apiKey: $GAIA_API_KEY" \
  -H "Content-Type: application/json" \
  "https://helios.cohesity.com/v2/mcm/gaia/datasets" \
  -d '{
    "name": "engineering-docs",
    "description": "Internal engineering wiki and design documents",
    "dataSources": [
      {
        "sourceType": "CohesityView",
        "viewName": "eng-wiki",
        "includePaths": ["/design-docs", "/runbooks"],
        "excludePaths": ["/archive"]
      }
    ]
  }'
```
Data Source Options¶
| Field | Type | Description |
|---|---|---|
| `sourceType` | string | `"CohesityView"`, `"NFS"`, `"SMB"`, `"S3"`, etc. |
| `viewName` | string | Name of the Cohesity View (for the `CohesityView` type). |
| `includePaths` | string[] | Paths within the source to include (default: all). |
| `excludePaths` | string[] | Paths to exclude from indexing. |
| `fileFilters` | object | Fine-grained filters by extension, size, or modified date. |
Configuring Indexing¶
Indexing configuration controls what gets indexed and when.
File Filters¶
Restrict which files are ingested based on extension or size:
```python
await gaia.create_dataset(
    name="legal-contracts",
    dataSources=[
        {
            "sourceType": "CohesityView",
            "viewName": "legal-share",
            "fileFilters": {
                "includeExtensions": [".pdf", ".docx"],
                "excludeExtensions": [".tmp", ".bak"],
                "maxFileSizeMB": 100,
            },
        }
    ],
)
```
Indexing Window¶
An indexing window limits processing to specific hours so production workloads remain unaffected:
```json
{
  "indexingWindow": {
    "enabled": true,
    "startHour": 22,
    "endHour": 6,
    "timezone": "America/Los_Angeles"
  }
}
```
Clock skew
Ensure the cluster's NTP configuration is accurate — a clock drift of more than a few minutes can cause indexing to start or stop outside the expected window.
Triggering Indexing¶
After creating a dataset, you must trigger indexing to start the initial ingestion. Subsequent re-index runs pick up only new and modified files (incremental updates).
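The SDK exposes this as `trigger_indexing` (used in the complete example further down); over REST it is a POST to the dataset's `index` endpoint, as in the full re-index example later on this page. A minimal sketch that builds, but does not send, such a request with the standard library — the endpoint path is assumed from the re-index example and may differ in your deployment:

```python
import json
import urllib.request

BASE_URL = "https://helios.cohesity.com/v2/mcm/gaia"  # base URL used throughout this page

def build_index_request(dataset: str, api_key: str, full_reindex: bool = False):
    """Construct (but do not send) a POST request that triggers indexing for `dataset`."""
    body = json.dumps({"fullReindex": full_reindex}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/dataset/{dataset}/index",
        data=body,
        headers={"apiKey": api_key, "Content-Type": "application/json"},
        method="POST",
    )

req = build_index_request("engineering-docs", "YOUR_API_KEY")
print(req.get_method(), req.full_url)
```

Send the request with `urllib.request.urlopen(req)` (or your preferred HTTP client) once the API key is real.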
Monitoring Indexing Progress¶
Poll `GET /dataset/{name}/details` and inspect the `indexingStats` object to track progress.
```python
async with GaiaClient(api_key="YOUR_API_KEY") as gaia:
    details = await gaia.get_dataset("engineering-docs")

    print(f"Status : {details.status}")
    print(f"Objects: {details.object_count}")

    if details.indexing_stats:
        stats = details.indexing_stats
        print(f"  Total files   : {stats.get('totalFiles', 'N/A')}")
        print(f"  Indexed files : {stats.get('indexedFiles', 'N/A')}")
        print(f"  Failed files  : {stats.get('failedFiles', 'N/A')}")
        print(f"  Index size    : {stats.get('indexSizeBytes', 'N/A')} bytes")
```
indexingStats Reference¶
| Field | Type | Description |
|---|---|---|
| `totalFiles` | int | Total number of files discovered in the data source. |
| `indexedFiles` | int | Files successfully indexed so far. |
| `failedFiles` | int | Files that failed during processing. |
| `indexSizeBytes` | int | Size of the vector index on disk. |
| `lastIndexedAt` | string | ISO-8601 timestamp of the last completed indexing run. |
| `status` | string | `"Running"`, `"Completed"`, `"Failed"`, etc. |
Polling for Indexing Completion¶
For automation workflows you'll want to poll until indexing finishes:
```python
import asyncio

from gaia_sdk import GaiaClient

async def wait_for_indexing(gaia: GaiaClient, name: str, poll_seconds: int = 10):
    """Poll until dataset indexing reaches a terminal state."""
    while True:
        details = await gaia.get_dataset(name)
        stats = details.indexing_stats or {}
        status = stats.get("status", "Unknown")
        indexed = stats.get("indexedFiles", 0)
        total = stats.get("totalFiles", 0)
        print(f"[{name}] {status} — {indexed}/{total} files")
        if status in ("Completed", "Failed"):
            return details
        await asyncio.sleep(poll_seconds)
```
Re-indexing and Incremental Updates¶
Gaia supports incremental indexing by default. When you trigger indexing on an already-indexed dataset:
- New files are chunked, embedded, and added to the index.
- Modified files (detected via checksum or last-modified timestamp) are re-processed, and stale chunks are replaced.
- Deleted files have their chunks removed from the index.
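Conceptually, the change detection behaves like a diff over per-file checksums. The sketch below illustrates that logic only — it is not Gaia's actual implementation — using plain dicts mapping path to checksum:

```python
def diff_snapshot(previous: dict, current: dict):
    """Classify files as added / modified / deleted by comparing checksum maps."""
    added = [p for p in current if p not in previous]
    modified = [p for p in current if p in previous and current[p] != previous[p]]
    deleted = [p for p in previous if p not in current]
    return added, modified, deleted

prev = {"/a.pdf": "c1", "/b.docx": "c2", "/c.txt": "c3"}
curr = {"/a.pdf": "c1", "/b.docx": "c9", "/d.md": "c4"}
print(diff_snapshot(prev, curr))  # (['/d.md'], ['/b.docx'], ['/c.txt'])
```

Added and modified files get re-chunked and re-embedded; deleted files have their chunks dropped from the index.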
To force a full re-index (e.g., after changing chunking settings), pass the appropriate flag:
```bash
curl -s -X POST \
  -H "apiKey: $GAIA_API_KEY" \
  -H "Content-Type: application/json" \
  "https://helios.cohesity.com/v2/mcm/gaia/dataset/engineering-docs/index" \
  -d '{"fullReindex": true}'
```
Full re-index duration
A full re-index processes every file regardless of change status. For large datasets this can take significantly longer — schedule it during your indexing window when possible.
Dataset Discovery¶
Once a dataset is indexed, the discovery endpoint provides a hierarchical view of all ingested content — useful for building file browsers or verifying that the right documents were picked up.
GET /dataset/{id}/discovery¶
```python
async with GaiaClient(api_key="YOUR_API_KEY") as gaia:
    tree = await gaia.get_discovery("engineering-docs")

    def print_tree(node, indent=0):
        prefix = "  " * indent
        print(f"{prefix}{node.get('name', '/')} ({node.get('type', 'dir')})")
        for child in node.get("children", []):
            print_tree(child, indent + 1)

    print_tree(tree)
```
The response is a recursive tree structure:
```json
{
  "name": "/",
  "type": "directory",
  "children": [
    {
      "name": "design-docs",
      "type": "directory",
      "children": [
        { "name": "architecture.pdf", "type": "file", "sizeBytes": 204800 },
        { "name": "api-spec.md", "type": "file", "sizeBytes": 15360 }
      ]
    }
  ]
}
```
Complete Example: Create → Index → Poll → Query¶
Putting it all together — a single script that creates a dataset, triggers indexing, waits for completion, and runs a test query.
```python
import asyncio

from gaia_sdk import GaiaClient

DATASET_NAME = "quarterly-reports"

async def main():
    async with GaiaClient(api_key="YOUR_API_KEY") as gaia:
        # 1. Create the dataset
        print("Creating dataset …")
        await gaia.create_dataset(
            name=DATASET_NAME,
            description="FY25 quarterly earnings reports",
            dataSources=[
                {
                    "sourceType": "CohesityView",
                    "viewName": "finance-share",
                    "includePaths": ["/reports/FY25"],
                    "fileFilters": {
                        "includeExtensions": [".pdf", ".xlsx"],
                    },
                }
            ],
        )

        # 2. Trigger indexing
        print("Triggering indexing …")
        await gaia.trigger_indexing(DATASET_NAME)

        # 3. Poll until complete
        while True:
            details = await gaia.get_dataset(DATASET_NAME)
            stats = details.indexing_stats or {}
            status = stats.get("status", "Unknown")
            indexed = stats.get("indexedFiles", 0)
            total = stats.get("totalFiles", 0)
            print(f"  Indexing: {status} — {indexed}/{total} files")
            if status in ("Completed", "Failed"):
                break
            await asyncio.sleep(10)

        if status == "Failed":
            print("Indexing failed. Check the cluster logs.")
            return

        # 4. Run a test query
        print("\nRunning test query …")
        response = await gaia.ask(
            dataset_names=[DATASET_NAME],
            query="What was the Q3 revenue?",
        )
        print(f"Answer: {response.response_string}")
        if response.documents:
            print(f"\nSource documents ({len(response.documents)}):")
            for doc in response.documents:
                print(f"  • {doc.filename} (score: {doc.score:.3f})")

asyncio.run(main())
```
Best Practices¶
Naming conventions
Use lowercase, hyphenated names like `engineering-docs` or `finance-q3-2025`. Consistent naming makes filtering with `list_datasets(prefix=...)` straightforward.
Small, focused datasets
Prefer many small datasets over one monolithic one. Smaller datasets index faster, return more relevant results, and are easier to manage.
Include/Exclude paths
Always scope your data sources with `includePaths` and `excludePaths` to avoid ingesting temporary files, build artifacts, or irrelevant directories.
Monitor after every indexing run
Always check `failedFiles` after indexing completes. A non-zero count may indicate unsupported formats, permission issues, or corrupt files.
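That post-run check can be sketched as a small helper over an indexingStats dict as described in the reference table above (the warning wording here is illustrative):

```python
def check_indexing_health(stats: dict) -> list[str]:
    """Return human-readable warnings derived from an indexingStats payload."""
    warnings = []
    failed = stats.get("failedFiles", 0)
    total = stats.get("totalFiles", 0)
    if failed:
        warnings.append(f"{failed} of {total} files failed to index")
    if stats.get("status") == "Failed":
        warnings.append("indexing run ended in Failed state")
    return warnings

stats = {"totalFiles": 120, "indexedFiles": 117, "failedFiles": 3, "status": "Completed"}
print(check_indexing_health(stats))  # ['3 of 120 files failed to index']
```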
Platform Limits & Considerations¶
Official Gaia Limits
These limits are enforced by the Gaia platform. Plan your dataset strategy around them.
| Limit | Value |
|---|---|
| Max datasets per user account | 50 |
| Max objects per dataset | 1,000 |
| Max extracted text per dataset | 20 GB (status becomes "Warning" when exceeded) |
| Max file size | 100 MB |
| Dataset scope | Single cluster only (cannot span multiple clusters) |
| Snapshot types | Backup snapshots only (replication and archival not supported) |
Do Not Delete Indexed Snapshots
If a backup snapshot is indexed and utilized by a dataset, do not delete the snapshot from the Cohesity cluster. Deleting it will prevent Gaia from answering queries for that dataset.
Supported File Types¶
Gaia can index the following file formats:
| Category | Extensions |
|---|---|
| Microsoft Office | .doc, .docx, .xls, .xlsx, .ppt, .pptx |
| PDF & Document | .pdf, .odf, .rtf |
| Text & Web | .txt, .html, .xml |
Indexing Modes¶
When creating a dataset, you choose an indexing mode that determines how and when data is processed:
Continuous Indexing¶
Gaia automatically checks for new snapshots every hour. When a new snapshot is available, indexing begins automatically.
- The dataset always reflects the latest data
- If a file is removed in a newer snapshot, it's no longer available in the dataset
- Topic Explorer topics refresh every 7 days
- Requires Cohesity cluster version 7.2.2_u1 or later
```python
await gaia.create_dataset(
    name="live-reports",
    indexing_mode="continuous",
    dataSources=[{"sourceType": "CohesityView", "viewName": "reports"}],
)
```
One-Time Indexing¶
Index a specific point-in-time snapshot:
- Most recent snapshot — Indexes the latest available backup
- Custom date range — Indexes data within a specific time window (currently available for Microsoft 365 object types only)
On-Demand Indexing¶
For datasets created with Continuous Indexing, you can trigger an immediate index refresh instead of waiting for the hourly snapshot check, using the same indexing trigger described in Triggering Indexing (for example, `await gaia.trigger_indexing(name)`).
Inclusion & Exclusion Rules¶
When creating datasets, you can filter what gets indexed using file types and directory paths.
File Type Filters¶
Include or exclude specific file extensions. By default all supported types are included.
Directory Path Rules¶
Use regex patterns to include or exclude specific directory paths:
| Pattern | Effect |
|---|---|
| `/MyFiles/Payroll/.*` | Index all files in `/MyFiles/Payroll/` and subdirectories |
| `/myFiles/Payroll.*` | Index directories starting with "Payroll" under `/myFiles/` |
| `.*logs.*` | Match any path containing "logs" |
Rule Precedence
Exclusion takes precedence over inclusion. If a file matches both an include and exclude rule, it is excluded.
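The precedence rule can be sketched with Python's `re` module; this illustrates the semantics only, not the platform's actual matcher:

```python
import re

def is_indexed(path: str, includes: list[str], excludes: list[str]) -> bool:
    """Exclusion wins: a path matching any exclude pattern is skipped,
    even if it also matches an include pattern."""
    if any(re.fullmatch(p, path) for p in excludes):
        return False
    if not includes:  # no include rules: everything not excluded is in scope
        return True
    return any(re.fullmatch(p, path) for p in includes)

print(is_indexed("/MyFiles/Payroll/2025.xlsx", [r"/MyFiles/Payroll/.*"], []))               # True
print(is_indexed("/MyFiles/Payroll/logs/x.txt", [r"/MyFiles/Payroll/.*"], [r".*logs.*"]))   # False
```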
M365 Mailbox Restriction
For Microsoft 365 Mailbox objects, you cannot use file type or directory filters.
What's Next¶
With datasets created and indexed, you're ready to start querying.
- Querying & RAG — learn how Gaia's retrieval-augmented generation pipeline works.
- Topic Explorer — explore themes and suggested questions from your indexed data.
- Document Upload — add ad-hoc files to a dataset via the upload API.
- Dataset Discovery — explore the indexed content hierarchy in detail.