Understanding Datasets¶
Datasets are the core organizational unit in Gaia. A dataset is a named collection of indexed data from one or more Cohesity-managed sources. Every query you send to Gaia runs against one or more datasets.
What is a Dataset?¶
Think of a dataset as a searchable index over a slice of your enterprise data. When you create a dataset, you tell Gaia:
- Which data sources to include (backup snapshots, file shares, M365 mailboxes, etc.)
- A name to reference it by in API calls
Gaia then indexes the content — extracting text, chunking it, generating embeddings, and storing everything in its vector store. Once indexing is complete, the dataset is ready for queries.
```
┌─────────────────────┐
│    Data Sources     │       ┌───────────────────────┐
│                     │       │        Dataset        │
│ • Backup snapshots  │──────▶│  "financial-reports"  │
│ • File shares       │       │                       │
│ • M365 mailboxes    │       │  Status: ready        │
│ • Uploaded files    │       │  Objects: 1,247       │
│                     │       │  Last indexed: 2h ago │
└─────────────────────┘       └───────────────────────┘
                                          │
                                          ▼
                               ┌────────────────┐
                               │  Vector Store  │
                               │  (Embeddings)  │
                               └────────────────┘
```
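The extract-chunk-embed pipeline runs inside Gaia and isn't exposed by the API, but the chunking step is worth visualizing. Here's a minimal sketch of fixed-size chunking with overlap, a common approach in RAG indexers — the sizes and the `chunk_text` helper are illustrative, not Gaia's actual parameters:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of roughly chunk_size characters.

    Overlap keeps sentences that straddle a boundary searchable from both
    neighboring chunks. Values here are illustrative, not Gaia's internals.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

pieces = chunk_text("a" * 1200, chunk_size=500, overlap=50)
print(len(pieces))  # 3 chunks cover 1200 characters with 50-char overlaps
```

Each chunk is then embedded into a vector and written to the vector store, which is what makes semantic search possible at query time.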
Dataset Lifecycle¶
Every dataset goes through the following stages:
1. Create¶
A new, empty dataset is created with a name and optional description.
2. Configure Sources¶
Data sources are attached to the dataset — these define what data Gaia should index.
3. Index¶
Gaia processes the data sources: extracts text from documents, splits content into chunks, generates vector embeddings, and populates the vector store. This can take minutes to hours depending on data volume.
4. Query¶
Once indexing completes, the dataset enters the ready state and can be queried via /ask, /ask/stream, or /ask/exhaustive.
Re-index¶
When source data changes (new backups, updated files), you can trigger re-indexing to keep the dataset current.
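The five stages above imply a simple state progression. The sketch below is inferred from this page, not an official API contract — in particular, "created" stands in for a dataset whose sources are configured but not yet indexed:

```python
# Transitions implied by the lifecycle above (illustration only, not an
# official API contract).
TRANSITIONS = {
    "created": {"indexing"},         # step 3: trigger indexing
    "indexing": {"ready", "error"},  # indexing finishes or fails
    "ready": {"indexing"},           # step 5: re-index on source changes
    "error": {"indexing"},           # fix the configuration, then retry
}

def can_transition(current: str, target: str) -> bool:
    """Return True if the lifecycle allows moving from current to target."""
    return target in TRANSITIONS.get(current, set())

print(can_transition("indexing", "ready"))  # True
print(can_transition("ready", "error"))     # False: errors arise during indexing
```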
Dataset States¶
| State | Description |
|---|---|
| ready | Indexing is complete. The dataset is queryable. |
| indexing | Data is currently being indexed. Partial queries may work. |
| error | Indexing failed. Check the error details and data source configuration. |
Querying during indexing
You can query a dataset while it's in the indexing state. Gaia will search whatever has been indexed so far. Results may be incomplete until indexing finishes.
Types of Data Sources¶
Gaia can index data from any source that Cohesity manages:
| Source Type | Examples |
|---|---|
| Backup Snapshots | VM backups, NAS backups, database backups — query data from any point-in-time snapshot |
| File Shares | SMB/NFS shares protected by Cohesity |
| Microsoft 365 | Exchange mailboxes, OneDrive files, SharePoint sites |
| Databases | SQL Server, Oracle, and other databases backed up by Cohesity |
| Uploaded Files | Ad-hoc documents uploaded directly via the Gaia upload API |
Data stays in Cohesity
Gaia doesn't copy your data to an external service. Indexing and vector storage happen within the Cohesity platform, preserving your existing security perimeter.
API Operations¶
List All Datasets¶
Retrieve a list of all available datasets.
```python
async with GaiaClient.from_env() as gaia:
    datasets = await gaia.list_datasets()
    for ds in datasets:
        print(f"{ds.name:30s} status={ds.status} objects={ds.object_count}")
```
Response fields on each dataset:
| Field | Type | Description |
|---|---|---|
| name | str | Unique dataset name |
| status | str | Current state: ready, indexing, or error |
| description | str? | Optional description |
| objectCount | int? | Number of indexed objects |
| createdAt | str? | ISO 8601 creation timestamp |
| updatedAt | str? | ISO 8601 last-updated timestamp |
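As a quick sanity check on those fields, here's a sketch that filters a raw listing (represented as plain dicts using the field names above; the sample records are made up) down to the datasets that are safe to query:

```python
# Sample records shaped like the response fields above (values are made up).
datasets = [
    {"name": "financial-reports", "status": "ready", "objectCount": 1247},
    {"name": "incident-logs-2025", "status": "indexing", "objectCount": 312},
    {"name": "old-archive", "status": "error", "objectCount": None},
]

# Only datasets in the ready state are guaranteed to return complete results.
queryable = [ds["name"] for ds in datasets if ds["status"] == "ready"]
print(queryable)  # ['financial-reports']
```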
Get Dataset Details¶
Retrieve detailed information about a specific dataset, including its data sources and indexing statistics.
```python
async with GaiaClient.from_env() as gaia:
    details = await gaia.get_dataset("financial-reports")
    print(f"Name: {details.name}")
    print(f"Status: {details.status}")
    print(f"Objects: {details.object_count}")

    if details.data_sources:
        print("\nData Sources:")
        for src in details.data_sources:
            print(f"  • {src.get('type', 'unknown')}: {src.get('name', 'N/A')}")

    if details.indexing_stats:
        stats = details.indexing_stats
        print("\nIndexing Stats:")
        print(f"  Total docs:   {stats.get('totalDocuments', 'N/A')}")
        print(f"  Indexed docs: {stats.get('indexedDocuments', 'N/A')}")
        print(f"  Failed docs:  {stats.get('failedDocuments', 'N/A')}")
```
Create a Dataset¶
Create a new dataset programmatically.
```python
async with GaiaClient.from_env() as gaia:
    result = await gaia.create_dataset(
        name="incident-logs-2025",
        description="Server and application logs from 2025 backups",
    )
    print(f"Created dataset: {result}")
```
Dataset creation is just step one
Creating a dataset via the API gives you an empty dataset. Data sources are typically configured through the Cohesity UI, though some configurations can also be set via the API. After configuring sources, you need to trigger indexing.
Trigger Indexing¶
Start or restart the indexing process for a dataset.
```python
async with GaiaClient.from_env() as gaia:
    result = await gaia.trigger_indexing("incident-logs-2025")
    print(f"Indexing triggered: {result}")
```
Delete a Dataset¶
Remove a dataset and its indexed data.
```python
async with GaiaClient.from_env() as gaia:
    result = await gaia.delete_dataset("old-dataset")
    print(f"Deleted: {result}")
```
Deletion is permanent
Deleting a dataset removes all indexed data (embeddings, chunks, metadata) from the vector store. The original source data in Cohesity is not affected.
Monitoring Indexing Progress¶
For large datasets, indexing can take a significant amount of time. Here's how to monitor progress:
```python
import asyncio

from gaia_sdk import GaiaClient


async def wait_for_indexing(dataset_name: str, poll_interval: int = 30):
    """Poll a dataset until indexing is complete."""
    async with GaiaClient.from_env() as gaia:
        while True:
            details = await gaia.get_dataset(dataset_name)

            if details.status == "ready":
                print(f"\n✓ Dataset '{dataset_name}' is ready!")
                if details.indexing_stats:
                    stats = details.indexing_stats
                    print(f"  Total documents:   {stats.get('totalDocuments', 'N/A')}")
                    print(f"  Indexed documents: {stats.get('indexedDocuments', 'N/A')}")
                return details

            if details.status == "error":
                print(f"\n✗ Dataset '{dataset_name}' encountered an error.")
                return details

            # Still indexing -- report progress if stats are available
            progress = ""
            if details.indexing_stats:
                stats = details.indexing_stats
                total = stats.get("totalDocuments", 0)
                indexed = stats.get("indexedDocuments", 0)
                if total > 0:
                    pct = (indexed / total) * 100
                    progress = f" ({indexed}/{total} = {pct:.0f}%)"

            print(f"  ⟳ Indexing{progress}... checking again in {poll_interval}s")
            await asyncio.sleep(poll_interval)


asyncio.run(wait_for_indexing("incident-logs-2025"))
```
Best Practices¶
Naming Conventions¶
Use clear, descriptive, lowercase names with hyphens:
| Good | Bad |
|---|---|
| financial-reports-2025 | FinReports |
| it-runbooks-prod | dataset1 |
| hr-policies-global | HR stuff |
| m365-ceo-mailbox | mailbox |
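The convention can be checked mechanically. Here's a small helper encoding the rule above (lowercase alphanumeric words joined by single hyphens) — note that this regex mirrors the style guide on this page, not a name restriction enforced by the API:

```python
import re

# Lowercase alphanumeric words separated by single hyphens, per the naming
# convention above. A style check, not a server-side validation rule.
NAME_RE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")

def is_good_dataset_name(name: str) -> bool:
    """Return True if the name follows the lowercase-hyphenated convention."""
    return bool(NAME_RE.match(name))

for name in ["financial-reports-2025", "FinReports", "HR stuff", "m365-ceo-mailbox"]:
    print(f"{name!r}: {is_good_dataset_name(name)}")
```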
Dataset Organization¶
One dataset per domain or use case
Organize datasets by business domain or use case rather than by technical source. A dataset called incident-investigation that pulls from server logs, ticketing systems, and runbooks is more useful than three separate datasets you'd need to query individually.
- Group related data. Combine data sources that are commonly queried together into a single dataset.
- Separate by access control. If different user groups should access different data, use separate datasets and control access at the session/application level.
- Create temporal datasets. For compliance or forensics, create datasets scoped to specific time ranges (e.g., logs-q1-2025).
Refresh Schedules¶
- Static data (historical archives, compliance snapshots) — Index once, no refresh needed.
- Slowly changing data (policies, runbooks, documentation) — Re-index weekly or on change.
- Frequently changing data (logs, emails, tickets) — Re-index daily or trigger re-indexing when new backups complete.
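The cadences above can be encoded as a simple lookup when you automate re-indexing. The categories and intervals below are suggestions from this guide, not anything enforced by Gaia:

```python
# Map the volatility categories above to an illustrative re-index cadence
# in seconds. These intervals are suggestions, not enforced by Gaia.
REFRESH_INTERVALS = {
    "static": None,         # index once, never refresh
    "slow": 7 * 24 * 3600,  # weekly: policies, runbooks, documentation
    "fast": 24 * 3600,      # daily: logs, emails, tickets
}

def refresh_due(category: str, seconds_since_last_index: float) -> bool:
    """Return True if the dataset should be re-indexed now."""
    interval = REFRESH_INTERVALS[category]
    return interval is not None and seconds_since_last_index >= interval

print(refresh_due("fast", 90_000))  # True: last indexed more than a day ago
print(refresh_due("static", 1e9))   # False: static data never refreshes
```

A scheduler can pair this check with the trigger_indexing call shown earlier.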
Querying Multiple Datasets¶
You can query across multiple datasets in a single /ask call by passing multiple names:
```python
async with GaiaClient.from_env() as gaia:
    response = await gaia.ask(
        dataset_names=["it-runbooks", "incident-logs-2025"],
        query="How do we handle a database failover?",
    )
```
Gaia merges results from all specified datasets and uses the combined context to generate the answer. This is powerful for cross-domain queries like "What was the root cause of the March 5 outage?" that might span infrastructure logs, runbooks, and incident tickets.
Performance consideration
Querying multiple datasets is slightly slower than querying a single dataset because Gaia searches each dataset's vector store independently and then merges the results. For latency-sensitive applications, prefer a single well-organized dataset.
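Gaia's merge strategy is internal, but the idea can be sketched: pool the retrieved hits from each dataset and keep a global top-k by relevance score. The `merge_results` helper and the sample scores below are illustrative only:

```python
# One plausible merge approach (Gaia's actual strategy is internal):
# pool (score, chunk) hits from every dataset, keep the global top-k.
def merge_results(per_dataset: dict[str, list[tuple[float, str]]], k: int = 3) -> list[str]:
    all_hits = [hit for hits in per_dataset.values() for hit in hits]
    all_hits.sort(key=lambda hit: hit[0], reverse=True)  # highest relevance first
    return [chunk for _, chunk in all_hits[:k]]

# Hypothetical per-dataset retrieval results (scores and chunks are made up).
results = {
    "it-runbooks": [(0.91, "failover runbook"), (0.40, "backup policy")],
    "incident-logs-2025": [(0.87, "db failover log"), (0.55, "disk alert")],
}
print(merge_results(results))  # ['failover runbook', 'db failover log', 'disk alert']
```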
Next Steps¶
Now that you understand datasets, you're ready to build:
- Datasets & Indexing — Deep dive into creating, configuring, and indexing datasets.
- Querying & RAG — Learn how to query your datasets with the /ask endpoint.
- Backend with FastAPI — Wrap these operations in a proper backend.