Dataset Discovery
Gaia's Discovery API analyzes the contents of an indexed dataset and produces a hierarchical map of categories, topics, and suggested questions. This powers features like navigation trees, topic explorers, and "suggested questions" panels that help users understand what they can ask before typing a single word.
What is Discovery?
When a dataset is indexed, Gaia can analyze its contents and produce a discovery hierarchy — a tree of categories and topics that describes what the dataset contains, along with suggested questions for each topic.
Dataset: "quarterly-reports"
├── Financial Results
│   ├── Revenue & Growth → "What was total revenue in Q4?"
│   ├── Operating Expenses → "How did OpEx change year-over-year?"
│   └── Profitability → "What was the net income margin?"
├── Product Updates
│   ├── New Features → "What features launched in Q4?"
│   └── Roadmap → "What's planned for next quarter?"
└── Market Analysis
    ├── Competitive Landscape → "Who are our main competitors?"
    └── Market Trends → "What trends are affecting our market?"
API: GET /dataset/{id}/discovery
Retrieve the discovery hierarchy for a dataset.
Request:
Query Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| numLevels | int | 1 | Number of hierarchy levels to return |
| level | int | — | Filter to a specific level |
| uuid | string | — | Filter to children of a specific node |
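The query parameters above can be assembled into a request URL as in the following sketch. The helper name `build_discovery_url` and the example base URL are assumptions made for illustration; only the parameter names `numLevels`, `level`, and `uuid` come from the table. Whether `level` and `uuid` may be combined in a single request is not stated here, so the helper simply forwards whichever values are set.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def build_discovery_url(
    base_url: str,
    dataset_id: str,
    num_levels: Optional[int] = None,
    level: Optional[int] = None,
    uuid: Optional[str] = None,
) -> str:
    """Build a discovery URL, including only the parameters that were set."""
    params = {}
    if num_levels is not None:
        params["numLevels"] = num_levels
    if level is not None:  # level 0 is a valid value, so compare against None
        params["level"] = level
    if uuid is not None:
        params["uuid"] = uuid

    url = f"{base_url.rstrip('/')}/dataset/{quote(dataset_id, safe='')}/discovery"
    return f"{url}?{urlencode(params)}" if params else url


# Hypothetical base URL; substitute the value used by your deployment.
print(build_discovery_url("https://gaia.example.com/v2/mcm/gaia", "quarterly-reports", num_levels=2))
```

Omitting unset parameters leaves the server-side defaults (such as `numLevels = 1`) in effect rather than overriding them from the client.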
Response:
{
  "discoveryResults": [
    {
      "uuid": "cat-001",
      "name": "Financial Results",
      "level": 0,
      "description": "Quarterly financial performance data including revenue, expenses, and profitability.",
      "suggestedQuestions": [
        "What was total revenue in Q4?",
        "How did operating expenses change?"
      ],
      "children": [
        {
          "uuid": "topic-001",
          "name": "Revenue & Growth",
          "level": 1,
          "description": "Revenue figures and growth metrics.",
          "suggestedQuestions": [
            "What was total revenue in Q4 2024?",
            "What was the year-over-year revenue growth rate?"
          ]
        }
      ]
    }
  ]
}
Response Fields
| Field | Type | Description |
|---|---|---|
| uuid | string | Unique ID for this discovery node |
| name | string | Human-readable category/topic name |
| level | int | Depth in the hierarchy (0 = top-level) |
| description | string | Summary of what this category covers |
| suggestedQuestions | string[] | Example questions users can ask |
| children | array | Nested child nodes (if numLevels > 1) |
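Because `children` nests nodes of the same shape, the response is naturally processed with a recursive walk. The sketch below flattens a hierarchy into (breadcrumb, question) pairs; the function name `iter_questions` is hypothetical, and the sample data is a trimmed copy of the response shown earlier. It assumes `children` and `suggestedQuestions` may be missing or null on some nodes and treats both cases as empty.

```python
from typing import Iterator


def iter_questions(nodes: list[dict], path: tuple[str, ...] = ()) -> Iterator[tuple[str, str]]:
    """Yield (breadcrumb, question) pairs in depth-first order."""
    for node in nodes:
        node_path = path + (node["name"],)
        breadcrumb = " > ".join(node_path)
        for question in node.get("suggestedQuestions") or []:
            yield breadcrumb, question
        # Recurse into child nodes, if any were returned.
        yield from iter_questions(node.get("children") or [], node_path)


# Trimmed copy of the sample response.
response = {
    "discoveryResults": [
        {
            "uuid": "cat-001",
            "name": "Financial Results",
            "level": 0,
            "suggestedQuestions": ["What was total revenue in Q4?"],
            "children": [
                {
                    "uuid": "topic-001",
                    "name": "Revenue & Growth",
                    "level": 1,
                    "suggestedQuestions": ["What was total revenue in Q4 2024?"],
                }
            ],
        }
    ]
}

for breadcrumb, question in iter_questions(response["discoveryResults"]):
    print(f"{breadcrumb}: {question}")
```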
API: GET /dataset/{id}/discovery/{uuid}/summary
Get the full summary and all suggested questions for a specific discovery node.
GET /v2/mcm/gaia/dataset/{dataset_id}/discovery/{uuid}/summary?includeAllSuggestedQuestions=true
Response:
{
  "uuid": "topic-001",
  "name": "Revenue & Growth",
  "description": "Detailed analysis of revenue streams, growth rates, and financial projections.",
  "suggestedQuestions": [
    "What was total revenue in Q4 2024?",
    "What was the year-over-year revenue growth rate?",
    "Which business segment had the highest revenue?",
    "What is the revenue forecast for next quarter?"
  ]
}
API: POST /dataset/{id}/discovery/{uuid}/generate-more-questions
Generate additional suggested questions for a specific topic. Useful when the initial set doesn't cover the user's area of interest.
Response:
{
  "suggestedQuestions": [
    "How does revenue break down by region?",
    "What was the average deal size in Q4?",
    "How did subscription revenue compare to one-time sales?"
  ]
}
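This page does not say whether generated questions can repeat ones already shown for the node, so a client may want to deduplicate defensively before appending them to the UI. The following is a minimal sketch under that assumption; `merge_questions` and its normalization rule (case, whitespace, and trailing question marks are ignored) are illustrative choices, not part of the Gaia API.

```python
def merge_questions(existing: list[str], generated: list[str]) -> list[str]:
    """Append generated questions to the existing ones, skipping near-duplicates."""

    def normalize(question: str) -> str:
        # Ignore case, repeated whitespace, and trailing question marks.
        return " ".join(question.lower().split()).rstrip("?")

    seen = {normalize(q) for q in existing}
    merged = list(existing)
    for question in generated:
        key = normalize(question)
        if key and key not in seen:
            seen.add(key)
            merged.append(question)
    return merged


existing = ["What was total revenue in Q4 2024?"]
generated = [
    "what was total revenue in q4 2024?",
    "How does revenue break down by region?",
]
print(merge_questions(existing, generated))
```

The original order is preserved, so questions the user has already seen stay in place and new ones appear at the end.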
Backend Implementation
Using the SDK
from fastapi import APIRouter, Depends
from gaia_sdk import GaiaClient

from backend.api.dependencies import get_gaia_client

router = APIRouter()


@router.get("/dataset/{dataset_id}/discovery", tags=["Discovery"])
async def get_discovery(
    dataset_id: str,
    client: GaiaClient = Depends(get_gaia_client),
):
    """Get the discovery hierarchy for a dataset."""
    return await client.get_discovery(dataset_id)
Using raw httpx
For more control — filtering by level, fetching summaries, generating questions:
import httpx

from backend.settings import get_settings


async def get_discovery_tree(
    api_key: str,
    dataset_id: str,
    num_levels: int = 2,
) -> dict:
    """Fetch the full discovery tree for a dataset."""
    settings = get_settings()
    url = f"{settings.gaia_base_url}/dataset/{dataset_id}/discovery"

    async with httpx.AsyncClient(verify=settings.gaia_verify_ssl) as client:
        response = await client.get(
            url,
            headers={"apiKey": api_key},
            params={"numLevels": num_levels},
        )
        response.raise_for_status()
        return response.json()


async def get_node_summary(
    api_key: str,
    dataset_id: str,
    node_uuid: str,
) -> dict:
    """Get the summary and questions for a specific discovery node."""
    settings = get_settings()
    url = f"{settings.gaia_base_url}/dataset/{dataset_id}/discovery/{node_uuid}/summary"

    async with httpx.AsyncClient(verify=settings.gaia_verify_ssl) as client:
        response = await client.get(
            url,
            headers={"apiKey": api_key},
            params={"includeAllSuggestedQuestions": True},
        )
        response.raise_for_status()
        return response.json()


async def generate_more_questions(
    api_key: str,
    dataset_id: str,
    node_uuid: str,
) -> list[str]:
    """Generate additional questions for a discovery node."""
    settings = get_settings()
    url = f"{settings.gaia_base_url}/dataset/{dataset_id}/discovery/{node_uuid}/generate-more-questions"

    async with httpx.AsyncClient(verify=settings.gaia_verify_ssl) as client:
        response = await client.post(
            url,
            headers={"apiKey": api_key, "Content-Type": "application/json"},
        )
        response.raise_for_status()
        data = response.json()
        return data.get("suggestedQuestions", [])
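When a UI needs summaries for several nodes at once, the per-node calls can run concurrently instead of one after another. The sketch below shows one way to do that with `asyncio.gather` and a concurrency cap. The names `fetch_summaries` and `fake_fetch_summary` and the default limit of 5 are assumptions for illustration; the fetcher is injected so that the example runs without network access.

```python
import asyncio
from typing import Awaitable, Callable

# Any coroutine function that maps a node uuid to its summary dict.
SummaryFetcher = Callable[[str], Awaitable[dict]]


async def fetch_summaries(
    node_uuids: list[str],
    fetch_summary: SummaryFetcher,
    max_concurrency: int = 5,
) -> dict[str, dict]:
    """Fetch summaries for several nodes concurrently, keyed by uuid."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(node_uuid: str) -> tuple[str, dict]:
        # The semaphore caps how many requests are in flight at once.
        async with semaphore:
            return node_uuid, await fetch_summary(node_uuid)

    pairs = await asyncio.gather(*(fetch_one(u) for u in node_uuids))
    return dict(pairs)


async def fake_fetch_summary(node_uuid: str) -> dict:
    """Stand-in for a real call; returns a minimal summary-shaped dict."""
    await asyncio.sleep(0)
    return {"uuid": node_uuid, "suggestedQuestions": []}


summaries = asyncio.run(fetch_summaries(["cat-001", "topic-001"], fake_fetch_summary))
print(sorted(summaries))
```

In a real backend, the fetcher could be `functools.partial(get_node_summary, api_key, dataset_id)`, which leaves `node_uuid` as the only remaining argument. Note that `gather` propagates the first exception by default, so a single failed node fails the whole batch unless you pass `return_exceptions=True` and handle the results yourself.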
Building Discovery UIs
Topic Navigation Tree
A collapsible tree that lets users browse the dataset's content hierarchy:
// src/components/DiscoveryTree.tsx
import { useState, useEffect } from "react";
import { api } from "../api/client";

interface DiscoveryNode {
  uuid: string;
  name: string;
  level: number;
  description?: string;
  suggestedQuestions?: string[];
  children?: DiscoveryNode[];
}

interface Props {
  datasetId: string;
  onSelectQuestion: (question: string) => void;
}

export function DiscoveryTree({ datasetId, onSelectQuestion }: Props) {
  const [tree, setTree] = useState<DiscoveryNode[]>([]);
  const [expanded, setExpanded] = useState<Set<string>>(new Set());

  useEffect(() => {
    api
      .get<{ discoveryResults: DiscoveryNode[] }>(
        `/dataset/${datasetId}/discovery?numLevels=2`,
      )
      .then((data) => setTree(data.discoveryResults ?? []));
  }, [datasetId]);

  const toggle = (uuid: string) => {
    setExpanded((prev) => {
      const next = new Set(prev);
      if (next.has(uuid)) next.delete(uuid);
      else next.add(uuid);
      return next;
    });
  };

  const renderNode = (node: DiscoveryNode, depth: number = 0) => (
    <div key={node.uuid} style={{ marginLeft: depth * 16 }}>
      <button
        onClick={() => toggle(node.uuid)}
        className="flex items-center gap-2 py-1 text-sm font-medium hover:text-purple-600"
      >
        <span>{expanded.has(node.uuid) ? "▼" : "▶"}</span>
        {node.name}
      </button>
      {expanded.has(node.uuid) && (
        <div className="ml-6 space-y-1">
          {node.description && (
            <p className="text-xs text-gray-500">{node.description}</p>
          )}
          {node.suggestedQuestions?.map((q, i) => (
            <button
              key={i}
              onClick={() => onSelectQuestion(q)}
              className="block text-sm text-purple-600 hover:underline"
            >
              {q}
            </button>
          ))}
          {node.children?.map((child) => renderNode(child, depth + 1))}
        </div>
      )}
    </div>
  );

  return (
    <div className="space-y-1">
      <h3 className="font-semibold text-sm mb-2">Explore Topics</h3>
      {tree.map((node) => renderNode(node))}
    </div>
  );
}
Suggested Questions Panel
A simpler component that shows top-level suggested questions as clickable pills:
// src/components/SuggestedQuestions.tsx
interface Props {
  questions: string[];
  onSelect: (question: string) => void;
}

export function SuggestedQuestions({ questions, onSelect }: Props) {
  if (questions.length === 0) return null;

  return (
    <div className="space-y-2">
      <h3 className="text-sm font-semibold text-gray-700">Try asking:</h3>
      <div className="flex flex-wrap gap-2">
        {questions.map((q, i) => (
          <button
            key={i}
            onClick={() => onSelect(q)}
            className="rounded-full border border-purple-200 bg-purple-50 px-3 py-1 text-sm text-purple-700 hover:bg-purple-100 transition"
          >
            {q}
          </button>
        ))}
      </div>
    </div>
  );
}
Use Case: Intelligent Dataset Routing
Discovery data can also be used programmatically. If your app supports multiple datasets, you can use discovery metadata to automatically route questions to the most relevant dataset — as implemented in the reference app's query_service:
from gaia_sdk import GaiaClient


async def find_best_dataset(
    question: str,
    available_datasets: list[str],
    gaia_client: GaiaClient,
) -> list[str]:
    """Use discovery metadata to match a question to the best dataset(s)."""
    dataset_metadata = []

    for ds_name in available_datasets:
        try:
            discovery = await gaia_client.get_discovery(ds_name)
            results = discovery.get("discoveryResults", [])

            # Collect descriptions and questions from the top-level nodes.
            all_questions = []
            descriptions = []
            for node in results:
                descriptions.append(node.get("description", ""))
                all_questions.extend(node.get("suggestedQuestions", []))

            # Keep the metadata compact: at most 3 descriptions, 10 questions.
            dataset_metadata.append({
                "name": ds_name,
                "description": " | ".join(descriptions[:3]),
                "suggested_questions": all_questions[:10],
            })
        except Exception:
            # Discovery failed for this dataset; keep it as a name-only candidate.
            dataset_metadata.append({"name": ds_name})

    # Use a helper LLM or heuristic to match
    # (see Helper LLM Integration chapter).
    # match_datasets is not defined in this snippet.
    return match_datasets(question, dataset_metadata)
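The snippet above leaves `match_datasets` undefined. As one possible heuristic, and not the reference app's implementation, the sketch below ranks datasets by how many non-stopword tokens the question shares with each dataset's description and suggested questions. The stopword list, the `top_k` parameter, and the fall-back to all datasets when nothing overlaps are all assumptions chosen for illustration.

```python
import re

# Small illustrative stopword list; extend it for real use.
STOPWORDS = {
    "a", "an", "and", "are", "did", "do", "for", "how", "in", "is",
    "of", "our", "the", "to", "was", "what", "who",
}


def tokenize(text: str) -> set[str]:
    """Lowercase a string and return its non-stopword alphanumeric tokens."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def match_datasets(question: str, dataset_metadata: list[dict], top_k: int = 1) -> list[str]:
    """Rank datasets by token overlap between the question and their metadata."""
    question_tokens = tokenize(question)
    scored = []
    for meta in dataset_metadata:
        # Datasets whose discovery call failed carry only a name and score 0.
        corpus = " ".join([meta.get("description", ""), *meta.get("suggested_questions", [])])
        score = len(question_tokens & tokenize(corpus))
        scored.append((score, meta["name"]))

    # Highest score first; the sort is stable, so ties keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    matches = [name for score, name in scored if score > 0][:top_k]
    # Fall back to every dataset when nothing overlaps.
    return matches or [meta["name"] for meta in dataset_metadata]


metadata = [
    {
        "name": "quarterly-reports",
        "description": "Quarterly financial performance data including revenue, expenses, and profitability.",
        "suggested_questions": ["What was total revenue in Q4?"],
    },
    {
        "name": "hr-policies",
        "description": "Leave and benefits policies.",
        "suggested_questions": ["How many vacation days do employees get?"],
    },
    {"name": "broken"},
]
print(match_datasets("What was revenue in Q4?", metadata))
```

Token overlap is cheap and deterministic but misses synonyms and paraphrases, which is where a helper LLM would likely do better.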
Next Steps
- Sensitive Data Handling — Policies for masking PII in query results.
- Helper LLM Integration — Use discovery data with a secondary LLM.
- Refine & Feedback — Follow up discovery-guided queries with refinement.