Dataset Discovery

Gaia's Discovery API analyzes the contents of an indexed dataset and produces a hierarchical map of categories, topics, and suggested questions. This powers features like navigation trees, topic explorers, and "suggested questions" panels that help users understand what they can ask before typing a single word.


What is Discovery?

When a dataset is indexed, Gaia can analyze its contents and produce a discovery hierarchy: a tree of categories and topics that describes what the dataset contains, with suggested questions attached to each node. A dataset of quarterly reports might produce a hierarchy like this:

Text Only
Dataset: "quarterly-reports"
├── Financial Results
│   ├── Revenue & Growth         → "What was total revenue in Q4?"
│   ├── Operating Expenses       → "How did OpEx change year-over-year?"
│   └── Profitability            → "What was the net income margin?"
├── Product Updates
│   ├── New Features             → "What features launched in Q4?"
│   └── Roadmap                  → "What's planned for next quarter?"
└── Market Analysis
    ├── Competitive Landscape    → "Who are our main competitors?"
    └── Market Trends            → "What trends are affecting our market?"

API: GET /dataset/{id}/discovery

Retrieve the discovery hierarchy for a dataset.

Request:

Text Only
GET /v2/mcm/gaia/dataset/{dataset_id}/discovery?numLevels=2

Query Parameters:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| numLevels | int | 1 | Number of hierarchy levels to return |
| level | int | | Filter to a specific level |
| uuid | string | | Filter to children of a specific node |
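
As a rough illustration of how these parameters combine into a request URL, here is a minimal sketch. The helper name `build_discovery_url` and the host `gaia.example.com` are hypothetical, and whether `level` and `uuid` can be combined in a single request is an assumption the source does not confirm.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def build_discovery_url(
    base_url: str,
    dataset_id: str,
    num_levels: Optional[int] = None,
    level: Optional[int] = None,
    uuid: Optional[str] = None,
) -> str:
    """Build a discovery URL, including only the parameters that were set."""
    params = {"numLevels": num_levels, "level": level, "uuid": uuid}
    # Drop unset parameters so the API falls back to its own defaults.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = f"{base_url.rstrip('/')}/dataset/{quote(dataset_id, safe='')}/discovery"
    return f"{url}?{query}" if query else url


# Hypothetical host; the path prefix matches the request shown above.
print(build_discovery_url(
    "https://gaia.example.com/v2/mcm/gaia", "quarterly-reports", num_levels=2
))
```

Leaving a parameter out entirely, rather than sending an empty value, keeps the documented default (`numLevels` = 1) in effect.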

Response:

JSON
{
  "discoveryResults": [
    {
      "uuid": "cat-001",
      "name": "Financial Results",
      "level": 0,
      "description": "Quarterly financial performance data including revenue, expenses, and profitability.",
      "suggestedQuestions": [
        "What was total revenue in Q4?",
        "How did operating expenses change?"
      ],
      "children": [
        {
          "uuid": "topic-001",
          "name": "Revenue & Growth",
          "level": 1,
          "description": "Revenue figures and growth metrics.",
          "suggestedQuestions": [
            "What was total revenue in Q4 2024?",
            "What was the year-over-year revenue growth rate?"
          ]
        }
      ]
    }
  ]
}

Response Fields

| Field | Type | Description |
|-------|------|-------------|
| uuid | string | Unique ID for this discovery node |
| name | string | Human-readable category/topic name |
| level | int | Depth in the hierarchy (0 = top-level) |
| description | string | Summary of what this category covers |
| suggestedQuestions | string[] | Example questions users can ask |
| children | array | Nested child nodes (if numLevels > 1) |
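
To make the recursive shape of the response concrete, the sketch below parses it into a small dataclass and walks it depth-first. The `DiscoveryNode` dataclass, `parse_node`, and `walk` are illustrative names, not part of the Gaia SDK. Treating `description`, `suggestedQuestions`, and `children` as optional is an assumption.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class DiscoveryNode:
    uuid: str
    name: str
    level: int
    description: str = ""
    suggested_questions: list[str] = field(default_factory=list)
    children: list["DiscoveryNode"] = field(default_factory=list)


def parse_node(raw: dict) -> DiscoveryNode:
    """Convert one raw JSON node (and its children) into a DiscoveryNode."""
    return DiscoveryNode(
        uuid=raw["uuid"],
        name=raw["name"],
        level=raw["level"],
        # Assumed optional: fall back to empty values when a field is absent.
        description=raw.get("description", ""),
        suggested_questions=list(raw.get("suggestedQuestions", [])),
        children=[parse_node(child) for child in raw.get("children", [])],
    )


def walk(nodes: list[DiscoveryNode]) -> Iterator[DiscoveryNode]:
    """Yield every node depth-first, parents before children."""
    for node in nodes:
        yield node
        yield from walk(node.children)


response = {
    "discoveryResults": [
        {
            "uuid": "cat-001",
            "name": "Financial Results",
            "level": 0,
            "children": [
                {"uuid": "topic-001", "name": "Revenue & Growth", "level": 1},
            ],
        }
    ]
}
tree = [parse_node(n) for n in response.get("discoveryResults", [])]
for node in walk(tree):
    # Indent each name by its depth in the hierarchy.
    print("  " * node.level + node.name)
```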

API: GET /dataset/{id}/discovery/{uuid}/summary

Get the full summary for a specific discovery node. Setting includeAllSuggestedQuestions=true asks for the node's complete list of suggested questions.

Text Only
GET /v2/mcm/gaia/dataset/{dataset_id}/discovery/{uuid}/summary?includeAllSuggestedQuestions=true

Response:

JSON
{
  "uuid": "topic-001",
  "name": "Revenue & Growth",
  "description": "Detailed analysis of revenue streams, growth rates, and financial projections.",
  "suggestedQuestions": [
    "What was total revenue in Q4 2024?",
    "What was the year-over-year revenue growth rate?",
    "Which business segment had the highest revenue?",
    "What is the revenue forecast for next quarter?"
  ]
}

API: POST /dataset/{id}/discovery/{uuid}/generate-more-questions

Generate additional suggested questions for a specific topic. This is useful when the initial set doesn't cover the user's area of interest.

Text Only
POST /v2/mcm/gaia/dataset/{dataset_id}/discovery/{uuid}/generate-more-questions

Response:

JSON
{
  "suggestedQuestions": [
    "How does revenue break down by region?",
    "What was the average deal size in Q4?",
    "How did subscription revenue compare to one-time sales?"
  ]
}
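
When appending generated questions to a list the user already sees, it may help to filter out repeats. The source does not say whether this endpoint can return questions that overlap with the existing set, so the sketch below treats that as a possibility. `merge_questions` is a hypothetical helper.

```python
def _normalize(question: str) -> str:
    # Case, extra whitespace, and a trailing "?" should not make two questions distinct.
    return " ".join(question.lower().split()).rstrip("?").strip()


def merge_questions(existing: list[str], generated: list[str]) -> list[str]:
    """Append generated questions to existing ones, skipping near-duplicates."""
    merged: list[str] = []
    seen: set[str] = set()
    for question in [*existing, *generated]:
        key = _normalize(question)
        if key and key not in seen:
            seen.add(key)
            merged.append(question.strip())
    return merged


existing = ["What was total revenue in Q4 2024?"]
generated = [
    "what was total revenue in Q4 2024?",
    "How does revenue break down by region?",
]
print(merge_questions(existing, generated))
```

The original order is preserved, so questions the user has already seen stay in place and new ones appear at the end.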

Backend Implementation

Using the SDK

Python
from fastapi import APIRouter, Depends
from gaia_sdk import GaiaClient
from backend.api.dependencies import get_gaia_client

router = APIRouter()


@router.get("/dataset/{dataset_id}/discovery", tags=["Discovery"])
async def get_discovery(
    dataset_id: str,
    client: GaiaClient = Depends(get_gaia_client),
):
    """Get the discovery hierarchy for a dataset."""
    return await client.get_discovery(dataset_id)

Using raw httpx

For more control (filtering by level, fetching summaries, generating questions), call the REST endpoints directly with httpx:

Python
import httpx
from backend.settings import get_settings


async def get_discovery_tree(
    api_key: str,
    dataset_id: str,
    num_levels: int = 2,
) -> dict:
    """Fetch the full discovery tree for a dataset."""
    settings = get_settings()
    url = f"{settings.gaia_base_url}/dataset/{dataset_id}/discovery"

    async with httpx.AsyncClient(verify=settings.gaia_verify_ssl) as client:
        response = await client.get(
            url,
            headers={"apiKey": api_key},
            params={"numLevels": num_levels},
        )
        response.raise_for_status()
        return response.json()


async def get_node_summary(
    api_key: str,
    dataset_id: str,
    node_uuid: str,
) -> dict:
    """Get the summary and questions for a specific discovery node."""
    settings = get_settings()
    url = f"{settings.gaia_base_url}/dataset/{dataset_id}/discovery/{node_uuid}/summary"

    async with httpx.AsyncClient(verify=settings.gaia_verify_ssl) as client:
        response = await client.get(
            url,
            headers={"apiKey": api_key},
            params={"includeAllSuggestedQuestions": True},
        )
        response.raise_for_status()
        return response.json()


async def generate_more_questions(
    api_key: str,
    dataset_id: str,
    node_uuid: str,
) -> list[str]:
    """Generate additional questions for a discovery node."""
    settings = get_settings()
    url = f"{settings.gaia_base_url}/dataset/{dataset_id}/discovery/{node_uuid}/generate-more-questions"

    async with httpx.AsyncClient(verify=settings.gaia_verify_ssl) as client:
        response = await client.post(
            url,
            headers={"apiKey": api_key, "Content-Type": "application/json"},
        )
        response.raise_for_status()
        data = response.json()
        return data.get("suggestedQuestions", [])
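
If a UI needs summaries for several nodes at once, the calls can be issued concurrently rather than one by one. The following is a minimal sketch, not part of the reference code: `fetch_summaries` is a hypothetical helper that takes the fetching coroutine as an argument (for example, a small wrapper around `get_node_summary` above). The concurrency limit of 4 is an arbitrary assumption.

```python
import asyncio
from typing import Awaitable, Callable

SummaryFetcher = Callable[[str], Awaitable[dict]]


async def fetch_summaries(
    tree: dict,
    fetch_summary: SummaryFetcher,
    max_concurrency: int = 4,
) -> dict[str, dict]:
    """Fetch the summary of every top-level node, a few requests at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(node_uuid: str) -> tuple[str, dict]:
        # The semaphore caps how many requests are in flight at once.
        async with semaphore:
            return node_uuid, await fetch_summary(node_uuid)

    uuids = [node["uuid"] for node in tree.get("discoveryResults", [])]
    pairs = await asyncio.gather(*(fetch_one(u) for u in uuids))
    return dict(pairs)


async def _demo() -> None:
    # Stand-in for a real call such as get_node_summary(api_key, dataset_id, uuid).
    async def fake_fetch(node_uuid: str) -> dict:
        return {"uuid": node_uuid, "suggestedQuestions": []}

    tree = {"discoveryResults": [{"uuid": "cat-001"}, {"uuid": "cat-002"}]}
    print(await fetch_summaries(tree, fake_fetch))


asyncio.run(_demo())
```

Passing the fetcher in as a parameter keeps the helper independent of HTTP details and makes it easy to exercise with a stub, as `_demo` does.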

Building Discovery UIs

Topic Navigation Tree

A collapsible tree that lets users browse the dataset's content hierarchy:

TSX
// src/components/DiscoveryTree.tsx

import { useState, useEffect } from "react";
import { api } from "../api/client";

interface DiscoveryNode {
  uuid: string;
  name: string;
  level: number;
  description?: string;
  suggestedQuestions?: string[];
  children?: DiscoveryNode[];
}

interface Props {
  datasetId: string;
  onSelectQuestion: (question: string) => void;
}

export function DiscoveryTree({ datasetId, onSelectQuestion }: Props) {
  const [tree, setTree] = useState<DiscoveryNode[]>([]);
  const [expanded, setExpanded] = useState<Set<string>>(new Set());

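  useEffect(() => {
    // Ignore responses that arrive after datasetId changes or the component unmounts.
    let cancelled = false;
    api
      .get<{ discoveryResults: DiscoveryNode[] }>(
        `/dataset/${datasetId}/discovery?numLevels=2`,
      )
      .then((data) => {
        if (!cancelled) setTree(data.discoveryResults ?? []);
      })
      .catch((err) => console.error("Failed to load discovery tree", err));
    return () => {
      cancelled = true;
    };
  }, [datasetId]);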

  const toggle = (uuid: string) => {
    setExpanded((prev) => {
      const next = new Set(prev);
      if (next.has(uuid)) next.delete(uuid);
      else next.add(uuid);
      return next;
    });
  };

  const renderNode = (node: DiscoveryNode, depth: number = 0) => (
    <div key={node.uuid} style={{ marginLeft: depth * 16 }}>
      <button
        onClick={() => toggle(node.uuid)}
        className="flex items-center gap-2 py-1 text-sm font-medium hover:text-purple-600"
      >
        <span>{expanded.has(node.uuid) ? "▼" : "▶"}</span>
        {node.name}
      </button>

      {expanded.has(node.uuid) && (
        <div className="ml-6 space-y-1">
          {node.description && (
            <p className="text-xs text-gray-500">{node.description}</p>
          )}
          {node.suggestedQuestions?.map((q, i) => (
            <button
              key={i}
              onClick={() => onSelectQuestion(q)}
              className="block text-sm text-purple-600 hover:underline"
            >
              {q}
            </button>
          ))}
          {node.children?.map((child) => renderNode(child, depth + 1))}
        </div>
      )}
    </div>
  );

  return (
    <div className="space-y-1">
      <h3 className="font-semibold text-sm mb-2">Explore Topics</h3>
      {tree.map((node) => renderNode(node))}
    </div>
  );
}

Suggested Questions Panel

A simpler component that shows top-level suggested questions as clickable pills:

TSX
// src/components/SuggestedQuestions.tsx

interface Props {
  questions: string[];
  onSelect: (question: string) => void;
}

export function SuggestedQuestions({ questions, onSelect }: Props) {
  if (questions.length === 0) return null;

  return (
    <div className="space-y-2">
      <h3 className="text-sm font-semibold text-gray-700">Try asking:</h3>
      <div className="flex flex-wrap gap-2">
        {questions.map((q, i) => (
          <button
            key={i}
            onClick={() => onSelect(q)}
            className="rounded-full border border-purple-200 bg-purple-50 px-3 py-1 text-sm text-purple-700 hover:bg-purple-100 transition"
          >
            {q}
          </button>
        ))}
      </div>
    </div>
  );
}
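
On the backend, the `questions` list for this panel could be assembled from the discovery response. The sketch below is one possible approach under stated assumptions: `top_level_questions` is a hypothetical helper, and the round-robin order and default limit of 6 are design choices made here so that every category is represented before any category contributes a second question.

```python
from itertools import zip_longest


def top_level_questions(discovery: dict, limit: int = 6) -> list[str]:
    """Pick up to `limit` questions, taking turns across top-level nodes."""
    per_node = [
        node.get("suggestedQuestions", [])
        for node in discovery.get("discoveryResults", [])
    ]
    picked: list[str] = []
    # zip_longest yields the first question of every node, then the second, and so on;
    # nodes that run out of questions contribute None.
    for round_ in zip_longest(*per_node):
        for question in round_:
            if len(picked) >= limit:
                return picked
            if question is not None and question not in picked:
                picked.append(question)
    return picked


discovery = {
    "discoveryResults": [
        {"suggestedQuestions": ["What was total revenue in Q4?", "How did operating expenses change?"]},
        {"suggestedQuestions": ["What features launched in Q4?"]},
    ]
}
print(top_level_questions(discovery, limit=2))
```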

Use Case: Intelligent Dataset Routing

Discovery data can also be used programmatically. If your app supports multiple datasets, you can use discovery metadata to automatically route questions to the most relevant dataset — as implemented in the reference app's query_service:

Python
from gaia_sdk import GaiaClient


async def find_best_dataset(
    question: str,
    available_datasets: list[str],
    gaia_client: GaiaClient,
) -> list[str]:
    """Use discovery metadata to match a question to the best dataset(s)."""
    dataset_metadata = []

    for ds_name in available_datasets:
        try:
            discovery = await gaia_client.get_discovery(ds_name)
            results = discovery.get("discoveryResults", [])
            all_questions = []
            descriptions = []
            # Collect descriptions and questions from the returned nodes.
            for node in results:
                descriptions.append(node.get("description", ""))
                all_questions.extend(node.get("suggestedQuestions", []))
            # Keep only the first few descriptions and questions per dataset.
            dataset_metadata.append({
                "name": ds_name,
                "description": " | ".join(descriptions[:3]),
                "suggested_questions": all_questions[:10],
            })
        except Exception:
            # If discovery fails for a dataset, keep it as a candidate by name only.
            dataset_metadata.append({"name": ds_name})

    # Use a helper LLM or heuristic to match
    # (see Helper LLM Integration chapter).
    # match_datasets is not defined in this snippet and must be provided separately.
    return match_datasets(question, dataset_metadata)
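
For the heuristic option, a simple word-overlap score is one way to fill in `match_datasets`. The sketch below is a hypothetical stand-in, not the reference app's implementation. The stopword list, the `top_k` parameter, the fallback to all datasets, and the `hr-policies` dataset in the example are all assumptions made for illustration.

```python
import re

_STOPWORDS = {
    "a", "an", "and", "are", "did", "do", "for", "how", "in", "is",
    "of", "our", "the", "to", "was", "what", "which", "who",
}


def _tokens(text: str) -> set[str]:
    """Lowercase alphanumeric words, minus common stopwords."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in _STOPWORDS}


def match_datasets(question: str, dataset_metadata: list[dict], top_k: int = 1) -> list[str]:
    """Rank datasets by word overlap between the question and their discovery metadata."""
    question_tokens = _tokens(question)
    scored = []
    for meta in dataset_metadata:
        text = " ".join([meta.get("description", ""), *meta.get("suggested_questions", [])])
        scored.append((len(question_tokens & _tokens(text)), meta["name"]))
    # Highest overlap first; sorted() is stable, so ties keep their input order.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    best = [name for score, name in ranked if score > 0][:top_k]
    # Fall back to every dataset when nothing matches, rather than guessing.
    return best or [meta["name"] for meta in dataset_metadata]


metadata = [
    {
        "name": "quarterly-reports",
        "description": "Quarterly financial performance data",
        "suggested_questions": ["What was total revenue in Q4?"],
    },
    {
        "name": "hr-policies",  # hypothetical second dataset
        "description": "Vacation and leave policies",
        "suggested_questions": ["How many vacation days do employees get?"],
    },
]
print(match_datasets("What was revenue in Q4?", metadata))
```

Word overlap ignores synonyms and phrasing, so a helper LLM is likely to route more accurately. The heuristic is cheap and deterministic, which can make it a reasonable first pass or fallback.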

Next Steps