Gaia + Unstructured.io Integration Patterns

This page is for developers who are new to these tools, and for AI agents automating documentation or code. It explains three practical ways to combine Cohesity Gaia (search and RAG) with the Unstructured API (document partitioning and chunking).

There is no single “Gaia + Unstructured” product. You connect them with your own scripts or services (often Python). The examples below show real Gaia SDK calls and realistic Unstructured HTTP calls. Copying files to Cohesity S3 or SmartFiles is something your organization must implement with the right APIs or IT automation, so the examples mark those steps as placeholders.


How to read this page

If you are…            Start here
New to APIs            Read Authentication, then Exhaustive Search.
Building a pipeline    Skim Big picture, then pick one use case and trace the numbered steps.
An AI agent            Use the Checklist for implementers at the end of each use case as a task list; environment variables are listed explicitly.

Big picture: who does what?

Text Only
┌─────────────────┐     ┌──────────────────────────┐     ┌─────────────────────┐
│  Cohesity Gaia  │     │  Your staging storage     │     │  Unstructured API   │
│  (search / RAG) │ ──► │  (S3, SmartFiles, share) │ ──► │  (partition / chunk) │
└─────────────────┘     └──────────────────────────┘     └─────────────────────┘
        │                          │                              │
        │                          │                              ▼
        │                          │                    JSON elements, chunks
        │                          │                    for DB, search, or
        │                          │                    upload back to Gaia
        ▼                          ▼
   docId, filepath,          Same bytes the           Optional: write text
   metadata from             business rules allow      back via Gaia upload
   exhaustive search
  • Gaia answers: “Which documents match this idea?” and returns document identifiers and snippets.
  • Staging storage answers: “Where can Unstructured read the full file bytes?” Often that means copying objects from a backup or share into a bucket or folder your job can access.
  • Unstructured answers: “What text (and structure) is inside this file?” so downstream systems get clean, machine-friendly content.

Prerequisites (checklist)

  1. Gaia API key and base URL — see Authentication. Use GaiaClient.from_env() in Python.
  2. A Gaia dataset already indexed (your files or uploads).
  3. Unstructured API key — create one in the Unstructured platform; confirm the partition URL in their docs (host names can change).
  4. A plan for file access — Gaia returns metadata such as filename, filepath, and doc_id. Your team must map those to actual file copy operations (Cohesity APIs, object storage sync, etc.). The examples use a placeholder function copy_source_to_staging() so you can plug in org-specific logic.
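
A minimal preflight sketch for the checklist above. It assumes the environment variable names used in the examples on this page (GAIA_API_KEY and UNSTRUCTURED_API_KEY); the helper name missing_env_vars is hypothetical and is not part of the Gaia SDK or of Unstructured.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

# Variable names taken from the examples on this page; adjust to your setup.
REQUIRED_VARS = ("GAIA_API_KEY", "UNSTRUCTURED_API_KEY")


def missing_env_vars(
    required: tuple[str, ...] = REQUIRED_VARS,
    env: Mapping[str, str] | None = None,
) -> list[str]:
    """Return the names of required variables that are unset or blank."""
    source = os.environ if env is None else env
    return [name for name in required if not source.get(name, "").strip()]
```

A script could call missing_env_vars() before creating any client and exit with a readable message if the returned list is not empty.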

Use case 1 — Search-driven, targeted ingestion

Goal: Run a natural-language or keyword search in Gaia to select only the files that matter, copy those files into a staging area (for example an S3 bucket or SmartFiles path that Unstructured can read), then partition them with Unstructured for a vector database, data lake, or a second Gaia dataset built from cleaned text.

When to use it: Product launches, audits, or migrations where processing every file in a share is too expensive, but processing Gaia-matched files is enough.

Flow

  1. Call exhaustive search on one dataset and paginate until you have all hits.
  2. Deduplicate by doc_id (or by path) so you do not process duplicates.
  3. For each unique document, resolve how to read the original bytes (from backup, export, or copy job).
  4. Copy bytes into staging (S3 prefix, NFS share, etc.).
  5. Call Unstructured on each staged file (or use their batch ingestion against that prefix).
  6. (Optional) Upload processed text or chunks back to Gaia with the document upload API if you want a derived searchable dataset.

Example: collect all search hits (Gaia) + stub staging + Unstructured partition

This example uses the Gaia SDK for exhaustive search. It uses httpx for Unstructured (same style as other examples in this repo). Replace copy_source_to_staging and the Unstructured URL with your real implementations.

Python
"""
Use case 1: Gaia exhaustive search → staging path → Unstructured partition.

Environment:
  GAIA_API_KEY, GAIA_BASE_URL (optional), GAIA_VERIFY_SSL (optional)
  UNSTRUCTURED_API_KEY
  UNSTRUCTURED_PARTITION_URL  # e.g. from Unstructured dashboard / docs

This script is educational: wire copy_source_to_staging() to your Cohesity / S3 workflow.
"""

from __future__ import annotations

import asyncio
import os
from pathlib import Path

import httpx
from gaia_sdk import GaiaClient

DATASET_NAME = "my-dataset"
SEARCH_QUERY = "quarterly revenue recognition policy"
STAGING_DIR = Path("./staging_unstructured")  # example: local folder Unstructured can read


async def fetch_all_documents(gaia: GaiaClient, dataset_name: str, query: str):
    """Page through exhaustive search until no pagination token remains."""
    token = None
    seen: set[str] = set()
    while True:
        result = await gaia.exhaustive_search(
            dataset_name=dataset_name,
            query=query,
            page_size=50,
            pagination_token=token,
        )
        batch = result.documents or []
        for doc in batch:
            key = doc.doc_id or doc.filepath or doc.filename or ""
            if key and key not in seen:
                seen.add(key)
                yield doc
        token = result.pagination_token
        if not token:
            break


async def copy_source_to_staging(doc, staging: Path) -> Path | None:
    """
    Map a Gaia Document to a file on disk your pipeline can open.

    Real systems might:
      - call Cohesity / DataProtect APIs to export the object
      - copy from an SMB share using filepath metadata
      - download from S3 using a mapped key

    Return a Path to the staged file, or None if the file could not be obtained.
    """
    # --- Replace this block with your organization's logic ---
    if not doc.filename:
        return None
    staging.mkdir(parents=True, exist_ok=True)
    dest = staging / doc.filename.replace("/", "_")
    if not dest.exists():
        # Placeholder: write a small stand-in file so the rest of the script runs in dev.
        dest.write_text(
            f"# Placeholder for {doc.filename}\n# doc_id={doc.doc_id}\n",
            encoding="utf-8",
        )
    return dest


async def partition_with_unstructured(file_path: Path, api_key: str, api_url: str) -> dict:
    """Send one file to Unstructured's partition endpoint (multipart upload)."""
    headers = {"unstructured-api-key": api_key}
    async with httpx.AsyncClient(timeout=120.0) as client:
        with file_path.open("rb") as f:
            files = {"files": (file_path.name, f, "application/octet-stream")}
            response = await client.post(api_url, headers=headers, files=files)
    response.raise_for_status()
    return response.json()


async def main() -> None:
    api_url = os.environ.get(
        "UNSTRUCTURED_PARTITION_URL",
        "https://api.unstructured.io/general/v0/general",
    )
    unst_key = os.environ["UNSTRUCTURED_API_KEY"]

    async with GaiaClient.from_env() as gaia:
        async for doc in fetch_all_documents(gaia, DATASET_NAME, SEARCH_QUERY):
            staged = await copy_source_to_staging(doc, STAGING_DIR)
            if staged is None:
                continue
            try:
                elements = await partition_with_unstructured(staged, unst_key, api_url)
            except httpx.HTTPError as exc:
                print(f"[skip] Unstructured error for {staged.name}: {exc}")
                continue
            print(f"[ok] {doc.filename}: {len(elements) if isinstance(elements, list) else 'response'} elements")

if __name__ == "__main__":
    asyncio.run(main())

Checklist for implementers (use case 1)

  • Confirm Unstructured partition URL and headers match current docs.
  • Replace placeholder copy_source_to_staging with a real Cohesity / S3 / share copy.
  • Add logging, retries, and rate limits for both Gaia and Unstructured.
  • If you upload results back to Gaia, associate uploads with a dataset per Document Upload.
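
The retry item in the checklist above could be sketched as a small generic wrapper. This is an illustration under stated assumptions, not something the Gaia SDK or Unstructured provides: the name with_retries, the default attempt count, and the delays are all hypothetical and should be tuned to the rate limits you actually observe.

```python
from __future__ import annotations

import asyncio
import random
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")


async def with_retries(
    call: Callable[[], Awaitable[T]],
    *,
    attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retry_on: tuple[type[BaseException], ...] = (Exception,),
) -> T:
    """Await call() up to `attempts` times with jittered exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return await call()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Double the delay on each failure, cap it, then add jitter.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            await asyncio.sleep(delay * random.uniform(0.5, 1.0))
    raise AssertionError("unreachable")
```

In the use case 1 script, the partition call could be wrapped as with_retries(lambda: partition_with_unstructured(staged, unst_key, api_url), retry_on=(httpx.TransportError,)), so that only network-level failures are retried and HTTP 4xx responses still fail fast.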

Use case 2 — Matter-scoped normalization (compliance / e-discovery style)

Goal: Treat Gaia exhaustive search as the manifest of “everything that matches this matter or regulatory query.” Copy those files to a case landing zone, then run Unstructured to produce consistent JSON or text for review tools, redaction pipelines, or long-term archive.

When to use it: You need completeness (not just the top 10 chunks): same search model as Exhaustive Search, but the consumer is a legal or compliance workflow, not a chat UI.

Flow

  1. Run exhaustive search with a precisely worded queryString (and keep a record of the query for audit).
  2. Record total_count from the response and paginate through all pages.
  3. Write a manifest file (CSV/JSON) listing doc_id, filename, filepath, score.
  4. Copy files to the case bucket; run Unstructured in batch mode or per file.
  5. Store Unstructured output with the manifest for traceability.

Example: build a JSON Lines manifest

Python
"""Build manifest.jsonl from exhaustive search (use case 2)."""

import asyncio
import json
from datetime import datetime, timezone

from gaia_sdk import GaiaClient

MANIFEST_PATH = "matter_123_manifest.jsonl"


async def main() -> None:
    async with GaiaClient.from_env() as gaia:
        token = None
        with open(MANIFEST_PATH, "w", encoding="utf-8") as out:
            while True:
                result = await gaia.exhaustive_search(
                    dataset_name="legal-hold",
                    query="Project Aurora vendor communications 2024",
                    page_size=100,
                    pagination_token=token,
                )
                for doc in result.documents or []:
                    record = {
                        "doc_id": doc.doc_id,
                        "filename": doc.filename,
                        "filepath": doc.filepath,
                        "score": doc.score,
                        "exported_at": datetime.now(timezone.utc).isoformat(),
                    }
                    out.write(json.dumps(record, ensure_ascii=False) + "\n")
                token = result.pagination_token
                if not token:
                    break
    print(f"Wrote {MANIFEST_PATH}")


if __name__ == "__main__":
    asyncio.run(main())

Downstream jobs read matter_123_manifest.jsonl, copy each object to staging, and call Unstructured—the manifest is your contract between teams.
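
A downstream job might consume the manifest roughly as follows. This is a sketch that assumes the record fields written by the example above (doc_id, filename, filepath, score); the helper names and the ".elements.json" naming convention are hypothetical.

```python
from __future__ import annotations

import json
from collections.abc import Iterator
from pathlib import Path


def record_key(record: dict) -> str:
    """Prefer doc_id; fall back to filepath when doc_id is missing."""
    return str(record.get("doc_id") or record.get("filepath") or "")


def iter_manifest(manifest_path: Path) -> Iterator[dict]:
    """Yield each manifest record once, skipping blank lines and duplicates."""
    seen: set[str] = set()
    with manifest_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            key = record_key(record)
            if not key or key in seen:
                continue
            seen.add(key)
            yield record


def output_path_for(record: dict, output_dir: Path) -> Path:
    """Name the Unstructured output after the record key for traceability."""
    safe = "".join(c if c.isalnum() or c in "-_." else "_" for c in record_key(record))
    return output_dir / f"{safe}.elements.json"
```

Writing each Unstructured response to output_path_for(record, output_dir) keeps the output file name derivable from the manifest line, which is what step 5 of the flow asks for.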

Checklist for implementers (use case 2)

  • Store the exact query string and queryUid (from the first response) in your audit log.
  • Agree who owns chain of custody for the case bucket.
  • Validate labeling in Unstructured output against your retention policy.
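
One way to capture the audit item above is an append-only JSON Lines log. The sketch below is an assumption about what such a record could contain, not a format defined by Gaia: the field names, the SHA-256 of the manifest, and both helper names are illustrative.

```python
from __future__ import annotations

import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def build_audit_record(query: str, query_uid: str, manifest_path: Path) -> dict:
    """Describe one search run: exact query, its UID, and a manifest digest."""
    digest = hashlib.sha256(manifest_path.read_bytes()).hexdigest()
    return {
        "query": query,
        "query_uid": query_uid,
        "manifest": manifest_path.name,
        "manifest_sha256": digest,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def append_audit_record(record: dict, audit_log: Path) -> None:
    """Append the record as one JSON line; earlier entries are left untouched."""
    with audit_log.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Hashing the manifest lets a reviewer confirm later that the file handed to downstream teams is the one produced by the logged query.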

Use case 3 — Hybrid “high-fidelity” parsing for messy PDFs

Goal: Keep Gaia as the primary finder (semantic search and /ask). When a user (or agent) opens a specific source document that is likely scanned or layout-heavy, send only that file through Unstructured’s hi_res (or similar) strategy for better table and layout extraction, then feed that text into refinement, a second LLM call, or a custom UI.

When to use it: Chat answers are good, but financial tables, forms, or scanned PDFs need deeper parsing than chunk snippets alone.

Flow

  1. User asks Gaia /ask (or you run exhaustive search first).
  2. From response.documents, pick one doc_id / path the user cares about.
  3. Fetch the full file from your storage layer (same as use case 1).
  4. POST file to Unstructured with strategy=hi_res (or the option your Unstructured plan supports—check their docs).
  5. Pass extracted text into /ask/refine with selected doc IDs, or into your own summarization step.
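
Step 4 could look like the sketch below. It reuses the header name and the "files" multipart field from the use case 1 example and assumes the partition endpoint reads the strategy from a form field named "strategy"; confirm both against the current Unstructured docs. The helper names and the 300-second timeout are hypothetical.

```python
from __future__ import annotations

from pathlib import Path


def partition_form_fields(strategy: str = "hi_res", **extra: str) -> dict[str, str]:
    """Build the non-file form fields for a partition request."""
    if not strategy.strip():
        raise ValueError("strategy must be a non-empty string")
    return {"strategy": strategy, **extra}


async def partition_with_strategy(
    file_path: Path, api_key: str, api_url: str, strategy: str = "hi_res"
) -> list | dict:
    """POST one file with an explicit strategy and return the parsed JSON."""
    import httpx  # third-party; imported here so the helper above stays stdlib-only

    headers = {"unstructured-api-key": api_key}
    data = partition_form_fields(strategy)
    # hi_res is slower than text-only parsing, so allow a generous timeout.
    async with httpx.AsyncClient(timeout=300.0) as client:
        with file_path.open("rb") as f:
            files = {"files": (file_path.name, f, "application/octet-stream")}
            response = await client.post(api_url, headers=headers, data=data, files=files)
    response.raise_for_status()
    return response.json()
```

Keeping the form fields in a separate function makes it easy to unit-test the request shape without calling the API.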

Example: refine after enriching one document (conceptual)

You already have query_uid from the initial /ask. Refine & Feedback explains refine() in the SDK. Unstructured does not replace refine—it enriches what you know about that file before you ask Gaia to focus.

Python
import asyncio

from gaia_sdk import GaiaClient

async def ask_then_refine_with_enriched_context():
    async with GaiaClient.from_env() as gaia:
        first = await gaia.ask(
            ["contracts"],
            "What is the liability cap in the vendor agreement?",
        )
        query_uid = first.query_uid
        if not query_uid:
            raise RuntimeError("Gaia /ask returned no query_uid; cannot refine")

        # 1) Obtain full file bytes for the top document (your storage layer).
        # 2) Run Unstructured hi_res → long_text
        long_text = "... text from Unstructured ..."

        refined = await gaia.refine(
            query_uid=query_uid,
            dataset_names=["contracts"],
            query=(
                "Using the following extracted text, cite the liability cap verbatim:\n\n"
                + long_text[:80000]
            ),
            doc_ids=[d.doc_id for d in (first.documents or []) if d.doc_id],
        )
        print(refined.response_string)


if __name__ == "__main__":
    asyncio.run(ask_then_refine_with_enriched_context())

In production you would truncate or chunk long_text to fit model limits and policy.
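
A minimal chunking sketch for that production step, assuming a character budget stands in for the model's real token limit (which this page does not specify). The function name chunk_text and the default of 8000 characters are illustrative.

```python
from __future__ import annotations


def chunk_text(text: str, max_chars: int = 8000) -> list[str]:
    """Split text into chunks of at most max_chars, preferring paragraph breaks."""
    if max_chars < 1:
        raise ValueError("max_chars must be positive")
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Hard-split any single paragraph that exceeds the budget on its own.
        pieces = [paragraph[i:i + max_chars] for i in range(0, len(paragraph), max_chars)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent in its own refine call, or only the chunks that mention the relevant clause can be kept. Characters are only a rough proxy for tokens, so leave headroom.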

Checklist for implementers (use case 3)

  • Measure latency: Unstructured hi_res is slower than plain-text extraction.
  • Avoid sending secrets in refined prompts—redact before combining with extracted text.
  • Confirm token limits for refine and for your LLM.
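
The redaction item above could start from a pattern-based pass like this one. The three patterns (email addresses, AWS-style access key IDs, bearer tokens) are illustrative assumptions only; the actual list has to come from your security and compliance teams, and regexes alone will not catch every secret.

```python
from __future__ import annotations

import re

# Illustrative patterns only; extend or replace them for your environment.
REDACTION_PATTERNS: dict[str, re.Pattern[str]] = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "AWS_ACCESS_KEY": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "BEARER_TOKEN": re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/-]+=*"),
}


def redact(text: str) -> tuple[str, dict[str, int]]:
    """Replace each match with a labeled marker and count hits per label."""
    counts: dict[str, int] = {}
    for label, pattern in REDACTION_PATTERNS.items():
        text, hits = pattern.subn(f"[REDACTED_{label}]", text)
        if hits:
            counts[label] = hits
    return text, counts
```

Running redact() on the Unstructured text before building the refine prompt, and logging only the counts, gives a record of what was removed without storing the sensitive values.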

Security and data handling

  • API keys belong in environment variables or a secrets manager—never in frontend code or public repos.
  • Staging buckets should be encrypted and scoped (least privilege).
  • Unstructured may process PII—follow your DPA and data residency rules.


Unstructured is a trademark of its respective owner. This guide is informational and does not imply partnership or endorsement.