Gaia + Unstructured.io Integration Patterns
This page is for developers who are new to these tools, and for AI agents automating documentation or code. It explains three practical ways to combine:
- Cohesity Gaia: semantic search and RAG over your indexed enterprise data (exhaustive search, document upload, /ask, etc.).
- Unstructured: turning PDFs, Office files, images, and more into structured elements (text blocks, table-friendly output) via their Partition API and ingestion pipelines.
There is no single “Gaia + Unstructured” product. You connect them with your own scripts or services (often Python). The examples below show real Gaia SDK calls and realistic Unstructured HTTP calls; copying files to Cohesity S3 or SmartFiles is something your organization must implement with the right APIs or IT automation—we mark that clearly.
How to read this page
| If you are… | Start here |
|---|---|
| New to APIs | Read Authentication, then Exhaustive Search. |
| Building a pipeline | Skim Big picture, then pick one use case and trace the numbered steps. |
| An AI agent | Use the Checklist for implementers at the end of each use case as a task list; environment variables are listed explicitly. |
Big picture: who does what?
```text
┌─────────────────┐     ┌──────────────────────────┐     ┌─────────────────────┐
│  Cohesity Gaia  │     │  Your staging storage    │     │  Unstructured API   │
│  (search / RAG) │ ──► │ (S3, SmartFiles, share)  │ ──► │ (partition / chunk) │
└─────────────────┘     └──────────────────────────┘     └─────────────────────┘
         │                           │                              │
         │                           │                              ▼
         │                           │                    JSON elements, chunks
         │                           │                    for DB, search, or
         │                           │                    upload back to Gaia
         ▼                           ▼
 docId, filepath,           Same bytes the                Optional: write text
 metadata from              business rules allow          back via Gaia upload
 exhaustive search
```
- Gaia answers: “Which documents match this idea?” and returns document identifiers and snippets.
- Staging storage answers: “Where can Unstructured read the full file bytes?” Often that means copying objects from a backup or share into a bucket or folder your job can access.
- Unstructured answers: “What text (and structure) is inside this file?” so downstream systems get clean, machine-friendly content.
Prerequisites (checklist)
- Gaia API key and base URL: see Authentication. Use GaiaClient.from_env() in Python.
- A Gaia dataset already indexed (your files or uploads).
- Unstructured API key: create one in the Unstructured platform; confirm the partition URL in their docs (host names can change).
- A plan for file access: Gaia returns metadata such as filename, filepath, and doc_id. Your team must map those to actual file copy operations (Cohesity APIs, object storage sync, etc.). The examples use a placeholder function copy_source_to_staging() so you can plug in org-specific logic.
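The prerequisites above can be checked before a job starts. The following is a minimal sketch, assuming the environment variable names used by the examples on this page; the helper name missing_env_vars is hypothetical and is not part of the Gaia SDK or of Unstructured.

```python
from __future__ import annotations

import os
from typing import Iterable, Mapping

# Variable names taken from the examples on this page; adjust to your setup.
REQUIRED_VARS = ("GAIA_API_KEY", "UNSTRUCTURED_API_KEY")


def missing_env_vars(
    required: Iterable[str] = REQUIRED_VARS,
    env: Mapping[str, str] | None = None,
) -> list[str]:
    """Return the required variable names that are unset or blank."""
    source = os.environ if env is None else env
    return [name for name in required if not source.get(name, "").strip()]
```

A pipeline could call missing_env_vars() at startup and stop with a clear message if the returned list is not empty, rather than failing later on the first API call.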
Use case 1: Search-driven, targeted ingestion
Goal: Run a natural-language or keyword search in Gaia to select only the files that matter, copy those files into a staging area (for example an S3 bucket or SmartFiles path that Unstructured can read), then partition them with Unstructured for a vector database, data lake, or a second Gaia dataset built from cleaned text.
When to use it: Product launches, audits, or migrations where processing every file in a share is too expensive, but processing Gaia-matched files is enough.
Flow
1. Call exhaustive search on one dataset and paginate until you have all hits.
2. Deduplicate by doc_id (or by path) so you do not process duplicates.
3. For each unique document, resolve how to read the original bytes (from backup, export, or copy job).
4. Copy bytes into staging (S3 prefix, NFS share, etc.).
5. Call Unstructured on each staged file (or use their batch ingestion against that prefix).
6. (Optional) Upload processed text or chunks back to Gaia with the document upload API if you want a derived searchable dataset.
Example: collect all search hits (Gaia) + stub staging + Unstructured partition
This example uses the Gaia SDK for exhaustive search. It uses httpx for Unstructured (same style as other examples in this repo). Replace copy_source_to_staging and the Unstructured URL with your real implementations.
```python
"""
Use case 1: Gaia exhaustive search → staging path → Unstructured partition.

Environment:
    GAIA_API_KEY, GAIA_BASE_URL (optional), GAIA_VERIFY_SSL (optional)
    UNSTRUCTURED_API_KEY
    UNSTRUCTURED_PARTITION_URL  # e.g. from Unstructured dashboard / docs

This script is educational: wire copy_source_to_staging() to your Cohesity / S3 workflow.
"""
from __future__ import annotations

import asyncio
import os
from pathlib import Path

import httpx

from gaia_sdk import GaiaClient

DATASET_NAME = "my-dataset"
SEARCH_QUERY = "quarterly revenue recognition policy"
STAGING_DIR = Path("./staging_unstructured")  # example: local folder Unstructured can read


async def fetch_all_documents(gaia: GaiaClient, dataset_name: str, query: str):
    """Page through exhaustive search until no pagination token remains."""
    token = None
    seen: set[str] = set()
    while True:
        result = await gaia.exhaustive_search(
            dataset_name=dataset_name,
            query=query,
            page_size=50,
            pagination_token=token,
        )
        batch = result.documents or []
        for doc in batch:
            # Deduplicate on the most specific identifier available.
            key = doc.doc_id or doc.filepath or doc.filename or ""
            if key and key not in seen:
                seen.add(key)
                yield doc
        token = result.pagination_token
        if not token:
            break


async def copy_source_to_staging(doc, staging: Path) -> Path | None:
    """
    Map a Gaia Document to a file on disk your pipeline can open.

    Real systems might:
    - call Cohesity / DataProtect APIs to export the object
    - copy from an SMB share using filepath metadata
    - download from S3 using a mapped key

    Return a Path to the staged file, or None if the file could not be obtained.
    """
    # --- Replace this block with your organization's logic ---
    if not doc.filename:
        return None
    staging.mkdir(parents=True, exist_ok=True)
    dest = staging / doc.filename.replace("/", "_")
    if not dest.exists():
        # Placeholder: write a small marker file so the rest of the script runs in dev.
        dest.write_text(
            f"# Placeholder for {doc.filename}\n# doc_id={doc.doc_id}\n",
            encoding="utf-8",
        )
    return dest


async def partition_with_unstructured(file_path: Path, api_key: str, api_url: str) -> dict:
    """Send one file to Unstructured's partition endpoint (multipart upload)."""
    headers = {"unstructured-api-key": api_key}
    async with httpx.AsyncClient(timeout=120.0) as client:
        with file_path.open("rb") as f:
            files = {"files": (file_path.name, f, "application/octet-stream")}
            response = await client.post(api_url, headers=headers, files=files)
    response.raise_for_status()
    return response.json()


async def main() -> None:
    api_url = os.environ.get(
        "UNSTRUCTURED_PARTITION_URL",
        "https://api.unstructured.io/general/v0/general",
    )
    unst_key = os.environ["UNSTRUCTURED_API_KEY"]
    async with GaiaClient.from_env() as gaia:
        async for doc in fetch_all_documents(gaia, DATASET_NAME, SEARCH_QUERY):
            staged = await copy_source_to_staging(doc, STAGING_DIR)
            if staged is None:
                continue
            try:
                elements = await partition_with_unstructured(staged, unst_key, api_url)
            except httpx.HTTPError as exc:
                print(f"[skip] Unstructured error for {staged.name}: {exc}")
                continue
            print(f"[ok] {doc.filename}: {len(elements) if isinstance(elements, list) else 'response'} elements")


if __name__ == "__main__":
    asyncio.run(main())
```
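Once a file has been partitioned, downstream systems usually want plain text rather than the raw element list. The sketch below assumes the response is a JSON list of objects with type and text keys, and that page furniture is labeled with types such as Header, Footer, and PageBreak; confirm both against the current Unstructured docs. The helper name elements_to_text is hypothetical.

```python
from __future__ import annotations

from typing import Any

# Element types treated as page furniture (an assumption; adjust as needed).
DEFAULT_SKIP_TYPES = frozenset({"Header", "Footer", "PageBreak"})


def elements_to_text(
    elements: list[dict[str, Any]],
    skip_types: frozenset[str] = DEFAULT_SKIP_TYPES,
) -> str:
    """Join the text of partitioned elements, one blank line between elements."""
    parts: list[str] = []
    for element in elements:
        if element.get("type") in skip_types:
            continue
        text = (element.get("text") or "").strip()
        if text:  # drop elements that carry no text
            parts.append(text)
    return "\n\n".join(parts)
```

The joined string is one possible input for a vector database loader or for an upload back to Gaia.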
Checklist for implementers (use case 1)
- Confirm Unstructured partition URL and headers match current docs.
- Replace placeholder copy_source_to_staging with a real Cohesity / S3 / share copy.
- Add logging, retries, and rate limits for both Gaia and Unstructured.
- If you upload results back to Gaia, associate uploads with a dataset per Document Upload.
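For the retry item in the checklist above, one common approach is exponential backoff with jitter around each remote call. This is a minimal sketch using only the standard library; with_retries and its parameters are hypothetical names, and which exceptions are safe to retry depends on your own error handling.

```python
from __future__ import annotations

import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_retries(
    call: Callable[[], Awaitable[T]],
    *,
    attempts: int = 4,
    base_delay: float = 0.5,
    retry_on: tuple[type[BaseException], ...] = (Exception,),
) -> T:
    """Await call(); on failure wait base_delay * 2**n plus jitter, then retry."""
    for attempt in range(attempts):
        try:
            return await call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("attempts must be at least 1")
```

In the use case 1 script, the partition call could be wrapped as with_retries(lambda: partition_with_unstructured(staged, unst_key, api_url), retry_on=(httpx.HTTPError,)). Rate limiting is a separate concern; an asyncio.Semaphore around the call is one simple option.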
Use case 2: Matter-scoped normalization (compliance / e-discovery style)
Goal: Treat Gaia exhaustive search as the manifest of “everything that matches this matter or regulatory query.” Copy those files to a case landing zone, then run Unstructured to produce consistent JSON or text for review tools, redaction pipelines, or long-term archive.
When to use it: You need completeness (not just the top 10 chunks): same search model as Exhaustive Search, but the consumer is a legal or compliance workflow, not a chat UI.
Flow
1. Run exhaustive search with a precisely worded queryString (and keep a record of the query for audit).
2. Export total_count and paginate through all pages.
3. Write a manifest file (CSV/JSON) listing doc_id, filename, filepath, score.
4. Copy files to the case bucket; run Unstructured in batch mode or per file.
5. Store Unstructured output with the manifest for traceability.
Example: build a JSON Lines manifest
```python
"""Build manifest.jsonl from exhaustive search (use case 2)."""
import asyncio
import json
from datetime import datetime, timezone

from gaia_sdk import GaiaClient

MANIFEST_PATH = "matter_123_manifest.jsonl"


async def main() -> None:
    async with GaiaClient.from_env() as gaia:
        token = None
        with open(MANIFEST_PATH, "w", encoding="utf-8") as out:
            while True:
                result = await gaia.exhaustive_search(
                    dataset_name="legal-hold",
                    query="Project Aurora vendor communications 2024",
                    page_size=100,
                    pagination_token=token,
                )
                for doc in result.documents or []:
                    record = {
                        "doc_id": doc.doc_id,
                        "filename": doc.filename,
                        "filepath": doc.filepath,
                        "score": doc.score,
                        "exported_at": datetime.now(timezone.utc).isoformat(),
                    }
                    # One JSON object per line keeps the manifest streamable.
                    out.write(json.dumps(record, ensure_ascii=False) + "\n")
                token = result.pagination_token
                if not token:
                    break
    print(f"Wrote {MANIFEST_PATH}")


if __name__ == "__main__":
    asyncio.run(main())
```
Downstream jobs read matter_123_manifest.jsonl, copy each object to staging, and call Unstructured—the manifest is your contract between teams.
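A downstream job might consume the manifest roughly as follows. This is a sketch under assumptions: the record keys match the manifest example above, duplicates are dropped by doc_id (falling back to filepath), and iter_manifest is a hypothetical helper name.

```python
from __future__ import annotations

import json
from typing import Iterable, Iterator


def iter_manifest(lines: Iterable[str]) -> Iterator[dict]:
    """Yield manifest records, skipping blank lines and repeated documents."""
    seen: set[str] = set()
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"manifest line {line_number}: invalid JSON") from exc
        key = record.get("doc_id") or record.get("filepath")
        if key:
            if key in seen:
                continue  # already yielded this document
            seen.add(key)
        # Records without any identifier are passed through so nothing is lost.
        yield record
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, for example iter_manifest(open("matter_123_manifest.jsonl", encoding="utf-8")) inside a with block.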
Checklist for implementers (use case 2)
- Store the exact query string and queryUid (from the first response) in your audit log.
- Agree who owns chain of custody for the case bucket.
- Validate labeling in Unstructured output against your retention policy.
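One way to approach the audit-log item above is to write a single JSON line that ties the query to a digest of the manifest it produced. The field names and the audit_entry helper below are illustrative assumptions, not a Gaia or Unstructured format.

```python
from __future__ import annotations

import hashlib
import json
from datetime import datetime, timezone


def audit_entry(query: str, query_uid: str, manifest_bytes: bytes) -> str:
    """Return one JSON line linking a query to the SHA-256 of its manifest."""
    entry = {
        "query": query,
        "query_uid": query_uid,
        "manifest_sha256": hashlib.sha256(manifest_bytes).hexdigest(),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(entry, ensure_ascii=False, sort_keys=True)
```

Recomputing the digest later and comparing it with the logged value gives reviewers a simple check that the manifest has not changed since the search ran.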
Use case 3: Hybrid “high-fidelity” parsing for messy PDFs
Goal: Keep Gaia as the primary finder (semantic search and /ask). When a user (or agent) opens a specific source document that is likely scanned or layout-heavy, send only that file through Unstructured’s hi-res (or similar) strategy for better tables and layout, then feed that text into refinement, a second LLM call, or a custom UI.
When to use it: Chat answers are good, but financial tables, forms, or scanned PDFs need deeper parsing than chunk snippets alone.
Flow
1. User asks Gaia /ask (or you run exhaustive search first).
2. From response.documents, pick one doc_id / path the user cares about.
3. Fetch the full file from your storage layer (same as use case 1).
4. POST the file to Unstructured with strategy=hi_res (or the option your Unstructured plan supports; check their docs).
5. Pass extracted text into /ask/refine with selected doc IDs, or into your own summarization step.
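Step 4 of the flow above could look like the sketch below. It assumes the same endpoint, header, and files field as the use case 1 example, and that the strategy form field accepts the values listed in KNOWN_STRATEGIES; verify both in the Unstructured docs for your plan. The function names are hypothetical, and httpx is imported inside the request function so the validation helper has no third-party dependency.

```python
from __future__ import annotations

from pathlib import Path

# Strategy names as commonly documented by Unstructured (an assumption; verify).
KNOWN_STRATEGIES = ("auto", "fast", "hi_res", "ocr_only")


def partition_form_fields(strategy: str = "hi_res") -> dict[str, str]:
    """Build the form fields sent alongside the file upload."""
    if strategy not in KNOWN_STRATEGIES:
        raise ValueError(f"unknown strategy {strategy!r}")
    return {"strategy": strategy}


async def partition_hi_res(file_path: Path, api_key: str, api_url: str) -> list[dict]:
    """POST one file with strategy=hi_res and return the parsed JSON elements."""
    import httpx  # third-party; imported lazily on purpose

    headers = {"unstructured-api-key": api_key, "accept": "application/json"}
    # hi_res parsing can be slow, so allow a longer timeout than for plain text.
    async with httpx.AsyncClient(timeout=300.0) as client:
        with file_path.open("rb") as handle:
            files = {"files": (file_path.name, handle, "application/octet-stream")}
            response = await client.post(
                api_url,
                headers=headers,
                files=files,
                data=partition_form_fields("hi_res"),
            )
    response.raise_for_status()
    return response.json()
```

Sending only the one document the user opened keeps the slower hi_res path off the main search flow.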
Example: refine after enriching one document (conceptual)
You already have query_uid from the initial /ask. Refine & Feedback explains refine() in the SDK. Unstructured does not replace refine—it enriches what you know about that file before you ask Gaia to focus.
```python
import asyncio

from gaia_sdk import GaiaClient


async def ask_then_refine_with_enriched_context():
    async with GaiaClient.from_env() as gaia:
        first = await gaia.ask(
            ["contracts"],
            "What is the liability cap in the vendor agreement?",
        )
        query_uid = first.query_uid
        assert query_uid
        # 1) Obtain full file bytes for the top document (your storage layer).
        # 2) Run Unstructured hi_res → long_text
        long_text = "... text from Unstructured ..."
        refined = await gaia.refine(
            query_uid=query_uid,
            dataset_names=["contracts"],
            query=(
                "Using the following extracted text, cite the liability cap verbatim:\n\n"
                + long_text[:80000]
            ),
            doc_ids=[d.doc_id for d in (first.documents or []) if d.doc_id],
        )
        print(refined.response_string)


asyncio.run(ask_then_refine_with_enriched_context())
```
In production you would truncate or chunk long_text to fit model limits and policy.
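As a rough illustration of the chunking option, the sketch below splits extracted text on paragraph breaks and caps each chunk at a character budget. It counts characters, not tokens, so the budget is only a proxy for a model limit; chunk_text and max_chars are hypothetical names.

```python
from __future__ import annotations


def chunk_text(text: str, max_chars: int = 8000) -> list[str]:
    """Split text into chunks of at most max_chars, preferring paragraph breaks."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Hard-split any single paragraph that is longer than the budget.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)  # current is full; start a new chunk
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be sent in its own refine call or summarized separately, depending on what your policy allows.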
Checklist for implementers (use case 3)
- Measure latency: Unstructured hi_res is slower than plain text.
- Avoid sending secrets in refined prompts—redact before combining with extracted text.
- Confirm token limits for refine and for your LLM.
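For the redaction item above, a pattern-based pass over the extracted text is one starting point. The patterns below are illustrative only and will miss many kinds of secrets and PII; REDACTION_PATTERNS and redact are hypothetical names, and a production pipeline would likely rely on a dedicated detection tool.

```python
from __future__ import annotations

import re

# Illustrative patterns only; extend them to match your organization's data.
REDACTION_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "AWS_ACCESS_KEY": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def redact(text: str) -> str:
    """Replace every match of each pattern with a labeled placeholder."""
    for label, pattern in REDACTION_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text
```

Running redact() on the Unstructured output before building the refine prompt keeps matched values out of both the request and any logs of it.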
Security and data handling
- API keys belong in environment variables or a secrets manager—never in frontend code or public repos.
- Staging buckets should be encrypted and scoped (least privilege).
- Unstructured may process PII—follow your DPA and data residency rules.
Related Gaia documentation
- Exhaustive Search
- Document Upload
- Refine & Feedback
- Gaia Client Library
- Gaia + LangChain (retriever)
External links (Unstructured)
Unstructured is a trademark of its respective owner. This guide is informational and does not imply partnership or endorsement.