
Document Upload

Not all data lives in a Cohesity View or NAS share. The document upload API lets you push files directly into Gaia — ideal for ad-hoc documents, user uploads in web apps, or CI/CD pipelines that inject generated content into a dataset.


When to Use Document Upload

| Scenario | Use Upload API? |
|---|---|
| Ad-hoc files from users (drag-and-drop in a web UI) | Yes |
| Generated reports from a nightly pipeline | Yes |
| Large file share with thousands of files | No — use a data source in the dataset config |
| Cohesity-managed backup data | No — add the View as a data source |

Upload is best for small to moderate volumes of individual files. For large-scale ingestion, configure data sources directly in the dataset.


How It Works

Document upload is a two-step process:

sequenceDiagram
    participant Client
    participant Gaia

    Client->>Gaia: POST /upload-session
    Gaia-->>Client: { uploadSessionId: "sess-abc" }

    Client->>Gaia: POST /upload-file (file 1)
    Gaia-->>Client: 200 OK

    Client->>Gaia: POST /upload-file (file 2)
    Gaia-->>Client: 200 OK

    Note over Client,Gaia: Associate uploads with a dataset<br/>(create or update dataset config)

1. Create an upload session — returns an uploadSessionId that groups related uploads together.
2. Upload files — send each file as raw binary with metadata headers.

After uploading, the files need to be associated with a dataset (either during creation or by updating an existing dataset) so they become part of the searchable index.


Step 1: Create an Upload Session

POST /upload-session

Python
import asyncio
from gaia_sdk import GaiaClient

async def main():
    async with GaiaClient(api_key="YOUR_API_KEY") as gaia:
        session = await gaia.create_upload_session()
        print(f"Session ID: {session.upload_session_id}")

asyncio.run(main())
Bash
curl -s -X POST \
  -H "apiKey: $GAIA_API_KEY" \
  -H "Content-Type: application/json" \
  "https://helios.cohesity.com/v2/mcm/gaia/upload-session"

Response:

JSON
{
  "uploadSessionId": "sess-a1b2c3d4-e5f6-7890-abcd-ef1234567890"
}

Session lifetime

Upload sessions remain active for a limited time (typically 24 hours). Complete all file uploads within this window.
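
If a pipeline may outlive that window, one option is to track the session's age client-side and create a fresh session before the assumed expiry. A minimal sketch; SessionTracker is application code, not part of the SDK, and the TTL constants are illustrative:

Python
import time

# Assumed lifetime; the exact TTL may differ per deployment.
SESSION_TTL_HOURS = 24
SAFETY_MARGIN_HOURS = 1  # renew well before expiry

class SessionTracker:
    """Hands out an upload session ID, transparently creating a
    new session when the current one nears the assumed expiry."""

    def __init__(self, gaia):
        self._gaia = gaia
        self._sid = None
        self._created_at = 0.0
        self.all_session_ids = []  # every session used, for dataset association

    async def session_id(self) -> str:
        age_hours = (time.time() - self._created_at) / 3600
        if self._sid is None or age_hours > SESSION_TTL_HOURS - SAFETY_MARGIN_HOURS:
            session = await self._gaia.create_upload_session()
            self._sid = session.upload_session_id
            self._created_at = time.time()
            self.all_session_ids.append(self._sid)
        return self._sid

If a long batch rolls over to a new session this way, pass every ID in all_session_ids when associating the uploads with a dataset.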


Step 2: Upload Files

POST /upload-file

Send the file as raw binary in the request body with metadata in headers.

Required Headers

| Header | Description |
|---|---|
| X-Upload-Session-ID | The session ID from Step 1. |
| X-File-Name | The file name (e.g., report-q3.pdf). |
| X-File-Size | File size in bytes. |
| Content-Type | application/octet-stream |
Python
async with GaiaClient(api_key="YOUR_API_KEY") as gaia:
    session = await gaia.create_upload_session()

    result = await gaia.upload_file(
        session_id=session.upload_session_id,
        file_path="/path/to/report-q3.pdf",
    )
    print(f"Upload result: {result}")
Bash
curl -s -X POST \
  -H "apiKey: $GAIA_API_KEY" \
  -H "Content-Type: application/octet-stream" \
  -H "X-Upload-Session-ID: sess-a1b2c3d4-e5f6-7890-abcd-ef1234567890" \
  -H "X-File-Name: report-q3.pdf" \
  -H "X-File-Size: $(stat -f%z report-q3.pdf)" \
  --data-binary @report-q3.pdf \
  "https://helios.cohesity.com/v2/mcm/gaia/upload-file"

Custom file name

The SDK's upload_file() method defaults file_name to the local file's basename. Pass file_name="custom-name.pdf" to override — useful when the local filename is a temporary or auto-generated string.
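
For example (the temporary path and target name below are illustrative):

Python
# The local file carries an auto-generated temp name; store a stable one.
result = await gaia.upload_file(
    session_id=session.upload_session_id,
    file_path="/tmp/tmpx7f3q9.pdf",       # illustrative temp path
    file_name="finance-report-q3.pdf",    # name Gaia records for the file
)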


Uploading Multiple Files

Sequential Uploads

Python
from pathlib import Path
from gaia_sdk import GaiaClient

async def upload_directory(gaia: GaiaClient, directory: str):
    """Upload all supported files from a directory."""
    session = await gaia.create_upload_session()
    sid = session.upload_session_id

    files = list(Path(directory).glob("*"))
    print(f"Uploading {len(files)} files …")

    for i, file_path in enumerate(files, 1):
        if file_path.is_file():
            result = await gaia.upload_file(
                session_id=sid,
                file_path=str(file_path),
            )
            print(f"  [{i}/{len(files)}] {file_path.name}{result}")

    print("All uploads complete.")
    return sid

Concurrent Uploads

For better throughput, upload files concurrently using asyncio.gather:

Python
import asyncio
from pathlib import Path
from gaia_sdk import GaiaClient

async def upload_directory_concurrent(
    gaia: GaiaClient,
    directory: str,
    max_concurrent: int = 5,
):
    """Upload files concurrently with a concurrency limit."""
    session = await gaia.create_upload_session()
    sid = session.upload_session_id
    semaphore = asyncio.Semaphore(max_concurrent)

    async def upload_one(file_path: Path):
        async with semaphore:
            result = await gaia.upload_file(
                session_id=sid,
                file_path=str(file_path),
            )
            print(f"  Uploaded: {file_path.name}")
            return result

    files = [f for f in Path(directory).iterdir() if f.is_file()]
    print(f"Uploading {len(files)} files (max {max_concurrent} concurrent) …")
    await asyncio.gather(*(upload_one(f) for f in files))

    print("All uploads complete.")
    return sid

Concurrency limit

Keep max_concurrent between 3 and 10. Too many parallel uploads can overwhelm the server or trigger rate limiting.
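
Also note that asyncio.gather() as used above raises on the first failed upload. To collect per-file failures and keep going, pass return_exceptions=True and inspect the results (a sketch; the failure policy is up to the application):

Python
# Drop-in replacement for the gather call in upload_directory_concurrent:
results = await asyncio.gather(
    *(upload_one(f) for f in files), return_exceptions=True
)
# gather preserves input order, so results line up with files.
failures = [
    (f.name, r) for f, r in zip(files, results) if isinstance(r, Exception)
]
for name, err in failures:
    print(f"  Failed: {name} ({err})")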


Associating Uploads with a Dataset

Uploaded files aren't searchable until they're linked to a dataset. Include the uploadSessionId when creating or updating a dataset:

Python
async with GaiaClient(api_key="YOUR_API_KEY") as gaia:
    # Upload files
    session = await gaia.create_upload_session()
    await gaia.upload_file(session.upload_session_id, "/path/to/doc1.pdf")
    await gaia.upload_file(session.upload_session_id, "/path/to/doc2.pdf")

    # Create a dataset that includes the uploaded files
    await gaia.create_dataset(
        name="user-uploads-q3",
        description="Q3 reports uploaded by the finance team",
        uploadSessionIds=[session.upload_session_id],
    )

    # Trigger indexing to make the files searchable
    await gaia.trigger_indexing("user-uploads-q3")

You can also add uploaded files to an existing dataset by updating its configuration to include the new session ID.
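
A sketch of that flow. It assumes the SDK exposes an update_dataset() method taking the same uploadSessionIds field as create_dataset(), and that the object returned by get_dataset() carries its current session IDs; both are assumptions to verify against your SDK version:

Python
async with GaiaClient(api_key="YOUR_API_KEY") as gaia:
    # Upload the new files in a fresh session
    session = await gaia.create_upload_session()
    await gaia.upload_file(session.upload_session_id, "/path/to/doc3.pdf")

    # Assumed: read current config, append the new session, write it back
    dataset = await gaia.get_dataset("user-uploads-q3")
    session_ids = list(dataset.upload_session_ids) + [session.upload_session_id]
    await gaia.update_dataset(          # assumed method name
        name="user-uploads-q3",
        uploadSessionIds=session_ids,
    )

    # Re-index so the new files become searchable
    await gaia.trigger_indexing("user-uploads-q3")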


Supported File Formats

| Format | Extensions | Notes |
|---|---|---|
| PDF | .pdf | Text-based and scanned (OCR may apply). |
| Microsoft Word | .docx, .doc | |
| Microsoft Excel | .xlsx, .xls | Text from cells is extracted. |
| Microsoft PowerPoint | .pptx, .ppt | Slide text and notes are indexed. |
| Plain text | .txt, .md, .csv, .log | Ingested as-is. |
| HTML | .html, .htm | Tags are stripped; text content is indexed. |
| JSON / YAML | .json, .yaml, .yml | Serialized as text. |
| Email | .eml, .msg | Subject, body, and attachments are extracted. |

Size limits

Individual file uploads are subject to a maximum size (typically 100 MB per file). For larger files, split them or use a data source in the dataset configuration instead.


Complete Example: Upload → Create Dataset → Query

An end-to-end workflow that uploads local files, creates a dataset, indexes it, and runs a query.

Python
import asyncio
from pathlib import Path
from gaia_sdk import GaiaClient

DATASET_NAME = "meeting-notes-2025"
UPLOAD_DIR = "./meeting-notes"

async def main():
    async with GaiaClient(api_key="YOUR_API_KEY") as gaia:

        # ── Upload ─────────────────────────────────────────────
        session = await gaia.create_upload_session()
        sid = session.upload_session_id
        print(f"Upload session: {sid}")

        files = [f for f in Path(UPLOAD_DIR).iterdir() if f.is_file()]
        for f in files:
            await gaia.upload_file(sid, str(f))
            print(f"  Uploaded: {f.name}")

        # ── Create dataset ─────────────────────────────────────
        await gaia.create_dataset(
            name=DATASET_NAME,
            description="Weekly meeting notes — 2025",
            uploadSessionIds=[sid],
        )
        print(f"\nDataset '{DATASET_NAME}' created.")

        # ── Index ──────────────────────────────────────────────
        await gaia.trigger_indexing(DATASET_NAME)
        print("Indexing triggered. Polling …")

        while True:
            details = await gaia.get_dataset(DATASET_NAME)
            stats = details.indexing_stats or {}
            status = stats.get("status", "Unknown")
            print(f"  {status}{stats.get('indexedFiles', 0)}/{stats.get('totalFiles', 0)}")
            if status in ("Completed", "Failed"):
                break
            await asyncio.sleep(5)

        # ── Query ──────────────────────────────────────────────
        if status == "Completed":
            response = await gaia.ask(
                dataset_names=[DATASET_NAME],
                query="What decisions were made about the Q4 roadmap?",
            )
            print(f"\nAnswer:\n{response.response_string}")

asyncio.run(main())

Error Handling

Common upload errors and how to handle them:

| Error | Cause | Resolution |
|---|---|---|
| 400 Bad Request | Missing required headers or invalid session ID. | Verify X-Upload-Session-ID, X-File-Name, and X-File-Size are set. |
| 401 Unauthorized | Invalid or missing API key. | Check the apiKey header. |
| 413 Payload Too Large | File exceeds the size limit. | Split the file or use a data source instead. |
| 429 Too Many Requests | Rate limit exceeded. | Reduce concurrency or add exponential backoff. |

The SDK raises typed exceptions for each:

Python
import asyncio

from gaia_sdk import GaiaError, GaiaAuthError, GaiaRateLimitError

try:
    await gaia.upload_file(sid, "/path/to/large-file.pdf")
except GaiaRateLimitError:
    print("Rate limited — retrying after backoff …")
    await asyncio.sleep(10)
    await gaia.upload_file(sid, "/path/to/large-file.pdf")
except GaiaAuthError:
    print("Authentication failed. Check your API key.")
except GaiaError as e:
    print(f"Upload failed: {e} (HTTP {e.status_code})")

Best Practices

One session per logical batch

Group related files into a single upload session — e.g., all documents from one user submission or one pipeline run. This makes it easy to associate them with a dataset as a unit.

Validate before uploading

Check file extension and size client-side before calling the API. This avoids wasted bandwidth and provides faster feedback to users.
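
A minimal pre-flight check built from the format and size tables above (the 100 MB ceiling is the typical figure quoted earlier; adjust both constants to match your deployment):

Python
from pathlib import Path

# Extensions from the Supported File Formats table above.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".xlsx", ".xls", ".pptx", ".ppt",
    ".txt", ".md", ".csv", ".log", ".html", ".htm",
    ".json", ".yaml", ".yml", ".eml", ".msg",
}
MAX_FILE_SIZE = 100 * 1024 * 1024  # typical 100 MB limit; confirm for your deployment

def validate_upload(file_path: str) -> str | None:
    """Return a rejection reason, or None if the file looks uploadable."""
    path = Path(file_path)
    if not path.is_file():
        return f"{path.name}: not a regular file"
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        return f"{path.name}: unsupported extension {path.suffix!r}"
    if path.stat().st_size > MAX_FILE_SIZE:
        return f"{path.name}: larger than {MAX_FILE_SIZE // (1024 * 1024)} MB"
    return None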

Implement retry logic

Network interruptions happen. Wrap upload_file() in a retry loop with exponential backoff for production workloads.
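
A sketch of such a wrapper using the typed exceptions from the Error Handling section (the attempt count and delays are illustrative):

Python
import asyncio
import random

from gaia_sdk import GaiaAuthError, GaiaError

async def upload_with_retry(gaia, session_id, file_path, max_attempts=4):
    """Retry upload_file() with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await gaia.upload_file(
                session_id=session_id, file_path=file_path
            )
        except GaiaAuthError:
            raise  # bad credentials won't improve with retries; fail fast
        except GaiaError:
            if attempt == max_attempts:
                raise
            delay = 2 ** attempt + random.uniform(0, 1)  # 2s, 4s, 8s … + jitter
            print(f"  Attempt {attempt} failed; retrying in {delay:.1f}s")
            await asyncio.sleep(delay)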

Track progress for UIs

For web applications, track upload progress per file and display it to the user. The sequential upload pattern naturally supports a progress bar (file n of N).
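
One way to surface that: a small callback layered on the sequential pattern (upload_with_progress and on_progress are application code, not SDK methods):

Python
from pathlib import Path

async def upload_with_progress(gaia, session_id, paths, on_progress):
    """Upload files sequentially, invoking on_progress(done, total, name)
    after each file so a UI can render a progress bar."""
    total = len(paths)
    for done, path in enumerate(paths, 1):
        await gaia.upload_file(session_id=session_id, file_path=str(path))
        on_progress(done, total, Path(path).name)

# Usage: print a simple "file n of N" line per upload
# await upload_with_progress(
#     gaia, sid, files,
#     lambda done, total, name: print(f"[{done}/{total}] {name}"),
# )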

Don't forget to trigger indexing

Uploaded files are stored but not searchable until the dataset is indexed. Always call trigger_indexing() after associating uploads with a dataset.


What's Next