
Document Upload

Not all data lives in a Cohesity View or NAS share. The document upload API lets you push files directly into Gaia — ideal for ad-hoc documents, user uploads in web apps, or CI/CD pipelines that inject generated content into a dataset.


When to Use Document Upload

| Scenario | Use Upload API? |
|---|---|
| Ad-hoc files from users (drag-and-drop in a web UI) | Yes |
| Generated reports from a nightly pipeline | Yes |
| Large file share with thousands of files | No — use a data source in the dataset config |
| Cohesity-managed backup data | No — add the View as a data source |

Upload is best for small to moderate volumes of individual files. For large-scale ingestion, configure data sources directly in the dataset.


How It Works

Document upload is a two-step process:

sequenceDiagram
    participant Client
    participant Gaia

    Client->>Gaia: POST /upload-session
    Gaia-->>Client: { uploadSessionId: "sess-abc" }

    Client->>Gaia: POST /upload-file (file 1)
    Gaia-->>Client: 200 OK

    Client->>Gaia: POST /upload-file (file 2)
    Gaia-->>Client: 200 OK

    Note over Client,Gaia: Associate uploads with a dataset<br/>(create or update dataset config)

1. Create an upload session — returns an uploadSessionId that groups related uploads together.
2. Upload files — send each file as raw binary with metadata headers.

After uploading, the files need to be associated with a dataset (either during creation or by updating an existing dataset) so they become part of the searchable index.


Step 1: Create an Upload Session

POST /upload-session

Python
import asyncio
from gaia_sdk import GaiaClient

async def main():
    async with GaiaClient(api_key="YOUR_API_KEY") as gaia:
        session = await gaia.create_upload_session()
        print(f"Session ID: {session.upload_session_id}")

asyncio.run(main())
Bash
curl -s -X POST \
  -H "apiKey: $GAIA_API_KEY" \
  -H "Content-Type: application/json" \
  "https://helios.cohesity.com/v2/mcm/gaia/upload-session"

Response:

JSON
{
  "uploadSessionId": "sess-a1b2c3d4-e5f6-7890-abcd-ef1234567890"
}

Session lifetime

Upload sessions remain active for a limited time (typically 24 hours). Complete all file uploads within this window.
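
If a pipeline may outlive that window, one option is to track the session's age client-side and create a fresh session before the assumed expiry. A minimal sketch; SessionTracker is application code, not part of the SDK, and the TTL constants are illustrative:

Python
import time

# Assumed lifetime; the exact TTL may differ per deployment.
SESSION_TTL_HOURS = 24
SAFETY_MARGIN_HOURS = 1  # renew well before expiry

class SessionTracker:
    """Hands out an upload session ID, transparently creating a
    new session when the current one nears the assumed expiry."""

    def __init__(self, gaia):
        self._gaia = gaia
        self._sid = None
        self._created_at = 0.0
        self.all_session_ids = []  # every session used, for dataset association

    async def session_id(self) -> str:
        age_hours = (time.time() - self._created_at) / 3600
        if self._sid is None or age_hours > SESSION_TTL_HOURS - SAFETY_MARGIN_HOURS:
            session = await self._gaia.create_upload_session()
            self._sid = session.upload_session_id
            self._created_at = time.time()
            self.all_session_ids.append(self._sid)
        return self._sid

If a long batch rolls over to a new session this way, pass every ID in all_session_ids when associating the uploads with a dataset.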


Step 2: Upload Files

POST /upload-file

Send the file as raw binary in the request body with metadata in headers.

Required Headers

| Header | Description |
|---|---|
| X-Upload-Session-ID | The session ID from Step 1. |
| X-File-Name | The file name (e.g., report-q3.pdf). |
| X-File-Size | File size in bytes. |
| Content-Type | application/octet-stream |
Python
async with GaiaClient(api_key="YOUR_API_KEY") as gaia:
    session = await gaia.create_upload_session()

    result = await gaia.upload_file(
        session_id=session.upload_session_id,
        file_path="/path/to/report-q3.pdf",
    )
    print(f"Upload result: {result}")
Bash
curl -s -X POST \
  -H "apiKey: $GAIA_API_KEY" \
  -H "Content-Type: application/octet-stream" \
  -H "X-Upload-Session-ID: sess-a1b2c3d4-e5f6-7890-abcd-ef1234567890" \
  -H "X-File-Name: report-q3.pdf" \
  -H "X-File-Size: $(stat -f%z report-q3.pdf)" \
  --data-binary @report-q3.pdf \
  "https://helios.cohesity.com/v2/mcm/gaia/upload-file"

Custom file name

The SDK's upload_file() method defaults file_name to the local file's basename. Pass file_name="custom-name.pdf" to override — useful when the local filename is a temporary or auto-generated string.
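
For example (the temporary path and target name below are illustrative):

Python
# The local file carries an auto-generated temp name; store a stable one.
result = await gaia.upload_file(
    session_id=session.upload_session_id,
    file_path="/tmp/tmpx7f3q9.pdf",       # illustrative temp path
    file_name="finance-report-q3.pdf",    # name Gaia records for the file
)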


Uploading Multiple Files

Sequential Uploads

Python
from pathlib import Path
from gaia_sdk import GaiaClient

async def upload_directory(gaia: GaiaClient, directory: str):
    """Upload all supported files from a directory."""
    session = await gaia.create_upload_session()
    sid = session.upload_session_id

    files = list(Path(directory).glob("*"))
    print(f"Uploading {len(files)} files …")

    for i, file_path in enumerate(files, 1):
        if file_path.is_file():
            result = await gaia.upload_file(
                session_id=sid,
                file_path=str(file_path),
            )
            print(f"  [{i}/{len(files)}] {file_path.name}{result}")

    print("All uploads complete.")
    return sid

Concurrent Uploads

For better throughput, upload files concurrently using asyncio.gather:

Python
import asyncio
from pathlib import Path
from gaia_sdk import GaiaClient

async def upload_directory_concurrent(
    gaia: GaiaClient,
    directory: str,
    max_concurrent: int = 5,
):
    """Upload files concurrently with a concurrency limit."""
    session = await gaia.create_upload_session()
    sid = session.upload_session_id
    semaphore = asyncio.Semaphore(max_concurrent)

    async def upload_one(file_path: Path):
        async with semaphore:
            result = await gaia.upload_file(
                session_id=sid,
                file_path=str(file_path),
            )
            print(f"  Uploaded: {file_path.name}")
            return result

    files = [f for f in Path(directory).iterdir() if f.is_file()]
    print(f"Uploading {len(files)} files (max {max_concurrent} concurrent) …")
    await asyncio.gather(*(upload_one(f) for f in files))

    print("All uploads complete.")
    return sid

Concurrency limit

Keep max_concurrent between 3 and 10. Too many parallel uploads can overwhelm the server or trigger rate limiting.
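
Also note that asyncio.gather() as used above raises on the first failed upload. To collect per-file failures and keep going, pass return_exceptions=True and inspect the results (a sketch; the failure policy is up to the application):

Python
# Drop-in replacement for the gather call in upload_directory_concurrent:
results = await asyncio.gather(
    *(upload_one(f) for f in files), return_exceptions=True
)
# gather preserves input order, so results line up with files.
failures = [
    (f.name, r) for f, r in zip(files, results) if isinstance(r, Exception)
]
for name, err in failures:
    print(f"  Failed: {name} ({err})")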


Associating Uploads with a Dataset

Uploaded files aren't searchable until they're linked to a dataset. Include the uploadSessionId when creating or updating a dataset:

Python
async with GaiaClient(api_key="YOUR_API_KEY") as gaia:
    # Upload files
    session = await gaia.create_upload_session()
    await gaia.upload_file(session.upload_session_id, "/path/to/doc1.pdf")
    await gaia.upload_file(session.upload_session_id, "/path/to/doc2.pdf")

    # Create a dataset that includes the uploaded files
    await gaia.create_dataset(
        name="user-uploads-q3",
        description="Q3 reports uploaded by the finance team",
        uploadSessionIds=[session.upload_session_id],
    )

    # Trigger indexing to make the files searchable
    await gaia.trigger_indexing("user-uploads-q3")

You can also add uploaded files to an existing dataset by updating its configuration to include the new session ID.
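
A sketch of that flow. It assumes the SDK exposes an update_dataset() method taking the same uploadSessionIds field as create_dataset(), and that the object returned by get_dataset() carries its current session IDs; both are assumptions to verify against your SDK version:

Python
async with GaiaClient(api_key="YOUR_API_KEY") as gaia:
    # Upload the new files in a fresh session
    session = await gaia.create_upload_session()
    await gaia.upload_file(session.upload_session_id, "/path/to/doc3.pdf")

    # Assumed: read current config, append the new session, write it back
    dataset = await gaia.get_dataset("user-uploads-q3")
    session_ids = list(dataset.upload_session_ids) + [session.upload_session_id]
    await gaia.update_dataset(          # assumed method name
        name="user-uploads-q3",
        uploadSessionIds=session_ids,
    )

    # Re-index so the new files become searchable
    await gaia.trigger_indexing("user-uploads-q3")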


Supported File Formats

| Format | Extensions | Notes |
|---|---|---|
| PDF | .pdf | Text-based and scanned (OCR may apply). |
| Microsoft Word | .docx, .doc | |
| Microsoft Excel | .xlsx, .xls | Text from cells is extracted. |
| Microsoft PowerPoint | .pptx, .ppt | Slide text and notes are indexed. |
| Plain text | .txt, .md, .csv, .log | Ingested as-is. |
| HTML | .html, .htm | Tags are stripped; text content is indexed. |
| JSON / YAML | .json, .yaml, .yml | Serialized as text. |
| Email | .eml, .msg | Subject, body, and attachments are extracted. |

Size limits

Individual file uploads are subject to a maximum size (typically 100 MB per file). For larger files, split them or use a data source in the dataset configuration instead.


Complete Example: Upload → Create Dataset → Query

An end-to-end workflow that uploads local files, creates a dataset, indexes it, and runs a query.

Python
import asyncio
from pathlib import Path
from gaia_sdk import GaiaClient

DATASET_NAME = "meeting-notes-2025"
UPLOAD_DIR = "./meeting-notes"

async def main():
    async with GaiaClient(api_key="YOUR_API_KEY") as gaia:

        # ── Upload ─────────────────────────────────────────────
        session = await gaia.create_upload_session()
        sid = session.upload_session_id
        print(f"Upload session: {sid}")

        files = [f for f in Path(UPLOAD_DIR).iterdir() if f.is_file()]
        for f in files:
            await gaia.upload_file(sid, str(f))
            print(f"  Uploaded: {f.name}")

        # ── Create dataset ─────────────────────────────────────
        await gaia.create_dataset(
            name=DATASET_NAME,
            description="Weekly meeting notes — 2025",
            uploadSessionIds=[sid],
        )
        print(f"\nDataset '{DATASET_NAME}' created.")

        # ── Index ──────────────────────────────────────────────
        await gaia.trigger_indexing(DATASET_NAME)
        print("Indexing triggered. Polling …")

        while True:
            details = await gaia.get_dataset(DATASET_NAME)
            stats = details.indexing_stats or {}
            status = stats.get("status", "Unknown")
            print(f"  {status}{stats.get('indexedFiles', 0)}/{stats.get('totalFiles', 0)}")
            if status in ("Completed", "Failed"):
                break
            await asyncio.sleep(5)

        # ── Query ──────────────────────────────────────────────
        if status == "Completed":
            response = await gaia.ask(
                dataset_names=[DATASET_NAME],
                query="What decisions were made about the Q4 roadmap?",
            )
            print(f"\nAnswer:\n{response.response_string}")

asyncio.run(main())

Error Handling

Common upload errors and how to handle them:

| Error | Cause | Resolution |
|---|---|---|
| 400 Bad Request | Missing required headers or invalid session ID. | Verify X-Upload-Session-ID, X-File-Name, and X-File-Size are set. |
| 401 Unauthorized | Invalid or missing API key. | Check the apiKey header. |
| 413 Payload Too Large | File exceeds the size limit. | Split the file or use a data source instead. |
| 429 Too Many Requests | Rate limit exceeded. | Reduce concurrency or add exponential backoff. |

The SDK raises typed exceptions for each:

Python
import asyncio

from gaia_sdk import GaiaError, GaiaAuthError, GaiaRateLimitError

try:
    await gaia.upload_file(sid, "/path/to/large-file.pdf")
except GaiaRateLimitError:
    print("Rate limited — retrying after backoff …")
    await asyncio.sleep(10)
    await gaia.upload_file(sid, "/path/to/large-file.pdf")
except GaiaAuthError:
    print("Authentication failed. Check your API key.")
except GaiaError as e:
    print(f"Upload failed: {e} (HTTP {e.status_code})")

Best Practices

One session per logical batch

Group related files into a single upload session — e.g., all documents from one user submission or one pipeline run. This makes it easy to associate them with a dataset as a unit.

Validate before uploading

Check file extension and size client-side before calling the API. This avoids wasted bandwidth and provides faster feedback to users.
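
A minimal pre-flight check built from the format and size tables above (the 100 MB ceiling is the typical figure quoted earlier; adjust both constants to match your deployment):

Python
from pathlib import Path

# Extensions from the Supported File Formats table above.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".xlsx", ".xls", ".pptx", ".ppt",
    ".txt", ".md", ".csv", ".log", ".html", ".htm",
    ".json", ".yaml", ".yml", ".eml", ".msg",
}
MAX_FILE_SIZE = 100 * 1024 * 1024  # typical 100 MB limit; confirm for your deployment

def validate_upload(file_path: str) -> str | None:
    """Return a rejection reason, or None if the file looks uploadable."""
    path = Path(file_path)
    if not path.is_file():
        return f"{path.name}: not a regular file"
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        return f"{path.name}: unsupported extension {path.suffix!r}"
    if path.stat().st_size > MAX_FILE_SIZE:
        return f"{path.name}: larger than {MAX_FILE_SIZE // (1024 * 1024)} MB"
    return None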

Implement retry logic

Network interruptions happen. Wrap upload_file() in a retry loop with exponential backoff for production workloads.
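
A sketch of such a wrapper using the typed exceptions from the Error Handling section (the attempt count and delays are illustrative):

Python
import asyncio
import random

from gaia_sdk import GaiaAuthError, GaiaError

async def upload_with_retry(gaia, session_id, file_path, max_attempts=4):
    """Retry upload_file() with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await gaia.upload_file(
                session_id=session_id, file_path=file_path
            )
        except GaiaAuthError:
            raise  # bad credentials won't improve with retries; fail fast
        except GaiaError:
            if attempt == max_attempts:
                raise
            delay = 2 ** attempt + random.uniform(0, 1)  # 2s, 4s, 8s … + jitter
            print(f"  Attempt {attempt} failed; retrying in {delay:.1f}s")
            await asyncio.sleep(delay)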

Track progress for UIs

For web applications, track upload progress per file and display it to the user. The sequential upload pattern naturally supports a progress bar (file n of N).
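
One way to surface that: a small callback layered on the sequential pattern (upload_with_progress and on_progress are application code, not SDK methods):

Python
from pathlib import Path

async def upload_with_progress(gaia, session_id, paths, on_progress):
    """Upload files sequentially, invoking on_progress(done, total, name)
    after each file so a UI can render a progress bar."""
    total = len(paths)
    for done, path in enumerate(paths, 1):
        await gaia.upload_file(session_id=session_id, file_path=str(path))
        on_progress(done, total, Path(path).name)

# Usage: print a simple "file n of N" line per upload
# await upload_with_progress(
#     gaia, sid, files,
#     lambda done, total, name: print(f"[{done}/{total}] {name}"),
# )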

Don't forget to trigger indexing

Uploaded files are stored but not searchable until the dataset is indexed. Always call trigger_indexing() after associating uploads with a dataset.


What's Next