# Document Upload
Not all data lives in a Cohesity View or NAS share. The document upload API lets you push files directly into Gaia — ideal for ad-hoc documents, user uploads in web apps, or CI/CD pipelines that inject generated content into a dataset.
## When to Use Document Upload
| Scenario | Use Upload API? |
|---|---|
| Ad-hoc files from users (drag-and-drop in a web UI) | Yes |
| Generated reports from a nightly pipeline | Yes |
| Large file share with thousands of files | No — use a data source in the dataset config |
| Cohesity-managed backup data | No — add the View as a data source |
Upload is best for small to moderate volumes of individual files. For large-scale ingestion, configure data sources directly in the dataset.
## How It Works
Document upload is a two-step process:
```mermaid
sequenceDiagram
    participant Client
    participant Gaia
    Client->>Gaia: POST /upload-session
    Gaia-->>Client: { uploadSessionId: "sess-abc" }
    Client->>Gaia: POST /upload-file (file 1)
    Gaia-->>Client: 200 OK
    Client->>Gaia: POST /upload-file (file 2)
    Gaia-->>Client: 200 OK
    Note over Client,Gaia: Associate uploads with a dataset<br/>(create or update dataset config)
```

1. **Create an upload session** — returns an `uploadSessionId` that groups related uploads together.
2. **Upload files** — send each file as raw binary with metadata headers.

After uploading, the files need to be associated with a dataset (either during creation or by updating an existing dataset) so they become part of the searchable index.
## Step 1: Create an Upload Session

### POST /upload-session
Response:
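No request body is needed. Judging from the sequence diagram above, the response carries the new session ID in roughly this shape (the value is illustrative):

```json
{
  "uploadSessionId": "sess-a1b2c3d4-e5f6-7890-abcd-ef1234567890"
}
```

Save this ID — every subsequent `POST /upload-file` call must reference it in the `X-Upload-Session-ID` header.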
**Session lifetime.** Upload sessions remain active for a limited time (typically 24 hours). Complete all file uploads within this window.
## Step 2: Upload Files

### POST /upload-file
Send the file as raw binary in the request body with metadata in headers.
### Required Headers

| Header | Description |
|---|---|
| `X-Upload-Session-ID` | The session ID from Step 1. |
| `X-File-Name` | The file name (e.g., `report-q3.pdf`). |
| `X-File-Size` | File size in bytes. |
| `Content-Type` | Must be `application/octet-stream`. |
```bash
curl -s -X POST \
  -H "apiKey: $GAIA_API_KEY" \
  -H "Content-Type: application/octet-stream" \
  -H "X-Upload-Session-ID: sess-a1b2c3d4-e5f6-7890-abcd-ef1234567890" \
  -H "X-File-Name: report-q3.pdf" \
  -H "X-File-Size: $(stat -f%z report-q3.pdf)" \
  --data-binary @report-q3.pdf \
  "https://helios.cohesity.com/v2/mcm/gaia/upload-file"
```
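One portability note on the example above: `stat -f%z` is the BSD/macOS flag; GNU coreutils spells it `stat -c%s`. A form that works in both environments is `wc -c`:

```shell
# Compute the byte count portably for the X-File-Size header.
# (stat -f%z = BSD/macOS; stat -c%s = GNU/Linux; wc -c = both.)
printf 'hello' > /tmp/sample-upload.bin
wc -c < /tmp/sample-upload.bin
```

Substitute `$(wc -c < report-q3.pdf)` for the `stat` call if the script must run on both platforms.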
**Custom file name.** The SDK's `upload_file()` method defaults `file_name` to the local file's basename. Pass `file_name="custom-name.pdf"` to override — useful when the local filename is a temporary or auto-generated string.
## Uploading Multiple Files

### Sequential Uploads
```python
from pathlib import Path

from gaia_sdk import GaiaClient


async def upload_directory(gaia: GaiaClient, directory: str):
    """Upload all supported files from a directory."""
    session = await gaia.create_upload_session()
    sid = session.upload_session_id

    files = list(Path(directory).glob("*"))
    print(f"Uploading {len(files)} files …")

    for i, file_path in enumerate(files, 1):
        if file_path.is_file():
            result = await gaia.upload_file(
                session_id=sid,
                file_path=str(file_path),
            )
            print(f"  [{i}/{len(files)}] {file_path.name} — {result}")

    print("All uploads complete.")
    return sid
```
### Concurrent Uploads

For better throughput, upload files concurrently using `asyncio.gather`:
```python
import asyncio
from pathlib import Path

from gaia_sdk import GaiaClient


async def upload_directory_concurrent(
    gaia: GaiaClient,
    directory: str,
    max_concurrent: int = 5,
):
    """Upload files concurrently with a concurrency limit."""
    session = await gaia.create_upload_session()
    sid = session.upload_session_id
    semaphore = asyncio.Semaphore(max_concurrent)

    async def upload_one(file_path: Path):
        async with semaphore:
            result = await gaia.upload_file(
                session_id=sid,
                file_path=str(file_path),
            )
            print(f"  Uploaded: {file_path.name}")
            return result

    files = [f for f in Path(directory).iterdir() if f.is_file()]
    print(f"Uploading {len(files)} files (max {max_concurrent} concurrent) …")
    await asyncio.gather(*(upload_one(f) for f in files))
    print("All uploads complete.")
    return sid
```
**Concurrency limit.** Keep `max_concurrent` between 3 and 10. Too many parallel uploads can overwhelm the server or trigger rate limiting.
## Associating Uploads with a Dataset

Uploaded files aren't searchable until they're linked to a dataset. Include the `uploadSessionId` when creating or updating a dataset:
```python
async with GaiaClient(api_key="YOUR_API_KEY") as gaia:
    # Upload files
    session = await gaia.create_upload_session()
    await gaia.upload_file(session.upload_session_id, "/path/to/doc1.pdf")
    await gaia.upload_file(session.upload_session_id, "/path/to/doc2.pdf")

    # Create a dataset that includes the uploaded files
    await gaia.create_dataset(
        name="user-uploads-q3",
        description="Q3 reports uploaded by the finance team",
        uploadSessionIds=[session.upload_session_id],
    )

    # Trigger indexing to make the files searchable
    await gaia.trigger_indexing("user-uploads-q3")
```
You can also add uploaded files to an existing dataset by updating its configuration to include the new session ID.
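The update call itself isn't shown in this guide. As a sketch, assuming the client exposes an `update_dataset()` counterpart to `create_dataset()` (a hypothetical name and signature) and that dataset details carry their current session IDs in an `upload_session_ids` field (also an assumption), the merge logic might look like this; it only needs a client object with those methods:

```python
async def add_session_to_dataset(gaia, dataset_name: str, session_id: str):
    """Merge a new upload session into an existing dataset's configuration.

    Assumes get_dataset()/update_dataset()/trigger_indexing() on the client;
    update_dataset's shape here is hypothetical.
    """
    details = await gaia.get_dataset(dataset_name)
    ids = list(details.upload_session_ids or [])  # field name is an assumption
    if session_id not in ids:                     # avoid duplicate associations
        ids.append(session_id)
    await gaia.update_dataset(name=dataset_name, uploadSessionIds=ids)
    await gaia.trigger_indexing(dataset_name)     # re-index so new files become searchable
    return ids
```

Re-triggering indexing at the end matters: without it the newly associated files are stored but never enter the searchable index.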
## Supported File Formats

| Format | Extensions | Notes |
|---|---|---|
| PDF | `.pdf` | Text-based and scanned (OCR may apply). |
| Microsoft Word | `.docx`, `.doc` | |
| Microsoft Excel | `.xlsx`, `.xls` | Text from cells is extracted. |
| Microsoft PowerPoint | `.pptx`, `.ppt` | Slide text and notes are indexed. |
| Plain text | `.txt`, `.md`, `.csv`, `.log` | Ingested as-is. |
| HTML | `.html`, `.htm` | Tags are stripped; text content is indexed. |
| JSON / YAML | `.json`, `.yaml`, `.yml` | Serialized as text. |
| Email | `.eml`, `.msg` | Subject, body, and attachments are extracted. |
**Size limits.** Individual file uploads are subject to a maximum size (typically 100 MB per file). For larger files, split them or use a data source in the dataset configuration instead.
## Complete Example: Upload → Create Dataset → Query
An end-to-end workflow that uploads local files, creates a dataset, indexes it, and runs a query.
```python
import asyncio
from pathlib import Path

from gaia_sdk import GaiaClient

DATASET_NAME = "meeting-notes-2025"
UPLOAD_DIR = "./meeting-notes"


async def main():
    async with GaiaClient(api_key="YOUR_API_KEY") as gaia:
        # ── Upload ─────────────────────────────────────────────
        session = await gaia.create_upload_session()
        sid = session.upload_session_id
        print(f"Upload session: {sid}")

        files = [f for f in Path(UPLOAD_DIR).iterdir() if f.is_file()]
        for f in files:
            await gaia.upload_file(sid, str(f))
            print(f"  Uploaded: {f.name}")

        # ── Create dataset ─────────────────────────────────────
        await gaia.create_dataset(
            name=DATASET_NAME,
            description="Weekly meeting notes — 2025",
            uploadSessionIds=[sid],
        )
        print(f"\nDataset '{DATASET_NAME}' created.")

        # ── Index ──────────────────────────────────────────────
        await gaia.trigger_indexing(DATASET_NAME)
        print("Indexing triggered. Polling …")
        while True:
            details = await gaia.get_dataset(DATASET_NAME)
            stats = details.indexing_stats or {}
            status = stats.get("status", "Unknown")
            print(f"  {status} — {stats.get('indexedFiles', 0)}/{stats.get('totalFiles', 0)}")
            if status in ("Completed", "Failed"):
                break
            await asyncio.sleep(5)

        # ── Query ──────────────────────────────────────────────
        if status == "Completed":
            response = await gaia.ask(
                dataset_names=[DATASET_NAME],
                query="What decisions were made about the Q4 roadmap?",
            )
            print(f"\nAnswer:\n{response.response_string}")


asyncio.run(main())
```
## Error Handling
Common upload errors and how to handle them:
| Error | Cause | Resolution |
|---|---|---|
| 400 Bad Request | Missing required headers or invalid session ID. | Verify X-Upload-Session-ID, X-File-Name, and X-File-Size are set. |
| 401 Unauthorized | Invalid or missing API key. | Check the apiKey header. |
| 413 Payload Too Large | File exceeds the size limit. | Split the file or use a data source instead. |
| 429 Too Many Requests | Rate limit exceeded. | Reduce concurrency or add exponential backoff. |
The SDK raises typed exceptions for each:
```python
import asyncio

from gaia_sdk import GaiaError, GaiaAuthError, GaiaRateLimitError

try:
    await gaia.upload_file(sid, "/path/to/large-file.pdf")
except GaiaRateLimitError:
    print("Rate limited — retrying after backoff …")
    await asyncio.sleep(10)
    await gaia.upload_file(sid, "/path/to/large-file.pdf")
except GaiaAuthError:
    print("Authentication failed. Check your API key.")
except GaiaError as e:
    print(f"Upload failed: {e} (HTTP {e.status_code})")
```
## Best Practices
**One session per logical batch.** Group related files into a single upload session — e.g., all documents from one user submission or one pipeline run. This makes it easy to associate them with a dataset as a unit.
**Validate before uploading.** Check file extension and size client-side before calling the API. This avoids wasted bandwidth and provides faster feedback to users.
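A minimal pre-flight check might look like this. The extension list mirrors the supported-formats table above, and the 100 MB cap is the "typical" limit mentioned under Size limits — treat both as assumptions to adjust for your deployment:

```python
from pathlib import Path

# Extensions from the supported-formats table; adjust for your deployment.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".xlsx", ".xls", ".pptx", ".ppt",
    ".txt", ".md", ".csv", ".log", ".html", ".htm",
    ".json", ".yaml", ".yml", ".eml", ".msg",
}
MAX_FILE_SIZE = 100 * 1024 * 1024  # typical 100 MB per-file limit (assumption)


def validate_upload(path: str):
    """Return (ok, reason) before spending bandwidth on an upload."""
    p = Path(path)
    if not p.is_file():
        return False, f"{path} is not a file"
    if p.suffix.lower() not in SUPPORTED_EXTENSIONS:
        return False, f"unsupported extension: {p.suffix}"
    if p.stat().st_size > MAX_FILE_SIZE:
        return False, f"file exceeds {MAX_FILE_SIZE} bytes"
    return True, "ok"
```

Rejecting a bad file here costs microseconds; rejecting it server-side costs the full upload plus a round trip.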
**Implement retry logic.** Network interruptions happen. Wrap `upload_file()` in a retry loop with exponential backoff for production workloads.
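A generic backoff wrapper is sketched below. It is independent of the SDK: it retries any awaitable factory on a configurable exception type, so in practice you would pass `GaiaRateLimitError` (or a tuple of transient errors) as `retry_on`:

```python
import asyncio
import random


async def with_retries(make_call, *, retries=3, base_delay=1.0,
                       retry_on=(Exception,)):
    """Run make_call(), retrying with exponential backoff plus jitter."""
    for attempt in range(retries + 1):
        try:
            return await make_call()
        except retry_on:
            if attempt == retries:
                raise  # out of attempts — surface the error to the caller
            # 1s, 2s, 4s, … plus a little jitter to avoid thundering herds
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            await asyncio.sleep(delay)
```

Usage (illustrative): `await with_retries(lambda: gaia.upload_file(sid, path), retry_on=(GaiaRateLimitError,))`.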
**Track progress for UIs.** For web applications, track upload progress per file and display it to the user. The sequential upload pattern naturally supports a progress bar (file *n* of *N*).
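One lightweight way to drive that progress bar: wrap the per-file loop in a helper that takes the upload coroutine and a progress callback as parameters (both names below are illustrative, not part of the SDK):

```python
async def upload_with_progress(upload_one, files, on_progress):
    """Upload files one at a time, calling on_progress(done, total, name) after each."""
    total = len(files)
    for done, path in enumerate(files, 1):
        await upload_one(path)          # e.g. lambda p: gaia.upload_file(sid, p)
        on_progress(done, total, path)  # drive a "file n of N" progress bar
```

Because each file completes before the callback fires, `done`/`total` map directly onto a determinate progress bar without any extra bookkeeping.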
**Don't forget to trigger indexing.** Uploaded files are stored but not searchable until the dataset is indexed. Always call `trigger_indexing()` after associating uploads with a dataset.
## What's Next
- Project Structure — set up the recommended directory layout for your Gaia app.
- Querying & RAG — query your uploaded documents with Gaia's RAG pipeline.
- Datasets & Indexing — create and configure datasets to hold your uploaded documents.