Sensitive Data Handling¶
Enterprise data often contains personally identifiable information (PII), financial records, health data, and other sensitive content. Gaia provides a sensitive data policies system that can automatically detect, mask, or redact sensitive information in query results before it reaches your application.
How It Works¶
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Indexed Data   │────▶│    Gaia RAG     │────▶│ Sensitive Data  │────▶ Your App
│  (may contain   │     │     Engine      │     │  Policy Engine  │      (masked)
│  PII, SSNs,     │     │  (retrieval +   │     │ (detect + mask) │
│  credit cards)  │     │   generation)   │     │                 │
└─────────────────┘     └─────────────────┘     └─────────────────┘
```
When sensitive data policies are configured on a dataset:
- Gaia retrieves relevant documents as usual.
- Before generating the answer, the policy engine scans the retrieved context for sensitive patterns.
- Detected sensitive data is masked or redacted according to the active policy.
- The generated answer uses the masked context, so sensitive details never appear in the response.
Sensitive Data Categories¶
Gaia can detect and handle several categories of sensitive information:
| Category | Examples | Masking Behavior |
|---|---|---|
| PII | Names, email addresses, phone numbers, physical addresses | Replaced with [REDACTED-PII] |
| Financial | Credit card numbers, bank accounts, routing numbers | Replaced with [REDACTED-FINANCIAL] |
| Government IDs | SSNs, passport numbers, driver's license numbers | Replaced with [REDACTED-ID] |
| Health | Medical record numbers, diagnoses, prescription info | Replaced with [REDACTED-HEALTH] |
| Credentials | API keys, passwords, tokens | Replaced with [REDACTED-CREDENTIAL] |
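To make the detect-and-mask behavior concrete, here is a minimal, self-contained sketch. This is not Gaia's actual detection engine (which runs server-side and is policy-driven); the regex patterns and the `mask_sensitive` helper are simplistic stand-ins for illustration only:

```python
import re

# Illustrative patterns only. Gaia's real detectors are server-side
# and far more sophisticated than these simple regexes.
PATTERNS = {
    "[REDACTED-ID]": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),          # SSN-like
    "[REDACTED-FINANCIAL]": re.compile(r"\b(?:\d[ -]?){13,16}\b"),  # card-like
    "[REDACTED-PII]": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),   # email
}

def mask_sensitive(text: str) -> str:
    """Replace detected sensitive spans with their category tokens."""
    for token, pattern in PATTERNS.items():
        text = pattern.sub(token, text)
    return text

masked = mask_sensitive("Contact jane@example.com, SSN 123-45-6789.")
# -> "Contact [REDACTED-PII], SSN [REDACTED-ID]."
```

Because masking is applied to the retrieved context before generation, the LLM only ever sees the category tokens, never the underlying values.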
Policy granularity
The exact categories and masking behavior depend on the policies available in your Gaia deployment. Use the List Policies API to see what's available.
API: GET /sensitive-data/policies¶
List all available sensitive data handling policies.
Request:

Headers:

apiKey: <your-api-key>

Response:
```json
{
  "policies": [
    {
      "id": "pii-standard",
      "name": "Standard PII Protection",
      "description": "Detects and masks common PII including names, emails, phone numbers, and addresses.",
      "categories": ["PII"],
      "enabled": true
    },
    {
      "id": "financial-strict",
      "name": "Financial Data Protection",
      "description": "Detects and redacts financial account numbers, credit card numbers, and related data.",
      "categories": ["FINANCIAL", "GOVERNMENT_ID"],
      "enabled": true
    },
    {
      "id": "hipaa-compliance",
      "name": "HIPAA Compliance",
      "description": "Detects and redacts protected health information (PHI) as defined by HIPAA.",
      "categories": ["HEALTH", "PII"],
      "enabled": false
    }
  ]
}
```
Response Fields¶
| Field | Type | Description |
|---|---|---|
| id | string | Unique policy identifier |
| name | string | Human-readable policy name |
| description | string | What the policy protects |
| categories | string[] | Sensitive data categories covered |
| enabled | boolean | Whether the policy is currently active |
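If you want type safety on this response in a Python backend, a small Pydantic model mirroring the fields above works well. The model names here are our own, not part of the Gaia SDK:

```python
from pydantic import BaseModel

class SensitiveDataPolicy(BaseModel):
    """Mirrors one entry in the `policies` array."""
    id: str
    name: str
    description: str
    categories: list[str]
    enabled: bool

class PolicyListResponse(BaseModel):
    policies: list[SensitiveDataPolicy]

# Parse a raw API response into typed objects:
raw = {
    "policies": [
        {
            "id": "pii-standard",
            "name": "Standard PII Protection",
            "description": "Detects and masks common PII.",
            "categories": ["PII"],
            "enabled": True,
        }
    ]
}
parsed = PolicyListResponse(**raw)
```

Validation fails loudly if the upstream response shape changes, which is preferable to silently passing malformed dicts through to your frontend.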
Backend Implementation¶
Listing Policies¶
Using the SDK:
```python
from fastapi import APIRouter, Depends
from gaia_sdk import GaiaClient

from backend.api.dependencies import get_gaia_client

router = APIRouter()


@router.get("/sensitive-data/policies", tags=["Sensitive Data"])
async def list_policies(
    client: GaiaClient = Depends(get_gaia_client),
):
    """List available sensitive data handling policies."""
    policies = await client.list_sensitive_data_policies()
    return {"policies": policies}
```
Using raw httpx:
```python
import httpx


async def list_sensitive_data_policies(api_key: str) -> list[dict]:
    """Fetch all sensitive data policies from Gaia."""
    async with httpx.AsyncClient() as client:
        response = await client.get(
            "https://helios.cohesity.com/v2/mcm/gaia/sensitive-data/policies",
            headers={"apiKey": api_key},
        )
        response.raise_for_status()
        data = response.json()
        return data.get("policies", [])
```
Displaying Policy Status¶
Build a simple admin panel showing which policies are active:
```python
@router.get("/admin/sensitive-data", tags=["Admin"])
async def sensitive_data_dashboard(
    client: GaiaClient = Depends(get_gaia_client),
):
    """Return a summary of sensitive data policy status."""
    policies = await client.list_sensitive_data_policies()
    return {
        "total_policies": len(policies),
        "active_policies": [p for p in policies if p.get("enabled")],
        "inactive_policies": [p for p in policies if not p.get("enabled")],
    }
```
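Building on the dashboard, it can also be useful to report which sensitive data categories are covered by at least one active policy. A sketch of such a helper (a pure function, so it is easy to unit-test; the policy dicts are assumed to have the shape returned by the list endpoint):

```python
def covered_categories(policies: list[dict]) -> set[str]:
    """Return the categories covered by at least one enabled policy."""
    return {
        category
        for policy in policies
        if policy.get("enabled")
        for category in policy.get("categories", [])
    }

sample = [
    {"id": "pii-standard", "categories": ["PII"], "enabled": True},
    {"id": "hipaa-compliance", "categories": ["HEALTH", "PII"], "enabled": False},
]
# Only the enabled policy counts, so coverage here is {"PII"}.
```

Comparing this set against the categories your compliance regime requires gives you a quick gap check.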
Configuring Policies on Datasets¶
Sensitive data policies are typically configured at the dataset level through the Cohesity management interface or API. When creating or updating a dataset, you can specify which policies apply:
```python
import httpx


async def configure_dataset_policies(
    api_key: str,
    dataset_name: str,
    policy_ids: list[str],
) -> dict:
    """Apply sensitive data policies to a dataset."""
    async with httpx.AsyncClient() as client:
        response = await client.put(
            f"https://helios.cohesity.com/v2/mcm/gaia/dataset/{dataset_name}/sensitive-data",
            headers={
                "apiKey": api_key,
                "Content-Type": "application/json",
            },
            json={"policyIds": policy_ids},
        )
        response.raise_for_status()
        return response.json()
```
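Before applying policies, it is worth verifying that every requested policy ID actually exists in your deployment, since a typo would otherwise only surface as a server-side error. A minimal pre-flight check, assuming the policy dicts returned by the list endpoint:

```python
def validate_policy_ids(requested: list[str], available: list[dict]) -> list[str]:
    """Return the requested policy IDs that do not exist in `available`."""
    known = {policy["id"] for policy in available}
    return [pid for pid in requested if pid not in known]

available = [{"id": "pii-standard"}, {"id": "financial-strict"}]
missing = validate_policy_ids(["pii-standard", "no-such-policy"], available)
# missing == ["no-such-policy"]; raise or warn before calling the PUT endpoint
```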
Administrative operation
Configuring sensitive data policies requires administrative privileges on the Cohesity cluster. Regular API keys may not have permission to modify dataset policies. Contact your Cohesity administrator to enable policies.
Frontend: Policy Management UI¶
A read-only view of active policies, useful for compliance dashboards:
```tsx
// src/components/SensitiveDataPolicies.tsx
import { useEffect, useState } from "react";
import { api } from "../api/client";

interface Policy {
  id: string;
  name: string;
  description: string;
  categories: string[];
  enabled: boolean;
}

export function SensitiveDataPolicies() {
  const [policies, setPolicies] = useState<Policy[]>([]);
  const [loading, setLoading] = useState(true);

  useEffect(() => {
    api
      .get<{ policies: Policy[] }>("/sensitive-data/policies")
      .then((data) => setPolicies(data.policies))
      .catch(() => setPolicies([])) // show an empty list instead of an unhandled rejection
      .finally(() => setLoading(false));
  }, []);

  if (loading) return <p>Loading policies...</p>;

  return (
    <div className="space-y-4">
      <h2 className="text-lg font-bold">Sensitive Data Policies</h2>
      <div className="grid gap-3">
        {policies.map((policy) => (
          <div
            key={policy.id}
            className={`rounded-lg border p-4 ${
              policy.enabled
                ? "border-green-200 bg-green-50"
                : "border-gray-200 bg-gray-50"
            }`}
          >
            <div className="flex items-center justify-between">
              <h3 className="font-semibold">{policy.name}</h3>
              <span
                className={`rounded-full px-2 py-0.5 text-xs font-medium ${
                  policy.enabled
                    ? "bg-green-100 text-green-700"
                    : "bg-gray-100 text-gray-500"
                }`}
              >
                {policy.enabled ? "Active" : "Inactive"}
              </span>
            </div>
            <p className="text-sm text-gray-600 mt-1">{policy.description}</p>
            <div className="flex gap-1 mt-2">
              {policy.categories.map((cat) => (
                <span
                  key={cat}
                  className="rounded bg-gray-200 px-2 py-0.5 text-xs"
                >
                  {cat}
                </span>
              ))}
            </div>
          </div>
        ))}
      </div>
    </div>
  );
}
```
Best Practices¶
Defense in depth
Sensitive data policies are one layer of protection. Also consider:
- Access control — Use Gaia's security context to restrict which users can query which datasets.
- Audit logging — Log all queries and the datasets accessed for compliance review.
- Data minimization — Only index datasets that your application actually needs.
- Network security — Use HTTPS for all Gaia API calls and enforce TLS in production.
- Enable policies before indexing — Policies work best when applied before data is indexed, so sensitive patterns are caught during ingestion.
- Test with known PII — Upload test documents with known sensitive patterns to verify that masking works as expected.
- Monitor policy coverage — Regularly review which policies are active and whether they cover all required data categories for your compliance requirements.
- Don't rely solely on masking — Masking reduces exposure but isn't foolproof. Context clues in surrounding text may still reveal sensitive information. Combine with access controls and audit logging.
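The "test with known PII" practice above can be automated with a small assertion helper that scans generated answers for seed values you deliberately planted in test documents. The helper and seed values below are illustrative, not part of the Gaia SDK:

```python
def assert_masked(answer: str, seeded_values: list[str]) -> None:
    """Fail if any planted sensitive value leaks into a generated answer."""
    leaked = [value for value in seeded_values if value in answer]
    if leaked:
        raise AssertionError(f"Sensitive values leaked into answer: {leaked}")

# Example: values deliberately planted in a test document before indexing.
SEEDED = ["123-45-6789", "jane.doe@example.com"]
assert_masked("The SSN is [REDACTED-ID].", SEEDED)  # passes: nothing leaked
```

Running this check in CI against answers from a dedicated test dataset gives you an ongoing regression test for policy coverage.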
Next Steps¶
- Helper LLM Integration — Use a secondary LLM for additional content processing.
- Dataset Discovery — Explore what's in your datasets.
- Error Handling — Handle policy-related errors.