Production Checklist

Use this checklist to verify your Gaia application is ready for production. Each section covers a critical area — work through them before your first deployment and revisit them before every major release.

How to use this checklist

Copy this page into your project's issue tracker or wiki and check off items as you complete them. Some items are must-haves for any production deployment; the rest are strongly recommended.


Security

  • API keys are stored as environment variables or secrets — never committed to version control.
  • .env files are in .gitignore — only .env.example files with placeholder values are committed.
  • CORS is restricted — ALLOW_CORS_ORIGIN is set to your specific frontend domain, not *.
  • HTTPS is enabled — all traffic between clients and your application is encrypted via TLS.
  • Backend is not directly exposed — the frontend nginx container proxies API calls; the backend port is not published to the public internet.
  • Rate limiting is configured — protect the backend from abuse with a middleware or reverse proxy rate limiter.
  • Input validation on all endpoints — Pydantic models validate request bodies; query parameters are typed.
  • Authentication on sensitive endpoints — session-based or token-based auth prevents unauthorized access to Gaia queries.
  • Security headers are set — X-Content-Type-Options, X-Frame-Options, Strict-Transport-Security via nginx or middleware.
  • Dependency vulnerabilities are scanned — run pip audit and npm audit in CI.
Python
# Example: rate limiting middleware with slowapi
from fastapi import FastAPI, Request
from pydantic import BaseModel
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

# Key requests by client IP
limiter = Limiter(key_func=get_remote_address)

app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

class AskRequest(BaseModel):
    question: str  # request body schema (field is illustrative)

@app.post("/api/ask")
@limiter.limit("10/minute")  # at most 10 requests per minute per client
async def ask_gaia(request: Request, body: AskRequest):
    ...

Performance

  • Connection pooling is enabled — the GaiaClient reuses HTTP connections via httpx.AsyncClient (already built into the SDK).
  • Response caching for repeated queries — cache frequent dataset listings and configuration responses.
  • Pagination is implemented — exhaustive search results use pagination tokens to avoid loading everything at once.
  • Frontend assets are cache-busted — Vite generates hashed filenames; nginx serves them with long cache TTLs.
  • Gzip compression is enabled — nginx compresses text responses.
  • Request timeout is tuned — REQUEST_TIMEOUT_SECONDS is set appropriately (60s for standard queries, higher for exhaustive search).
  • Database queries are indexed — if using SQLite for session storage, ensure frequently queried columns have indexes.
Nginx
# nginx gzip configuration
gzip on;
gzip_types text/plain text/css application/json application/javascript text/xml;
gzip_min_length 1000;

Reliability

  • Health check endpoints exist — both backend (/health) and frontend (nginx default) respond to health probes.
  • Graceful shutdown is handled — FastAPI shuts down cleanly on SIGTERM, completing in-flight requests.
  • Error responses are structured — all errors return consistent JSON with status, message, and detail fields.
  • Retry logic for transient failures — the SDK retries on 429 (rate limit) and 503 (service unavailable) with exponential backoff.
  • Circuit breaker for Gaia API — if the Gaia API is down, fail fast rather than accumulating timeouts.
  • Database migrations are automated — schema changes are applied on startup or via a migration tool.
  • Container restart policy is set — restart: unless-stopped in Docker Compose ensures recovery from crashes.
Python
# Example: structured error response
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

@app.exception_handler(Exception)
async def global_exception_handler(request: Request, exc: Exception):
    # Expose exception details only when running in debug mode
    return JSONResponse(
        status_code=500,
        content={
            "status": "error",
            "message": "An unexpected error occurred",
            "detail": str(exc) if app.debug else None,
        },
    )
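The retry item above (429/503 with exponential backoff) can be sketched with the standard library alone. This is an illustration of the pattern, not the SDK's actual internals; send_request and the delay values are assumptions:

```python
import random
import time

# Status codes worth retrying: rate limited and service unavailable
RETRYABLE = {429, 503}

def request_with_retry(send_request, max_attempts: int = 4, base_delay: float = 0.5):
    """Call send_request, retrying retryable statuses with exponential backoff."""
    for attempt in range(max_attempts):
        status, body = send_request()
        if status not in RETRYABLE:
            return status, body
        if attempt < max_attempts - 1:
            # Double the delay each attempt, plus jitter to avoid thundering herds
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
    return status, body  # exhausted retries; surface the last response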

Monitoring & Observability

  • Structured logging is configured — use JSON-formatted logs with request ID, timestamp, and severity level.
  • Request logging captures key metadata — log method, path, status code, duration, and user context for every request.
  • Error tracking is integrated — Sentry, Datadog, or similar captures unhandled exceptions with full stack traces.
  • Gaia API usage metrics are tracked — monitor query count, latency, token usage, and error rates.
  • Alerting is configured — get notified when error rates spike, latency exceeds thresholds, or health checks fail.
  • Log aggregation is set up — container logs are shipped to a centralized system (ELK, CloudWatch, Datadog).
Python
# Example: structured logging with structlog
import structlog

logger = structlog.get_logger()

@app.middleware("http")
async def log_requests(request: Request, call_next):
    import time
    start = time.perf_counter()
    response = await call_next(request)
    duration_ms = (time.perf_counter() - start) * 1000

    logger.info(
        "request_completed",
        method=request.method,
        path=request.url.path,
        status_code=response.status_code,
        duration_ms=round(duration_ms, 2),
    )
    return response
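Usage metrics can start as an in-process counter before graduating to Prometheus or Datadog. A minimal thread-safe sketch; the class and field names are illustrative:

```python
import threading
from collections import defaultdict

class QueryMetrics:
    """Thread-safe counters for query count, error count, and latency."""

    def __init__(self):
        self._lock = threading.Lock()
        self.counts = defaultdict(int)
        self.total_latency_ms = 0.0

    def record(self, status_code: int, duration_ms: float) -> None:
        with self._lock:
            self.counts["queries"] += 1
            if status_code >= 500:
                self.counts["errors"] += 1
            self.total_latency_ms += duration_ms

    def snapshot(self) -> dict:
        """Return current totals plus a derived average latency."""
        with self._lock:
            queries = self.counts["queries"]
            return {
                "queries": queries,
                "errors": self.counts["errors"],
                "avg_latency_ms": self.total_latency_ms / queries if queries else 0.0,
            }
```

The record call slots naturally into the request-logging middleware above; snapshot can back a /metrics endpoint for scraping.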

Scalability

  • Backend is stateless — no in-memory session state; all state is in SQLite/external database.
  • Session store can be externalized — when scaling beyond one backend instance, move sessions to Redis or PostgreSQL.
  • Load balancing is configured — if running multiple backend instances, a load balancer distributes traffic evenly.
  • Frontend is served from CDN — static assets are deployed to a CDN for global low-latency delivery.
  • Database can be upgraded — migration path from SQLite to PostgreSQL is documented for when traffic demands it.
  • Container resource limits are set — CPU and memory limits prevent a single container from consuming all host resources.
YAML
# docker-compose.yml resource limits
services:
  backend:
    deploy:
      resources:
        limits:
          cpus: "1.0"
          memory: 512M
        reservations:
          cpus: "0.25"
          memory: 128M
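Statelessness is easiest to preserve if all session access goes through one small interface from the start. An in-memory sketch whose two methods map directly onto Redis commands (SETEX/GET), so swapping the backing store later touches one class; names are illustrative:

```python
import time

class SessionStore:
    """In-memory store with the surface a Redis-backed version would keep."""

    def __init__(self):
        self._data = {}

    def set(self, session_id: str, value: str, ttl_seconds: float) -> None:
        # Redis equivalent: SETEX session_id ttl_seconds value
        self._data[session_id] = (value, time.monotonic() + ttl_seconds)

    def get(self, session_id: str):
        # Redis equivalent: GET session_id (Redis expires keys server-side)
        entry = self._data.get(session_id)
        if entry is None:
            return None
        value, expires = entry
        if time.monotonic() >= expires:
            del self._data[session_id]
            return None
        return value
```

When traffic demands a second backend instance, only this class needs a Redis or PostgreSQL implementation; handlers never touch process-local state directly.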

Pre-Deployment Final Checks

  • All environment variables are set — run the app with Settings() to validate configuration.
  • Docker images build successfully — docker compose build completes without errors.
  • Health checks pass — docker compose up shows both services as healthy.
  • End-to-end test passes — a query from the frontend reaches Gaia and returns a response.
  • SSL certificate is valid — check expiry and renewal automation.
  • Backup strategy is in place — SQLite database is backed up regularly if it stores important data.
  • Rollback plan is documented — you can revert to the previous version quickly if something goes wrong.
  • Runbook exists — common operational tasks (restart, scale, debug) are documented for the on-call team.
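The Settings() validation mentioned above can be run as a fail-fast startup check. A stdlib sketch assuming the variable names used elsewhere in this checklist; GAIA_API_KEY is a hypothetical name, and a pydantic-settings class would achieve the same with less code:

```python
import os

# ALLOW_CORS_ORIGIN and REQUEST_TIMEOUT_SECONDS appear earlier in this
# checklist; GAIA_API_KEY is an assumed name for the API credential.
REQUIRED_VARS = ["GAIA_API_KEY", "ALLOW_CORS_ORIGIN", "REQUEST_TIMEOUT_SECONDS"]

def validate_settings(env=None) -> dict:
    """Raise at startup if any required variable is missing or malformed."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    timeout = float(env["REQUEST_TIMEOUT_SECONDS"])
    if timeout <= 0:
        raise RuntimeError("REQUEST_TIMEOUT_SECONDS must be positive")
    return {"timeout": timeout}
```

Calling this in a startup hook turns a misconfigured deployment into an immediate crash with a clear message, rather than a runtime failure on the first real request.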

Quick Validation Script

Run this script to verify the most critical items programmatically:

Bash
#!/bin/bash
# validate-deployment.sh — Quick production readiness check

set -e

echo "=== Production Readiness Check ==="

# 1. Check that .env is not committed
if git ls-files --error-unmatch backend/.env >/dev/null 2>&1; then
    echo "FAIL: backend/.env is tracked by git!"
    exit 1
else
    echo "PASS: .env files not in version control"
fi

# 2. Build images
echo "Building Docker images..."
docker compose build --quiet
echo "PASS: Docker images built successfully"

# 3. Start services
echo "Starting services..."
docker compose up -d

# 4. Wait for health checks
echo "Waiting for services to become healthy..."
for i in $(seq 1 30); do
    if docker compose ps | grep -q "(healthy)" && \
       ! docker compose ps | grep -q "(health: starting)"; then
        echo "PASS: All services healthy"
        break
    fi
    if [ "$i" -eq 30 ]; then
        echo "FAIL: Services did not become healthy within 30 seconds"
        docker compose logs
        docker compose down
        exit 1
    fi
    sleep 1
done

# 5. Test health endpoint ("|| echo 000" keeps set -e from aborting on connection failure)
STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8000/health || echo "000")
if [ "$STATUS" -eq 200 ]; then
    echo "PASS: Backend health endpoint returns 200"
else
    echo "FAIL: Backend health endpoint returned $STATUS"
    docker compose down
    exit 1
fi

# 6. Test frontend
STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://localhost/ || echo "000")
if [ "$STATUS" -eq 200 ]; then
    echo "PASS: Frontend returns 200"
else
    echo "FAIL: Frontend returned $STATUS"
    docker compose down
    exit 1
fi

docker compose down
echo "=== Check complete ==="

Next Steps