PRIVACY BY DESIGN • SESSION-BASED ARCHITECTURE
Document Intelligence Architecture
Stateless document processing with ephemeral memory architecture. Transform documents for AI chatbots,
RAG pipelines, and semantic search, with configurable cloud deployment options.
This architecture is documented as a reusable blueprint for multi-modal, privacy-first document
intelligence systems.
Interactive Architecture Diagrams
Click and explore the system architecture. These diagrams are dynamically rendered using Mermaid.js.
RAG Pipeline Sequence
sequenceDiagram
participant U as User
participant API as Flask API
participant R as Smart Router
participant E as Embedder
participant V as Vector Store
participant LLM as OpenRouter
U->>API: Upload Document
API->>API: Parse & Convert to MD
API->>E: Chunk & Embed
E->>V: Store Vectors
V-->>API: Indexed ✓
U->>API: Ask Question
API->>E: Embed Query
E->>V: Similarity Search
V-->>API: Top-K Chunks
API->>R: Analyze Complexity
R-->>API: Model Selection
API->>LLM: Context + Query
LLM-->>U: Streamed Response
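The sequence above can be sketched end-to-end in a few lines. The following is a minimal, self-contained illustration of the flow, not the production code: `embed` is a toy stand-in for a real embedding model (e.g. Jina), and `VectorStore` stands in for ChromaDB.

```python
import math

def embed(text):
    # Toy stand-in for a real embedding model: a normalized bag-of-letters
    # vector, just enough to make similarity search runnable in this sketch.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def chunk(markdown, size=200):
    # Naive fixed-size chunking of the parsed Markdown.
    return [markdown[i:i + size] for i in range(0, len(markdown), size)]

class VectorStore:
    # Stand-in for the vector store (ChromaDB in the demo).
    def __init__(self):
        self.rows = []  # (chunk_text, vector) pairs

    def add(self, chunks):
        self.rows += [(c, embed(c)) for c in chunks]

    def top_k(self, query, k=3):
        q = embed(query)
        scored = [(sum(a * b for a, b in zip(q, v)), c) for c, v in self.rows]
        return [c for _, c in sorted(scored, reverse=True)[:k]]

# Upload -> parse to Markdown -> chunk & embed -> store
store = VectorStore()
store.add(chunk("# Refund Policy\nRefunds are issued within 14 days of purchase."))

# Question -> embed query -> similarity search -> context assembled for the LLM
context = store.top_k("How long do refunds take?", k=1)
prompt = f"Context:\n{context[0]}\n\nQuestion: How long do refunds take?"
```

The retrieved chunks are then sent, together with the query, to the selected model via OpenRouter.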
Smart Router Decision Logic
flowchart TD
A[User Query] --> B{Complexity Analysis}
B -->|Simple Q&A| C[Flash Model]
B -->|Complex Reasoning| D[Pro Model]
B -->|Technical Code| E[Specialized Model]
C --> F{Domain Profile?}
D --> F
E --> F
F -->|Legal| G[Legal Prompt]
F -->|Medical| H[Medical Prompt]
F -->|Technical| I[Tech Prompt]
F -->|General| J[Base Prompt]
G --> K[OpenRouter Gateway]
H --> K
I --> K
J --> K
K --> L[Response + Metrics]
style A fill:#eff6ff,stroke:#2563eb
style C fill:#d1fae5,stroke:#10b981
style D fill:#fef3c7,stroke:#f59e0b
style E fill:#ede9fe,stroke:#8b5cf6
style L fill:#f0fdf4,stroke:#10b981
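The routing logic in the flowchart can be sketched as two classifiers: one picks the model tier by query complexity, the other picks the system prompt by domain. The thresholds, keyword lists, and model names below are illustrative assumptions, not the production heuristics.

```python
MODELS = {"simple": "flash", "complex": "pro", "code": "specialized"}  # illustrative names

PROMPTS = {
    "legal": "Legal Prompt", "medical": "Medical Prompt",
    "technical": "Tech Prompt", "general": "Base Prompt",
}

def classify_complexity(query):
    # Heuristic stand-in for the real complexity analyzer.
    if "```" in query or "def " in query or "SELECT " in query.upper():
        return "code"
    if len(query.split()) > 25 or "why" in query.lower():
        return "complex"
    return "simple"

def detect_domain(query):
    # Keyword lists here are toy examples of a domain-profile classifier.
    q = query.lower()
    if any(w in q for w in ("contract", "clause", "liability")):
        return "legal"
    if any(w in q for w in ("diagnosis", "dosage", "symptom")):
        return "medical"
    if any(w in q for w in ("api", "stack trace", "deploy")):
        return "technical"
    return "general"

def route(query):
    # Complexity picks the model; domain profile picks the system prompt.
    return MODELS[classify_complexity(query)], PROMPTS[detect_domain(query)]

model, prompt = route("What does the liability clause mean?")
```

Both outputs then travel together to the OpenRouter gateway, which is what lets model selection stay independent of prompt selection.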
GCP Infrastructure Map
graph TB
subgraph Internet
U[Users]
GH[GitHub Actions]
end
subgraph GCP["Google Cloud Platform"]
LB[Cloud Load Balancer]
subgraph Compute["Compute Engine"]
VM[e2-micro VM]
GUNICORN[Gunicorn Workers]
FLASK[Flask App]
end
subgraph Security["Security Layer"]
FW[Firewall Rules]
SA[Service Account]
end
end
subgraph External["External Services"]
OR[OpenRouter API]
EMBED[Jina Embeddings]
end
U --> LB
LB --> FW
FW --> VM
VM --> GUNICORN
GUNICORN --> FLASK
FLASK --> OR
FLASK --> EMBED
GH -->|SSH Deploy| SA
SA --> VM
style LB fill:#4285f4,stroke:#1a73e8,color:#fff
style VM fill:#34a853,stroke:#0f9d58,color:#fff
style OR fill:#8b5cf6,stroke:#7c3aed,color:#fff
Stage 1: Multimodal Ingest & Processing
System Flow
Router (Text vs Vision) → … → Cleanup (Ephemeral Storage)
Requests are processed in-memory; for async workflows, transient cache entries persist for at most 30 or 10
minutes (depending on the workflow) before they are wiped.
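The wipe behaviour described above can be sketched as a TTL-keyed in-memory cache; the class name and TTL values here are illustrative, not the production implementation.

```python
import time

class EphemeralCache:
    """In-memory cache whose entries expire after a per-entry TTL.

    Sketch of session-scoped ephemeral storage; TTLs are configurable
    per workflow (e.g. 30 min for async jobs, 10 min otherwise).
    """

    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def put(self, key, value, ttl_seconds):
        self._data[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._data[key]  # lazy wipe on access
            return None
        return value

    def sweep(self):
        # Periodic wipe of everything past its TTL.
        now = time.monotonic()
        for key in [k for k, (_, exp) in self._data.items() if now >= exp]:
            del self._data[key]

cache = EphemeralCache()
cache.put("session-123", {"doc": "parsed.md"}, ttl_seconds=600)
```

A background sweeper calling `sweep()` guarantees expiry even for keys that are never read again.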
This architecture demonstrates a scalable, secure, and observable system for processing
documents and generating insights using Large Language Models.
Stage 2: RAG Preparation & Retrieval
Data preparation for AI chatbots, semantic search, and retrieval-augmented generation. Supports 100k+
document estates with multi-tenant collections and idempotent re-indexing.
- ChromaDB: collection ready
- PostgreSQL: pgvector ready
Supports large document estates, multi-tenant indexing, and idempotent re-runs, so enterprises can plug
this stage into existing data lakes, catalogs, and governance workflows.
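One way to make re-indexing idempotent, sketched below, is to derive each chunk's vector-store ID deterministically from tenant, document path, and content, so re-runs upsert in place instead of duplicating. The `Collection` class is a minimal stand-in for a ChromaDB or pgvector collection with upsert semantics.

```python
import hashlib

def chunk_id(tenant, doc_path, chunk_text):
    # Deterministic ID: the same tenant + document + content always maps
    # to the same key, so re-indexing becomes an idempotent upsert.
    raw = f"{tenant}:{doc_path}:{chunk_text}".encode()
    return hashlib.sha256(raw).hexdigest()[:32]

class Collection:
    # Stand-in for a multi-tenant vector collection with upsert semantics.
    def __init__(self):
        self.docs = {}

    def upsert(self, ids, texts):
        self.docs.update(zip(ids, texts))

def index(collection, tenant, doc_path, chunks):
    ids = [chunk_id(tenant, doc_path, c) for c in chunks]
    collection.upsert(ids, chunks)
    return ids

col = Collection()
chunks = ["Refunds within 14 days.", "Contact support by email."]
index(col, "acme", "policies/refund.md", chunks)
index(col, "acme", "policies/refund.md", chunks)  # re-run: no duplicates
```

Including the tenant in the ID also keeps collections isolated when multiple tenants index identical documents.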
Stage 3: RAG Inference & Generation
Real-time retrieval and answer generation pipeline
Inference Flow
… → Vector Search (Top-K Chunks) → …
SQL Intelligence Layer
Bring Your Own Database (BYOD) with natural language queries. Universal SQL Builder supports multiple
ingestion strategies with read-only security enforcement.
SQL Sandbox Flow
Format Detection (Universal SQL Builder) → SQLite Instance (Read-Only Mode) → SQL Agent (LangChain + LLM) → Results + SQL (Glass Box AI)
- Native SQLite (.db, .sqlite, .sqlite3): opened directly in read-only URI mode.
- SQL Dump Rehydration (.sql): temporary database built with executescript().
- Spreadsheet Translation (.csv, .xlsx): normalized through Pandas into SQLite.
🛡️ Security Boundaries
✓ Read-Only URI: ?mode=ro&uri=true
✓ 50 MB upload limit
✓ 5 s query timeout
✓ 1000-row result limit
✓ No DML (INSERT/UPDATE/DELETE blocked)
✓ Session-scoped ephemeral storage
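Several of these boundaries can be enforced directly through Python's sqlite3 module, as sketched below: the read-only URI, a query deadline via a progress handler, a row cap on results, and an authorizer that denies write actions. The 50 MB upload limit would live in the upload handler and is omitted here.

```python
import sqlite3
import time

# Write-type authorizer actions to deny (belt and braces on top of mode=ro).
DENIED = {sqlite3.SQLITE_INSERT, sqlite3.SQLITE_UPDATE, sqlite3.SQLITE_DELETE,
          sqlite3.SQLITE_CREATE_TABLE, sqlite3.SQLITE_DROP_TABLE}

def open_sandbox(db_path, max_rows=1000):
    # Read-only at the driver level via the SQLite URI mode.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    # Deny any write action at statement-prepare time.
    conn.set_authorizer(
        lambda action, *rest: sqlite3.SQLITE_DENY if action in DENIED
        else sqlite3.SQLITE_OK)

    def run(sql, timeout_s=5.0):
        # Abort long-running queries via a progress-handler deadline.
        deadline = time.monotonic() + timeout_s
        conn.set_progress_handler(
            lambda: 1 if time.monotonic() > deadline else 0, 10_000)
        try:
            return conn.execute(sql).fetchmany(max_rows)  # row-result cap
        finally:
            conn.set_progress_handler(None, 0)

    return run
```

The authorizer rejects DML before execution even starts, so a compromised SQL Agent cannot mutate the sandboxed database.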
Engineering Pipelines
CI/CD Automation Pipeline (Designed and Implemented by Me)
I designed and implemented a fully automated CI/CD pipeline that takes MegaDoc from commit to production
on GCP with DevSecOps (gitleaks, bandit, safety), quality gates, and zero-downtime deploys. This is the same
pipeline I use for my own projects and can adapt for client or employer environments.
CI/CD Flow
Code Quality (Lint + Security) → … → CD: Deploy (GCP VM + nginx)
Every push to main triggers a GitHub Actions workflow that runs security scans (gitleaks, bandit, safety),
quality gates, and smoke tests before deploying to GCP VM with nginx reverse proxy and Let's Encrypt SSL.
GitHub Platform Features
- Multi-Stage Workflow: Build → Test → Security → Deploy.
- Branch Protection: required reviews and passing status checks.
- GitHub Environments: manual approval for Production.
- Automated Versioning: Semantic Release based on commit messages.
Jira & Confluence Integration
- Jira Sync: PRs validate issue keys and update Jira status.
- PR Linking: every PR must link to a Jira ticket.
- Build Summaries: CI status posted to Jira as comments.
- Release Logs: Confluence page updated automatically.
Zero-Downtime Deployment
- Staging Gate: Staging must pass before the Production deploy.
- PID-Based Management: graceful process restarts with health checks.
- HTTPS Auto-Config: Let's Encrypt with auto-renewal.
Governance & Auditability
- Full Traceability: Commit → PR → Jira → Build → Deploy links.
- Policy as Code: security configs version-controlled and validated.
Zero Trust Security & Governance
Defense in Depth: in the current demo, the Zero Trust controls are simulated and
validated at the design level; the reference deployment uses Cloud Armor, Istio, and mTLS on GKE Autopilot.
Edge Gateway → Model Curator → Output
Infrastructure
Network:
- Cloud Armor (WAF)
- DDoS Protection
- Private VPC
Zero Trust:
- mTLS (Istio Mesh)
- Service Identity
- Rate Limiting
Application & AI
Input Guard:
- Prompt Injection
- PII Redaction
- Magic Byte Check
Output Guard:
- Hallucination Check
- Toxicity Filter
- Citation Verify
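Of the input guards above, the Magic Byte Check can be sketched as a content-signature sniff that ignores the client-supplied extension; the signature table below is an illustrative subset, not the full production list.

```python
# Known file signatures ("magic bytes"); an illustrative subset.
MAGIC = {
    b"%PDF-": "pdf",
    b"PK\x03\x04": "zip-based (docx/xlsx/pptx)",
    b"\x89PNG\r\n\x1a\n": "png",
    b"SQLite format 3\x00": "sqlite",
}

def sniff(payload, claimed_ext):
    # Accept an upload only when its leading bytes match a known signature,
    # regardless of the extension the client claims.
    for sig, kind in MAGIC.items():
        if payload.startswith(sig):
            return kind
    raise ValueError(f"content does not match a supported type (claimed: {claimed_ext})")
```

This blocks the classic trick of renaming an executable or script to `.pdf` to slip past extension-based filters.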
Compliance
Data:
- Encryption at Rest
- Ephemeral Storage (TTL-based)
- Data Sovereignty
Audit:
Immutable logs of all AI decisions and access events.
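One common way to make such logs tamper-evident (a sketch of the pattern, not necessarily the deployed mechanism) is a hash chain, where every entry commits to its predecessor's hash, so altering any record invalidates all records after it.

```python
import hashlib
import json

class AuditLog:
    """Append-only log where each entry hashes its predecessor."""

    def __init__(self):
        self.entries = []

    def append(self, event):
        prev = self.entries[-1]["hash"] if self.entries else "genesis"
        body = json.dumps({"event": event, "prev": prev}, sort_keys=True)
        self.entries.append({"event": event, "prev": prev,
                             "hash": hashlib.sha256(body.encode()).hexdigest()})

    def verify(self):
        # Recompute the chain; any edited entry breaks every later hash.
        prev = "genesis"
        for e in self.entries:
            body = json.dumps({"event": e["event"], "prev": prev}, sort_keys=True)
            if e["prev"] != prev or e["hash"] != hashlib.sha256(body.encode()).hexdigest():
                return False
            prev = e["hash"]
        return True

log = AuditLog()
log.append({"actor": "svc-router", "action": "model_selected", "model": "flash"})
log.append({"actor": "svc-guard", "action": "pii_redacted"})
```

Anchoring the latest hash in external storage (e.g. a separate bucket) turns tamper-evidence into tamper-detection even if the log store itself is compromised.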
Technology Stack
The public demo uses lightweight SQLite and GCS for hosting, but the processing path is stateless and can be
swapped to fully ephemeral or enterprise data stores.
Backend
- Python 3.11+
- Flask
- MarkItDown
- SQLite
AI/NLP
- tiktoken
- scikit-learn
- sentence-transformers
- langdetect
Data
- ChromaDB / LanceDB (Demo)
- Qdrant / PGVector (Enterprise)
- GCS Hosting
- Embeddings
ChromaDB and LanceDB are used in the demo; Qdrant and pgvector/PostgreSQL are supported as
deployment targets for enterprise environments.
Security
- CSRF Protection
- Magic Byte File Validation
- Rate Limiting
OWASP Top 10 controls designed into the gateway layer. Air-Gap Ready via pluggable local-inference
path.
Infrastructure
- GCP Compute Engine
- nginx + Let's Encrypt
- GitHub Actions CI/CD
Target Enterprise Architecture (Reference)
Blueprint
This reference architecture is what I use to discuss trade-offs and adaptation paths when aligning with new
environments and constraints, and it demonstrates how the system scales to enterprise workloads (10k+ QPS).
⚡ Event-Driven Ingestion (Kafka + GKE)
I designed this ingestion backbone using Kafka + GKE to handle large, bursty document flows from
enterprise systems (SharePoint, S3, etc.).
Apache Kafka (Event Backbone) → Ingestion Pods (GKE Autopilot)
HA Inference Cluster (Istio + vLLM)
Zero-trust service mesh with auto-scaling inference endpoints
Istio Gateway (mTLS / Rate Limit) → …
Operational Excellence: Observability & FinOps
Full-Stack Observability
- Golden Signals: Latency (P95/P99), Error Rate, Traffic, Saturation.
- AI Metrics: Time-to-First-Token (TTFT), Cache Hit Rate, RAG Retrieval Score.
- Tracing: OpenTelemetry for end-to-end request tracing.
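Of the metrics above, Time-to-First-Token is the one specific to streamed LLM responses; it can be measured by wrapping the token stream, as in this sketch (the `fake_stream` generator is a stand-in for a real streaming endpoint).

```python
import time

def measure_ttft(token_stream):
    """Consume a token stream, recording Time-to-First-Token (TTFT).

    `token_stream` is any iterator of tokens (e.g. chunks of a streaming
    LLM response); returns (tokens, ttft_seconds).
    """
    start = time.monotonic()
    tokens, ttft = [], None
    for tok in token_stream:
        if ttft is None:
            ttft = time.monotonic() - start  # first token observed
        tokens.append(tok)
    return tokens, ttft

def fake_stream():
    # Stand-in for a streaming LLM endpoint.
    for tok in ["Hello", ", ", "world"]:
        yield tok

tokens, ttft = measure_ttft(fake_stream())
```

In production the wrapper would emit `ttft` to the metrics pipeline rather than return it, so dashboards can track P95/P99 TTFT alongside the Golden Signals.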
💰 FinOps & Cost Strategy
- Ingest Efficiency: Spot Instances for stateless worker nodes (up to 60% savings).
- Inference Scaling: Scale-to-Zero policies during off-peak hours.
- Token Optimization: Semantic Caching reduces cost by ~30%.
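Semantic caching can be sketched as follows: embed each answered query, and on a new query return the cached answer when cosine similarity clears a threshold, skipping the LLM call entirely. The `embed` function and the 0.8 threshold below are toy assumptions for illustration.

```python
import math
import re

def embed(text):
    # Toy embedding: a bag-of-words count vector, just enough to make
    # cosine similarity runnable in this sketch.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w: words.count(w) for w in set(words)}

def cosine(a, b):
    dot = sum(a.get(w, 0) * b.get(w, 0) for w in set(a) | set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    # Return a cached answer when a new query is semantically close
    # to a previously answered one.
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # (query_vector, answer)

    def get(self, query):
        qv = embed(query)
        for vec, answer in self.entries:
            if cosine(qv, vec) >= self.threshold:
                return answer  # cache hit: no LLM call, no token spend
        return None

    def put(self, query, answer):
        self.entries.append((embed(query), answer))

cache = SemanticCache()
cache.put("how do refunds work", "Refunds are issued within 14 days.")
hit = cache.get("how do refunds work?")
```

Because near-duplicate questions hit the cache rather than the model, the saving scales with how repetitive real user traffic is.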