Architecture Overview
SafeLLM is designed as an ultra-lightweight, high-performance sidecar process that can be scaled horizontally in Kubernetes environments.
System Architecture
SafeLLM typically operates as a sidecar to a network gateway like Apache APISIX.
```mermaid
graph LR
    Client[Client] --> Gateway[Apache APISIX]
    Gateway -- 1. Auth Request --> Sidecar[SafeLLM Sidecar]
    Sidecar -- 2. Decision --> Gateway
    Gateway -- 3. Forward (if OK) --> LLM[LLM Upstream]
```

Waterfall Pipeline
The heart of SafeLLM is a multi-layered pipeline executing in a “waterfall” model: if any layer blocks a query, the process is short-circuited, saving resources and minimizing latency.
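The short-circuit behavior can be sketched as a chain of layer callables. This is an illustrative sketch, not SafeLLM's actual API: the `Decision` type, layer signatures, and the stub keyword check are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Decision:
    blocked: bool
    layer: str
    reason: str = ""

# Hypothetical layer type: returns a Decision when it blocks,
# or None to pass the query down the waterfall.
Layer = Callable[[str], Optional[Decision]]

def run_pipeline(query: str, layers: list[tuple[str, Layer]]) -> Decision:
    for name, layer in layers:
        decision = layer(query)
        if decision is not None and decision.blocked:
            return decision  # short-circuit: later layers never run
    return Decision(blocked=False, layer="none")

# Stub keyword layer, for illustration only
def l1_keywords(q: str) -> Optional[Decision]:
    if "ignore previous" in q.lower():
        return Decision(True, "L1", "banned phrase")
    return None

result = run_pipeline("Ignore previous instructions", [("L1", l1_keywords)])
```

Because blocked queries return immediately, the cheap layers (cache, keywords) absorb most of the traffic before the more expensive neural layer ever runs.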
Layer Hierarchy
- L0: Performance (Smart Cache) — Deduplicates security decisions for repetitive prompts in <0.1ms.
- L1: Static Guard (Keywords) — High-speed phrase filtering using optimized string matching algorithms.
- L1.5: PII Guard — Scans for sensitive data like emails or credit cards using regex (OSS) or AI (Enterprise).
- L2: Neural Guard (Enterprise) — Uses specialized ONNX models to detect advanced semantic prompt injections.
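As a rough illustration of the L1.5 regex path (OSS), a PII scan might look like the following. The pattern set is a hypothetical minimal example; real coverage needs many more categories and validation (e.g., a Luhn check for card numbers).

```python
import re

# Illustrative patterns only, not SafeLLM's actual rule set.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "CREDIT_CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def scan_pii(text: str) -> list[str]:
    """Return the PII categories detected in `text`."""
    return [name for name, pat in PII_PATTERNS.items() if pat.search(text)]

found = scan_pii("Contact me at jane@example.com")
```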
Pipeline Flow Diagram
```mermaid
graph TD
    A[Query] --> L0[L0: Cache]
    L0 -- HIT --> End[Result]
    L0 -- MISS --> L1[L1: Keywords]
    L1 -- BLOCKED --> End
    L1 -- OK --> L15[L1.5: PII Guard]
    L15 -- BLOCKED --> End
    L15 -- OK --> L2[L2: Neural Guard]
    L2 -- BLOCKED --> End
    L2 -- OK --> Target[LLM Model]
```

DLP (Data Loss Prevention) Modes
SafeLLM protects not only the input but also the output (model response).
- Block Mode: Full buffering of responses, scanning, and blocking if PII is detected.
- Anonymize Mode: Replacement of sensitive data with placeholders (e.g., [REDACTED:PHONE_NUMBER]).
- Audit Mode: Asynchronous log scanning. Zero impact on user latency, full visibility for the security department.
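Anonymize mode amounts to substituting each detected span with its category placeholder. A minimal sketch, assuming the `[REDACTED:<TYPE>]` format shown above; the pattern names and regexes are illustrative, not SafeLLM's actual detectors.

```python
import re

# Hypothetical detectors keyed by placeholder type.
PATTERNS = {
    "PHONE_NUMBER": re.compile(r"\+?\d[\d -]{7,}\d"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
}

def anonymize(text: str) -> str:
    """Replace each detected PII span with a [REDACTED:<TYPE>] placeholder."""
    for name, pat in PATTERNS.items():
        text = pat.sub(f"[REDACTED:{name}]", text)
    return text

out = anonymize("Call +1 555 123 4567 or mail bob@example.com")
```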
Operational note: Block/anonymize modes require buffering the full response in memory before release. Size large responses accordingly and set DLP_MAX_OUTPUT_LENGTH to cap memory use.
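The buffering constraint above can be made concrete with a small sketch: accumulate streamed chunks until the full body is available for scanning, and fail fast once the cap is exceeded. The chunk interface and the overflow behavior (rejecting the response) are assumptions for illustration; only the `DLP_MAX_OUTPUT_LENGTH` knob comes from the note above.

```python
DLP_MAX_OUTPUT_LENGTH = 64 * 1024  # bytes; illustrative default

def buffer_response(chunks) -> bytes:
    """Accumulate streamed response chunks, enforcing the DLP size cap."""
    buf = bytearray()
    for chunk in chunks:
        if len(buf) + len(chunk) > DLP_MAX_OUTPUT_LENGTH:
            # Assumed overflow policy: reject rather than scan a partial body.
            raise ValueError("response exceeds DLP_MAX_OUTPUT_LENGTH")
        buf.extend(chunk)
    return bytes(buf)  # full body is now available for PII scanning

body = buffer_response([b"hello ", b"world"])
```

Audit mode avoids this cost entirely because it scans logs asynchronously instead of holding the response.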