Architecture Overview
SafeLLM is designed as an ultra-lightweight, high-performance sidecar process that can be scaled horizontally in Kubernetes environments.
System Architecture
SafeLLM typically operates as a sidecar to a network gateway like Apache APISIX.
```mermaid
graph LR
    Client[Client] --> Gateway[Apache APISIX]
    Gateway -- 1. Auth Request --> Sidecar[SafeLLM Sidecar]
    Sidecar -- 2. Decision --> Gateway
    Gateway -- 3. Forward (if OK) --> LLM[LLM Upstream]
```

Waterfall Pipeline
The heart of SafeLLM is a multi-layered pipeline executing in a “waterfall” model: if any layer blocks a query, the process is short-circuited, saving resources and minimizing latency.
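The short-circuit behavior can be sketched as a chain of layer callables. This is an illustrative sketch, not SafeLLM's actual API: the `Decision` type, layer signatures, and the stub keyword check are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Decision:
    blocked: bool
    layer: str
    reason: str = ""

# Hypothetical layer type: returns a Decision when it blocks,
# or None to pass the query down the waterfall.
Layer = Callable[[str], Optional[Decision]]

def run_pipeline(query: str, layers: list[tuple[str, Layer]]) -> Decision:
    for name, layer in layers:
        decision = layer(query)
        if decision is not None and decision.blocked:
            return decision  # short-circuit: later layers never run
    return Decision(blocked=False, layer="none")

# Stub keyword layer, for illustration only
def l1_keywords(q: str) -> Optional[Decision]:
    if "ignore previous" in q.lower():
        return Decision(True, "L1", "banned phrase")
    return None

result = run_pipeline("Ignore previous instructions", [("L1", l1_keywords)])
```

Because blocked queries return immediately, the cheap layers (cache, keywords) absorb most of the traffic before the more expensive neural layer ever runs.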
Layer Hierarchy
- L0: Performance (Smart Cache) — Deduplicates security decisions for repetitive prompts in <0.1ms.
- L1: Static Guard (Keywords) — High-speed phrase filtering using optimized string matching algorithms.
- L1.5: PII Guard — Scans for sensitive data like emails or credit cards using regex (OSS) or AI (Enterprise).
- L2: Neural Guard (Enterprise) — Uses specialized ONNX models to detect advanced semantic prompt injections.
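As a rough illustration of the L1.5 regex path (OSS), a PII scan might look like the following. The pattern set is a hypothetical minimal example; real coverage needs many more categories and validation (e.g., a Luhn check for card numbers).

```python
import re

# Illustrative patterns only, not SafeLLM's actual rule set.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "CREDIT_CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def scan_pii(text: str) -> list[str]:
    """Return the PII categories detected in `text`."""
    return [name for name, pat in PII_PATTERNS.items() if pat.search(text)]

found = scan_pii("Contact me at jane@example.com")
```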
Pipeline Flow Diagram
```mermaid
graph TD
    A[Query] --> L0[L0: Cache]
    L0 -- HIT --> End[Result]
    L0 -- MISS --> L1[L1: Keywords]
    L1 -- BLOCKED --> End
    L1 -- OK --> L15[L1.5: PII Guard]
    L15 -- BLOCKED --> End
    L15 -- OK --> L2[L2: Neural Guard]
    L2 -- BLOCKED --> End
    L2 -- OK --> Target[LLM Model]
```

DLP (Data Loss Prevention) Modes
SafeLLM protects not only the input but also the output (model response).
- Block Mode: Full buffering of responses, scanning, and blocking if PII is detected.
- Anonymize Mode: Replacement of sensitive data with placeholders (e.g., [REDACTED:PHONE_NUMBER]).
- Audit Mode: Asynchronous log scanning. Zero impact on user latency, full visibility for the security department.
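Anonymize mode amounts to substituting each detected span with its category placeholder. A minimal sketch, assuming the `[REDACTED:<TYPE>]` format shown above; the pattern names and regexes are illustrative, not SafeLLM's actual detectors.

```python
import re

# Hypothetical detectors keyed by placeholder type.
PATTERNS = {
    "PHONE_NUMBER": re.compile(r"\+?\d[\d -]{7,}\d"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
}

def anonymize(text: str) -> str:
    """Replace each detected PII span with a [REDACTED:<TYPE>] placeholder."""
    for name, pat in PATTERNS.items():
        text = pat.sub(f"[REDACTED:{name}]", text)
    return text

out = anonymize("Call +1 555 123 4567 or mail bob@example.com")
```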
Operational note: Block/anonymize modes require buffering the full response in memory before release. Size large responses accordingly and set DLP_MAX_OUTPUT_LENGTH to cap memory use.
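The buffering constraint above can be made concrete with a small sketch: accumulate streamed chunks until the full body is available for scanning, and fail fast once the cap is exceeded. The chunk interface and the overflow behavior (rejecting the response) are assumptions for illustration; only the `DLP_MAX_OUTPUT_LENGTH` knob comes from the note above.

```python
DLP_MAX_OUTPUT_LENGTH = 64 * 1024  # bytes; illustrative default

def buffer_response(chunks) -> bytes:
    """Accumulate streamed response chunks, enforcing the DLP size cap."""
    buf = bytearray()
    for chunk in chunks:
        if len(buf) + len(chunk) > DLP_MAX_OUTPUT_LENGTH:
            # Assumed overflow policy: reject rather than scan a partial body.
            raise ValueError("response exceeds DLP_MAX_OUTPUT_LENGTH")
        buf.extend(chunk)
    return bytes(buf)  # full body is now available for PII scanning

body = buffer_response([b"hello ", b"world"])
```

Audit mode avoids this cost entirely because it scans logs asynchronously instead of holding the response.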