Skip to content

Architecture Overview

SafeLLM is designed as an ultra-lightweight and high-performance process (sidecar) that can be scaled horizontally in Kubernetes environments.

SafeLLM typically operates as a sidecar to a network gateway like Apache APISIX.

graph LR
Client[Client] --> Gateway[Apache APISIX]
Gateway -- 1. Auth Request --> Sidecar[SafeLLM Sidecar]
Sidecar -- 2. Decision --> Gateway
Gateway -- 3. Forward (if OK) --> LLM[LLM Upstream]

The heart of SafeLLM is a multi-layered pipeline executing in a “waterfall” model. If any layer blocks a query, the process is short-circuited, saving resources and minimizing latency.

  • L0: Performance (Smart Cache) — Deduplicates security decisions for repetitive prompts in <0.1ms.
  • L1: Static Guard (Keywords) — High-speed phrase filtering using optimized string matching algorithms.
  • L1.5: PII Guard — Scans for sensitive data like emails or credit cards using regex (OSS) or AI (Enterprise).
  • L2: Neural Guard (Enterprise) — Uses specialized ONNX models to detect advanced semantic prompt injections.
graph TD
A[Query] --> L0[L0: Cache]
L0 -- HIT --> End[Result]
L0 -- MISS --> L1[L1: Keywords]
L1 -- BLOCKED --> End
L1 -- OK --> L15[L1.5: PII Guard]
L15 -- BLOCKED --> End
L15 -- OK --> L2[L2: Neural Guard]
L2 -- BLOCKED --> End
L2 -- OK --> Target[LLM Model]

SafeLLM protects not only the input but also the output (model response).

  1. Block Mode: Full buffering of responses, scanning, and blocking if PII is detected.
  2. Anonymize Mode: Replacement of sensitive data with placeholders (e.g., [REDACTED:PHONE_NUMBER]).
  3. Audit Mode: Asynchronous log scanning. Zero impact on user latency, full visibility for the security department.

Operational note: Block/anonymize modes require buffering the full response in memory before release. Size large responses accordingly and set DLP_MAX_OUTPUT_LENGTH to cap memory use.