L1.5: PII Shield

The L1.5 layer prevents personal data (PII) from being sent to LLMs and from leaking in their responses.

Depending on the version, SafeLLM offers two detection engines:

Regex (Fast, OSS default):

  • Technology: Optimized regular expressions with Luhn validation.
  • Purpose: Basic protection with very low latency (~1-2 ms).
  • Detected Entities:

| Entity Type | Pattern Description | Examples |
|---|---|---|
| EMAIL_ADDRESS | Standard email format | user@domain.com |
| PHONE_NUMBER | International/domestic formats | +48 500 600 700, (555) 123-4567 |
| CREDIT_CARD | Visa, MasterCard, Amex, Discover (with Luhn validation) | 4111-1111-1111-1111 |
| IP_ADDRESS | IPv4 addresses | 192.168.1.100 |
| IBAN_CODE | International Bank Account Numbers | DE89370400440532013000 |
| CRYPTO | Bitcoin, Ethereum addresses | 0x742d35Cc6634... |
| US_SSN | US Social Security Numbers | 123-45-6789 |
| POLISH_PESEL | Polish national ID (11 digits) | 90010112345 |
| POLISH_NIP | Polish tax ID (10 digits) | 123-456-78-90 |

The regex detector includes aggressive patterns to catch obfuscation attempts:

  • Credit cards with spaces between digits: 4 5 3 2 0 1 5 1...
  • SSNs with unusual separators: 1.2.3-4.5-6.7.8.9

Obfuscated patterns are validated with Luhn checksum for credit cards and SSA rules for SSNs to minimize false positives.
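
The validation step described above can be illustrated with a minimal sketch. This is not SafeLLM's actual implementation: the names `luhn_valid`, `ssn_plausible`, `find_cards`, and the `CARD_CANDIDATE` pattern are hypothetical, and the SSN check covers only the commonly documented SSA rules (no area 000, 666, or 9xx; no group 00; no serial 0000). The idea is to match loosely, strip separators, then keep only candidates that pass the checksum or structural rules.

```python
import re


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def ssn_plausible(digits: str) -> bool:
    """Apply basic SSA structural rules to a 9-digit string."""
    if len(digits) != 9 or not digits.isdigit():
        return False
    area, group, serial = digits[:3], digits[3:5], digits[5:]
    return (
        area not in ("000", "666")
        and area[0] != "9"
        and group != "00"
        and serial != "0000"
    )


# Hypothetical loose pattern: 13-19 digits, each optionally followed by
# a single space, dot, or hyphen (catches "4 5 3 2 ..." style obfuscation).
CARD_CANDIDATE = re.compile(r"(?<!\d)(?:\d[ .\-]?){12,18}\d(?!\d)")


def find_cards(text: str) -> list[str]:
    """Return normalized card numbers that match loosely AND pass Luhn."""
    hits = []
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"\D", "", match.group())  # drop separators
        if luhn_valid(digits):
            hits.append(digits)
    return hits
```

For example, `find_cards("4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1")` normalizes the spaced digits before validating them, while a 16-digit string that fails the checksum is discarded. The same normalize-then-validate approach applies to SSNs: `re.sub(r"\D", "", "1.2.3-4.5-6.7.8.9")` yields a 9-digit string that can be passed to `ssn_plausible`.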

GLiNER (Enterprise, paid):

  • Technology: GLiNER (Generalist Model for Named Entity Recognition) language model.
  • Purpose: Precise, context-aware detection with support for country-specific formats.
  • Advantages: Detects over 25 entity types, including Polish ones: PESEL, NIP, REGON, Identity Card.
  • Latency: ~20-25 ms on CPU.

| Variable | Description |
|---|---|
| ENABLE_L3_PII | Enables the PII layer. |
| USE_FAST_PII | true = Regex (OSS, default), false = GLiNER [Enterprise (Paid)]. |
| L3_PII_ENTITIES | List of entities to detect (e.g., ["EMAIL_ADDRESS", "POLISH_PESEL"]). |
| L3_PII_THRESHOLD | Confidence threshold for the AI model (default 0.7). |
| L3_PII_LANGUAGE | Language code for GLiNER analysis (default en). |

  • Input Filtering: To prevent users from sending sensitive data (like their own SSN or emails) to external LLM providers.
  • Privacy by Design: To ensure that PII is caught early in the pipeline, right after L1.
  • Hybrid Security: Use Regex (Fast) for common patterns and AI (GLiNER) for context-aware detection in regulated industries.

Known limitations:

  • Regex Limitations: Regex can be bypassed by creative formatting (e.g., “e-mail at domain dot com”). Use Enterprise GLiNER for better recall.
  • Resource Consumption: GLiNER inference is CPU-intensive. In high-traffic environments, ensure sufficient CPU cores are allocated to the sidecar pods.
  • Custom PII Length Limit: To prevent ReDoS attacks, custom regex patterns are skipped for texts longer than CUSTOM_FAST_PII_MAX_TEXT_LENGTH (default 20,000 chars). The limit is configurable so you can balance security and performance. Standard PII patterns are always scanned regardless of text length.
  • False Positives: Random strings that look like IDs (e.g., ACME-1234-5678) might be flagged as credit cards or SSNs. Use CUSTOM_FAST_PII_PATTERNS to define your own rules and reduce noise.

You can extend the PII detector by providing your own regular expressions for company-specific identifiers (e.g., Internal IDs, project codes).

To add custom patterns, use the CUSTOM_FAST_PII_PATTERNS environment variable. It accepts a JSON dictionary where the key is the entity name and the value is the regex pattern.

# Example: Adding internal ACME ID and Project Code
CUSTOM_FAST_PII_PATTERNS='{"ACME_ID": "ACME-[0-9]{4}", "PROJ_CODE": "PRJ-[A-Z]{3}"}'

To prevent ReDoS (Regular Expression Denial of Service) attacks, SafeLLM enforces several limits on custom regexes:

  1. Text Length Limit: Custom patterns are skipped for texts longer than CUSTOM_FAST_PII_MAX_TEXT_LENGTH (default: 20,000 chars).
  2. Pattern Count: A maximum of 50 custom patterns can be registered.
  3. Pattern Complexity: Each pattern can be at most 256 characters long.

# Example: regex mode with one custom pattern
ENABLE_L3_PII=true
USE_FAST_PII=true
L3_PII_ENTITIES=["EMAIL_ADDRESS", "PHONE_NUMBER", "ACME_ID"]
CUSTOM_FAST_PII_PATTERNS='{"ACME_ID": "ACME-[0-9]{4}"}'
CUSTOM_FAST_PII_MAX_TEXT_LENGTH=20000
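
How the three limits above might be enforced can be sketched as follows. This is an illustration under stated assumptions, not SafeLLM's code: `load_custom_patterns` and `scan_custom` are hypothetical names, and rejecting an oversized configuration with `ValueError` is an assumed behavior (the real service may skip or log offending patterns instead). Only the numeric limits come from the documentation.

```python
import json
import re

MAX_PATTERNS = 50           # documented pattern-count limit
MAX_PATTERN_LENGTH = 256    # documented per-pattern length limit
DEFAULT_MAX_TEXT = 20_000   # documented default for CUSTOM_FAST_PII_MAX_TEXT_LENGTH


def load_custom_patterns(raw: str) -> dict[str, re.Pattern]:
    """Parse the JSON dictionary and compile patterns within the limits."""
    patterns = json.loads(raw) if raw else {}
    if not isinstance(patterns, dict):
        raise ValueError("expected a JSON object of {entity_name: regex}")
    if len(patterns) > MAX_PATTERNS:
        raise ValueError(f"more than {MAX_PATTERNS} custom patterns")
    compiled = {}
    for name, pattern in patterns.items():
        if len(pattern) > MAX_PATTERN_LENGTH:
            raise ValueError(f"pattern {name!r} exceeds {MAX_PATTERN_LENGTH} chars")
        compiled[name] = re.compile(pattern)
    return compiled


def scan_custom(
    text: str,
    compiled: dict[str, re.Pattern],
    max_text_length: int = DEFAULT_MAX_TEXT,
) -> list[tuple[str, str]]:
    """Return (entity, match) pairs; skip custom scanning for long texts."""
    if len(text) > max_text_length:
        return []  # custom patterns are skipped, as the text-length limit describes
    return [
        (name, match.group())
        for name, rx in compiled.items()
        for match in rx.finditer(text)
    ]
```

With the example value `'{"ACME_ID": "ACME-[0-9]{4}"}'`, scanning `"ticket ACME-1234"` reports the `ACME_ID` entity, while the same scan on a text longer than the configured limit returns nothing from the custom patterns.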

PII detection (especially in AI mode) has a built-in Circuit Breaker. If the detection engine starts reporting errors (e.g., out of RAM), the layer switches to either fail-open mode (letting traffic through) or fail-closed mode (blocking traffic), depending on the FAIL_OPEN setting.
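
The fail-open versus fail-closed behavior can be sketched in a few lines. The class `PiiCircuitBreaker`, its `allow` method, and the `max_failures` threshold are hypothetical and chosen only for illustration; the documentation does not describe how SafeLLM counts errors or when the circuit recovers, so this sketch omits any cooldown or reset timer.

```python
from typing import Callable


class PiiCircuitBreaker:
    """Wrap a PII detector and apply a fail-open / fail-closed policy."""

    def __init__(
        self,
        detect: Callable[[str], bool],  # returns True when PII is found
        fail_open: bool,                # mirrors the FAIL_OPEN setting
        max_failures: int = 3,          # assumed threshold, not documented
    ) -> None:
        self.detect = detect
        self.fail_open = fail_open
        self.max_failures = max_failures
        self.failures = 0

    def allow(self, text: str) -> bool:
        """Return True if the request may pass through the layer."""
        if self.failures >= self.max_failures:
            # Circuit is open: stop calling the failing detector and
            # apply the configured policy directly.
            return self.fail_open
        try:
            has_pii = self.detect(text)
        except Exception:
            self.failures += 1
            return self.fail_open
        self.failures = 0  # a healthy call resets the error count
        return not has_pii
```

With `fail_open=True` a crashing detector lets traffic through unscanned, which favors availability. With `fail_open=False` every request is blocked until the detector recovers, which favors privacy.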

OSS note: In the OSS build, USE_FAST_PII=false is ignored and the regex detector is always used.