๐Ÿ›ก๏ธ

Guardrails Design โ€” Stopping Models Before They Go Wrong

Building safe harnesses with input filters, output validation, and action limits

Claude Code asks "should I delete this file?" That's an execution-stage guardrail. Delete without asking and there's no recovery.

Input Guardrails

Validate inputs before they reach the model.

Prompt injection defense โ€” separate user input so it can't overwrite system prompts. Isolating input with XML tags or delimiters is standard.

Token length limits โ€” when input exceeds the context window, you need a truncation strategy. Naive truncation drops critical info.

PII masking โ€” mask personal data (emails, phone numbers, SSN) before sending to the model.

Execution Guardrails

Restrict model actions when using tools.

Allowlist-based tool use โ€” whitelist callable tools. "File reads OK, file deletes need approval."

Timeout and retry limits โ€” prevent infinite loops. If 5 retries fail, escalate.

Sandboxed execution โ€” run code in isolated environments. Docker containers or VMs protect the host.

Output Guardrails

Validate model responses before delivering to users.

Schema validation โ€” verify JSON output matches expected schema. Catch missing required fields and type mismatches.

Harmful content filters โ€” block inappropriate model outputs. Sometimes checked by a separate classifier model.

Grounding checks โ€” verify that information the model referenced actually exists in the provided context. Prevents hallucination.

How It Works

1

Input guardrails โ€” prompt injection defense, token length limits, PII masking

2

Execution guardrails โ€” allowlist-based tool use, timeout, sandbox

3

Output guardrails โ€” schema validation, harmful content filter, grounding checks

4

All three stages needed for complete guardrails โ€” one alone leaves gaps

Use Cases

Claude Code โ€” approval before file changes/deletes, warnings for dangerous commands Customer support chatbot โ€” PII masking + content filter + refund amount cap CI/CD pipeline AI โ€” production deploy blocked, autonomous only to staging