
Why OPA and Rego Don't Work for AI Governance

OPA is excellent infrastructure policy. But AI governance requires semantic understanding, organizational context, and session-aware evaluation that a pattern-matching engine was never designed to handle. Here's where Rego breaks down — and what the alternative looks like.

Aguardic Team·December 29, 2025·15 min read

Open Policy Agent is one of the best pieces of infrastructure software ever built. It solved a real problem — how do you enforce authorization and admission control across distributed systems — and it solved it well enough that it became the default answer. Kubernetes admission control, API authorization, Terraform plan validation, microservice access policies. If you're enforcing structured policy against structured data in infrastructure, OPA with Rego is the right tool.

The problem is that people are now trying to use it for something it was never designed to do.

As organizations deploy AI systems — LLMs, autonomous agents, AI-assisted workflows — the governance requirements extend far beyond what OPA can handle. The inputs are unstructured. The rules require judgment, not just pattern matching. The context is organizational, not technical. And the evaluation needs to understand meaning, not just structure.

This isn't a criticism of OPA. It's a recognition that AI governance is a fundamentally different problem than infrastructure policy, and treating them as the same problem leads to governance systems that are technically sophisticated and practically useless.

Where OPA Excels

To understand where OPA breaks down, it helps to understand where it works perfectly.

OPA evaluates structured policy against structured data. You write rules in Rego — a purpose-built query language — and OPA evaluates those rules against JSON input. The input is well-defined. The rules are deterministic. The output is a boolean or a structured decision. Everything is fast, predictable, and auditable.

A Kubernetes admission controller checking whether a pod spec includes resource limits:

deny[msg] {
    input.request.kind.kind == "Pod"
    container := input.request.object.spec.containers[_]
    not container.resources.limits.memory
    msg := sprintf("Container %v must set memory limits", [container.name])
}

This is clean. The input is a JSON object with a well-known schema. The rule checks a specific field for a specific condition. The output is deterministic — the same input always produces the same result. There's no ambiguity about what "memory limits" means or whether the container "should" have them. It either does or it doesn't.

OPA handles this class of problem better than anything else on the market. Infrastructure admission control, API authorization, resource validation, network policy, RBAC — these are all structured-data, deterministic-rule problems, and OPA was purpose-built for them.

The question is what happens when the input isn't structured, the rules aren't deterministic, and the evaluation requires understanding meaning rather than checking fields.

Problem 1: Unstructured Input

The first thing that breaks is the input model.

OPA evaluates JSON. Every Rego rule operates on structured fields — input.request.kind.kind, input.spec.containers[_].resources. This works because infrastructure resources have schemas. A Kubernetes pod spec has a defined structure. A Terraform plan has a defined structure. An AWS IAM policy has a defined structure. You know what fields exist and what values they can contain.

AI governance inputs don't have this property. The content you need to evaluate is natural language — an LLM response, a document, an email, a Slack message, an AI agent's planned action described in prose. There is no input.response.contains_phi field. There is no input.content.sentiment field. The information you need to evaluate against policy is embedded in unstructured text, and extracting it requires understanding the text.

Consider a HIPAA compliance rule: "AI-generated content must not include protected health information in communications to unauthorized recipients." To evaluate this in OPA, you would first need to:

  1. Determine whether the content contains PHI — which requires understanding that "John Smith's diabetes medication was adjusted last Tuesday" contains PHI but "diabetes affects approximately 37 million Americans" does not
  2. Determine whether the recipient is authorized — which might require checking the recipient against an access control list, but might also require understanding organizational relationships that aren't in any database
  3. Determine whether the content constitutes a "communication" — an internal draft is different from an outbound email, which is different from a Slack message in a private channel

You could try to preprocess the content — run it through a PHI detection model, classify the recipient, categorize the content type — and then feed structured results into OPA. Some teams do this. The result is a fragile pipeline where the actual governance logic is split across multiple systems: a preprocessing layer that does the hard work of understanding the content, and OPA that checks the preprocessed results against simple rules. OPA becomes a glorified if-statement at the end of a chain that does the real evaluation elsewhere.
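The split-logic architecture this describes can be sketched in a few lines. Everything below is hypothetical: the keyword regex stands in for a real PHI-detection model, and `final_policy_check` stands in for the Rego rule at the end of the chain.

```python
import re

def extract_features(text: str, recipient: str) -> dict:
    """Preprocessing layer: reduce unstructured content to structured
    fields a policy engine can check. A real pipeline would call a
    PHI-detection model here; this keyword regex only stands in for it."""
    return {
        "contains_phi": bool(re.search(r"\b(medication|diagnosis|patient)\b", text, re.I)),
        "recipient_authorized": recipient in {"care-team@hospital.example"},
    }

def final_policy_check(features: dict) -> bool:
    """The policy engine at the end of the chain. All the hard semantic
    work already happened upstream; what remains is an if-statement."""
    return not (features["contains_phi"] and not features["recipient_authorized"])

features = extract_features(
    "John Smith's diabetes medication was adjusted last Tuesday",
    "newsletter@example.com",
)
allowed = final_policy_check(features)  # False: PHI headed to an unauthorized recipient
```

Note where the actual governance decision lives: in the feature extraction, not in the policy check. Change the preprocessing and the rule silently changes meaning.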

This isn't a hypothetical problem. We've talked to engineering teams at healthcare AI companies who built exactly this architecture. They spent months constructing preprocessing pipelines to extract structured features from unstructured content, wrote Rego rules against those features, and ended up with a system that was brittle (any change to the preprocessing broke the rules), slow (content had to pass through multiple models before policy evaluation), and incomplete (features they didn't think to extract weren't evaluated at all).

The alternative is an evaluation engine that handles unstructured input natively. Deterministic rules check the things that can be checked with patterns — keywords, regex, known identifiers, field conditions. Semantic AI evaluation handles the things that require understanding — tone, intent, context, meaning. The same policy can contain both types of rules, evaluated against the same input, in a single evaluation pass. No preprocessing pipeline. No feature extraction. No duct tape between a content understanding system and a policy engine.
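A minimal sketch of what that single pass might look like. The rule names are invented, and a keyword check stands in where a real engine would dispatch the semantic rule to an LLM judge:

```python
import re
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    id: str
    kind: str                       # "deterministic" or "semantic"
    violated: Callable[[str], bool]  # returns True when the rule is violated

def evaluate(content: str, rules: list[Rule]) -> list[str]:
    """Single evaluation pass: pattern rules and judgment rules run
    side by side over the same raw text, with no preprocessing stage."""
    return [r.id for r in rules if r.violated(content)]

# Deterministic layer: a pattern the engine checks directly.
no_ssn = Rule("no-ssn", "deterministic",
              lambda text: bool(re.search(r"\b\d{3}-\d{2}-\d{4}\b", text)))

# Semantic layer: a real engine would call an LLM judge here;
# a keyword stand-in keeps the sketch self-contained.
tone = Rule("professional-tone", "semantic",
            lambda text: "whatever" in text.lower())

violations = evaluate("Whatever, your SSN is 123-45-6789.", [no_ssn, tone])
# Both layers fire against the same unstructured input, in one pass.
```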

Problem 2: Rules That Require Judgment

The second thing that breaks is the rule model.

Rego rules are deterministic. Given the same input, they always produce the same output. This is a feature for infrastructure policy — you want your admission controller to be predictable. But it's a fundamental limitation for AI governance, where many rules inherently require judgment.

"AI-generated customer communications must maintain a professional and empathetic tone."

What Rego rule catches this? You could try keyword matching — flag messages containing profanity or slang. But profanity detection doesn't evaluate tone. A message can be technically clean and deeply condescending. A message can use casual language and be perfectly appropriate for the context. Tone is a property of how something is said, not which words are used. Evaluating it requires understanding language the way a human reader would.

"AI-generated medical summaries must not overstate the certainty of diagnoses."

You can't write a Rego rule for this. The difference between "the patient has diabetes" and "lab results are consistent with a diabetes diagnosis, pending confirmation" is linguistic nuance — hedging language, epistemic qualifiers, degrees of certainty. A pattern-matching engine doesn't know that "consistent with" is hedged and "has" is definitive. Evaluating this requires semantic understanding of how certainty is expressed in clinical language.

"Contract terms generated by AI must not include indemnification clauses that exceed the scope approved by the legal team."

The word "indemnification" might appear in an approved clause and an unauthorized one. The difference is in the scope — unlimited indemnification versus indemnification capped at the contract value. Determining whether a specific indemnification clause exceeds approved scope requires comparing the generated clause against approved language, understanding the legal meaning of the terms, and making a judgment about whether the scope is equivalent.

These aren't edge cases. They're the core of AI governance. The rules that matter most — the ones that protect patients, customers, and organizations from AI-generated content that's technically correct but substantively wrong — are exactly the rules that Rego can't express.

A governance engine built for AI needs to support semantic rules natively: rules defined in natural language, evaluated by an LLM that understands meaning, with results that include explanations of why the content passed or failed. The rule definition looks like a requirement, not a query:

- id: professional-tone
  description: "Customer communications must maintain professional, empathetic tone"
  severity: MEDIUM
  type: semantic
  evaluation:
    prompt: |
      Evaluate whether this customer communication maintains a professional
      and empathetic tone. Consider: formality level, emotional awareness,
      respectful language, and appropriateness for a business context.
 
      Flag if the tone is condescending, dismissive, overly casual for the
      context, or lacks empathy when addressing customer concerns.

The rule is readable by anyone — not just Rego developers. The evaluation produces an explanation — not just a boolean. And the result captures nuance that a deterministic rule structurally cannot.

Problem 3: Organizational Context

The third thing that breaks is the context model.

OPA evaluates rules against the input it receives. If the information isn't in the input JSON, OPA doesn't know about it. You can preload data into OPA using bundles or external data sources, but the data must be structured, and the rules must know exactly which fields to check.

AI governance rules frequently depend on organizational context that doesn't fit this model — context that's scattered across documents, knowledge bases, and institutional knowledge that was never structured into JSON fields.

"AI-generated marketing copy must only include claims that appear in the approved messaging document."

The "approved messaging document" is a PDF. It contains paragraphs of approved language, lists of permitted claims, and nuanced guidance about when certain claims can and can't be used. To evaluate AI-generated copy against this document in OPA, you would need to extract every approved claim from the document, structure them as data, load them into OPA, and write Rego rules that compare generated content against the extracted claims. Every time the marketing team updates the approved messaging document, someone needs to re-extract the claims and update OPA's data bundle.

In practice, nobody does this. The approved messaging document stays in Google Drive, the AI generates whatever it generates, and someone in marketing spot-checks a sample. The governance gap isn't due to lack of intent — it's because the operational overhead of keeping OPA's data in sync with organizational documents is unsustainable.

Knowledge-grounded evaluation — what's sometimes called RAG-based policy evaluation — solves this by evaluating content directly against source documents. Upload the approved messaging document. The evaluation engine chunks it, embeds it, and stores it as a knowledge base. When AI-generated marketing copy needs to be evaluated, the engine retrieves the relevant sections of the approved messaging document and uses them as context for the evaluation. The semantic rule doesn't check a field — it compares the generated content against the source material and determines whether the claims align.

- id: approved-claims-only
  description: "Marketing claims must align with approved messaging document"
  severity: HIGH
  type: rag
  knowledge_source: approved-marketing-claims-2026
  evaluation:
    prompt: |
      Compare the following marketing content against the approved
      messaging document. Flag any claims that:
      - Do not appear in the approved messaging
      - Overstate or exaggerate approved claims
      - Make commitments not supported by the approved language

When the marketing team updates the document, they upload the new version. The knowledge base re-indexes. The policy evaluates against the current version automatically. No extraction, no data bundles, no manual sync.
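The chunk-and-retrieve step can be illustrated with a toy sketch. The document text and function names are invented for illustration, and plain word overlap stands in for the vector embeddings a production engine would use:

```python
def chunk(document: str, size: int = 12) -> list[str]:
    """Split a source document into fixed-size word chunks."""
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by word overlap with the content under evaluation.
    A real engine would embed and compare vectors; word overlap keeps
    this sketch dependency-free."""
    q = set(query.lower().split())
    ranked = sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)
    return ranked[:k]

approved_doc = ("Our product reduces deployment time by up to 40 percent. "
                "Claims about cost savings must cite the 2026 benchmark study.")
generated_copy = "Cut your deployment time by 40 percent!"

context = retrieve(generated_copy, chunk(approved_doc))
# `context` and `generated_copy` are then handed to the semantic rule's
# prompt, which judges whether the claims align with the approved language.
```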

This pattern applies everywhere organizational documents define governance rules. Brand guidelines. Underwriting standards. Contract templates. Regulatory frameworks. Clinical protocols. These documents represent the organization's own knowledge about what's acceptable — and in most organizations, that knowledge is completely disconnected from the systems that enforce policy.

Problem 4: Stateless Evaluation

OPA evaluations are stateless. Each evaluation is independent — it knows nothing about previous evaluations. This is fine for infrastructure policy, where each admission request is self-contained. A pod spec either has resource limits or it doesn't. The answer doesn't depend on what other pods were admitted earlier.

AI agent governance, as we described in detail in a previous post, is fundamentally stateful. An agent executes a sequence of actions over time. Whether a specific action is allowed depends on what the agent did earlier in the session — what data it accessed, what tools it called, what decisions it made.

You could theoretically model this in OPA by passing the entire session history as part of the input to every evaluation request. But Rego wasn't designed for this kind of temporal reasoning. Writing rules that say "if any previous action in this session accessed data tagged as PHI, and the current action sends content externally, then block" is technically possible in Rego but practically unwieldy. The rules become complex, the input payloads become large, and the debugging becomes nearly impossible because the evaluation depends on the accumulated state of an arbitrary number of prior actions.

Session-aware evaluation engines handle this natively. The session is a first-class concept — it has a lifecycle, it accumulates context across actions, and policy rules can reference session state directly. The rule fields.session.dataTags CONTAINS "PHI" is evaluated against a session context that the engine maintains automatically, updated with each action. The policy author doesn't need to reason about session history assembly — they write rules against session state the same way they write rules against any other input field.
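A minimal sketch of session state as a first-class concept, with hypothetical names and a single `PHI` tag standing in for a richer tag model:

```python
class Session:
    """Session context as a first-class object: every action updates the
    accumulated state that later policy checks can reference."""
    def __init__(self) -> None:
        self.data_tags: set[str] = set()

    def record(self, action: str, tags: set[str]) -> None:
        self.data_tags |= tags

def action_allowed(session: Session, action: str) -> bool:
    """Cross-action rule: once PHI has been accessed anywhere in the
    session, external sends are blocked for the rest of it."""
    return not (action == "send_external" and "PHI" in session.data_tags)

session = Session()
session.record("read_patient_record", {"PHI"})
action_allowed(session, "send_external")  # False: the earlier access taints the session
```

The policy author never assembles the history; the engine carries it forward action by action.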

Problem 5: The Rego Barrier

This is the most practical problem, and in many organizations, it's the one that actually kills OPA-based AI governance initiatives before they start.

Rego is a powerful, elegant language — for people who know Rego. For everyone else, it's a barrier.

AI governance policies are owned by compliance officers, legal teams, security leaders, and business stakeholders. These are the people who know what the rules should be. They know HIPAA requirements, brand guidelines, underwriting standards, and regulatory frameworks. They understand the organizational context that makes governance meaningful.

They do not write Rego.

deny[msg] {
    some i
    input.content.entities[i].type == "PHI"
    input.action.target.classification != "HIPAA_AUTHORIZED"
    msg := sprintf(
        "PHI entity '%v' cannot be sent to non-HIPAA-authorized target '%v'",
        [input.content.entities[i].value, input.action.target.name]
    )
}

For someone who reads Rego daily, this is clear. For the compliance officer who needs to define the policy, review the policy, and sign off on the policy — the person whose name goes on the compliance attestation — this is hieroglyphics. They can't verify that the rule correctly expresses their intent. They can't modify it when requirements change. They can't confidently tell an auditor that they understand what their policies enforce.

The result is a translation layer between the people who know the rules and the people who can write the code. The compliance team writes requirements in a document. An engineer translates them into Rego. The compliance team reviews the Rego and pretends they can verify it. The engineer pretends the compliance team's review was meaningful. Everyone pretends this is governance.

This isn't a skills gap that training solves. Compliance officers shouldn't need to learn a programming language to define governance policies. The policy definition language should be accessible to the people who own the policies — which means natural language descriptions, YAML-based rule definitions that read like requirements, and AI-assisted policy creation that lets a compliance officer describe a rule in plain English and get an enforceable policy back.

- id: phi-protection
  description: "Protected health information must not be sent to unauthorized recipients"
  severity: CRITICAL
  type: deterministic
  conditions:
    all:
      - field: content.data_tags
        operator: CONTAINS
        value: "PHI"
      - field: action.target
        operator: NOT_IN
        values: [list of HIPAA-authorized recipients]

A compliance officer can read this. They can verify that it matches their intent. They can modify it when requirements change. They can explain it to an auditor. The policy is owned by the person who understands the rules, not translated by someone who understands the language.
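As a sketch of how an engine might evaluate that condition format: the field paths and operators mirror the YAML example above, but the evaluator itself is hypothetical.

```python
def get_field(payload: dict, path: str):
    """Resolve a dotted field path like 'content.data_tags'."""
    value = payload
    for part in path.split("."):
        value = value[part]
    return value

def condition_met(payload: dict, cond: dict) -> bool:
    value = get_field(payload, cond["field"])
    if cond["operator"] == "CONTAINS":
        return cond["value"] in value
    if cond["operator"] == "NOT_IN":
        return value not in cond["values"]
    raise ValueError(f"unsupported operator: {cond['operator']}")

def rule_fires(payload: dict, conditions: list[dict]) -> bool:
    """'all' semantics: the rule fires only when every condition holds."""
    return all(condition_met(payload, c) for c in conditions)

payload = {"content": {"data_tags": ["PHI"]},
           "action": {"target": "press@example.com"}}
conditions = [
    {"field": "content.data_tags", "operator": "CONTAINS", "value": "PHI"},
    {"field": "action.target", "operator": "NOT_IN",
     "values": ["records@hospital.example"]},
]
rule_fires(payload, conditions)  # True: PHI headed to an unauthorized target
```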

What Replaces OPA for AI Governance

Nothing — and that's the wrong question. OPA doesn't need to be replaced. It needs to stay where it's excellent — infrastructure policy — and a different system needs to handle what it can't.

AI governance requires a purpose-built engine that handles:

Unstructured input. Natural language content evaluated without preprocessing pipelines. Text in, policy decision out.

Multi-layer evaluation. Deterministic rules for the roughly 60-70% of checks that are pattern-based. Semantic AI for the ~25% that require judgment. Knowledge-grounded evaluation for the ~10% that require organizational context. All three layers available in the same policy, evaluated against the same input.

Organizational knowledge. Policies grounded in the organization's own documents — brand guides, compliance manuals, regulatory frameworks — not just structured data loaded into bundles.

Session-aware evaluation. Stateful context that accumulates across agent actions, enabling cross-action policy rules that catch violations emerging from sequences, not individual events.

Accessible policy definitions. Rules defined in YAML and natural language, not a programming language. Owned by the people who understand the governance requirements, not translated by engineers.

Audit trails by default. Every evaluation logged with the policy version, the input, the result, and the explanation. Evidence generated as a natural output of enforcement, not assembled after the fact.

This is a different system than OPA because it solves a different problem. OPA governs infrastructure — whether a resource is allowed to exist, whether a request is authorized, whether a configuration meets requirements. AI governance governs content and behavior — whether an AI-generated output is safe, whether an agent action is authorized, whether a document complies with organizational rules.

The organizations that try to stretch OPA to cover both problems end up with the worst of both worlds: a complex, fragile system that does infrastructure policy well and AI governance poorly. The organizations that recognize these as separate problems — and use purpose-built tools for each — get infrastructure policy that's fast and deterministic and AI governance that handles nuance, context, and organizational knowledge.

OPA is excellent at what it does. AI governance is a different problem. Use the right tool for each.
