What HIPAA actually demands of an LLM feature
HIPAA almost never blocks an LLM feature on the model — it blocks on BAAs, PHI in your logs, and the vendor chain. Here's what compliance actually requires.
Most teams building a healthcare LLM feature brace for a fight about the model. The HIPAA review is almost never about the model. It’s about whether you can prove that protected health information (PHI) was contracted for, minimized, encrypted, logged, and owned — across every vendor that touched it. The model being impressive is not on that list.
This is engineering guidance, not legal advice — your privacy counsel has the final word. But these are the gaps that fail the review, and they are all infrastructure.
HIPAA asks a different question than your demo answered
Your prototype answered “can the model do this?” HIPAA asks “can you account for the PHI?” Those are different questions, and the prototype skipped the second one because skipping it is what made it fast. The Security Rule doesn’t grade your prompt; it grades the administrative, physical, and technical safeguards around the data. An LLM feature is just a new place PHI flows, and every place PHI flows has to be accounted for.
The requirements, mapped to an LLM feature
| HIPAA requirement | What it means for your LLM feature | Common fail |
|---|---|---|
| Business Associate Agreement | Every vendor that processes PHI (model provider, observability, eval tooling, vector store) must sign a BAA | “We just call the API.” No BAA, or a logging/eval SaaS quietly ingesting PHI with none |
| Minimum necessary | Put only the PHI the task needs into the prompt — not the whole record | Dumping the full chart into context “so the model has background” |
| Encryption | TLS to the API; encrypt the prompt/response store and any vector index at rest | PHI embeddings sitting in an unencrypted index |
| Audit controls | Log who invoked it, on whose record, and what came back — and protect those logs as PHI | No traceability, or traces in plaintext readable by anyone with dashboard access |
| Access controls | Role-scoped access to the feature and its logs; the model must not surface records the user couldn’t see | Retrieval that reaches across patients with no row-level scoping |
| Integrity & output control | Guardrails so an output can’t disclose the wrong person’s PHI; human review where the stakes warrant it | No output filtering; the model free-associates across whatever is in context |
| Assigned security responsibility | A named person owns this feature’s PHI risk and is accountable for it | “The AI team” owns it, i.e., no one |
None of these is about model quality. All of them are about the system around the model.
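As an illustration of the “minimum necessary” row, here is a minimal sketch of a per-task field allowlist applied before anything reaches the prompt. The task names, field names, and the `TASK_FIELDS` registry are hypothetical; a real feature would derive them from its own record schema and a review with privacy counsel.

```python
from typing import Any

# Hypothetical registry: which record fields each task may place in a prompt.
TASK_FIELDS: dict[str, frozenset[str]] = {
    "summarize_medications": frozenset({"medications", "allergies"}),
    "draft_followup_note": frozenset({"visit_reason", "care_plan"}),
}


def minimum_necessary(record: dict[str, Any], task: str) -> dict[str, Any]:
    """Return only the fields the named task is allowed to see."""
    try:
        allowed = TASK_FIELDS[task]
    except KeyError:
        # Fail closed: a task with no registered allowlist gets no PHI at all.
        raise ValueError(f"no field allowlist registered for task {task!r}") from None
    return {key: value for key, value in record.items() if key in allowed}


# Illustrative record; every value here is made up.
record = {
    "name": "Jane Example",
    "ssn": "000-00-0000",
    "medications": ["metformin"],
    "allergies": ["penicillin"],
    "visit_reason": "follow-up",
}
prompt_fields = minimum_necessary(record, "summarize_medications")
```

The design choice worth keeping is the fail-closed default: adding a new task forces someone to decide which fields it needs, rather than inheriting the whole chart.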
The trap: the things that make it production-ready also create exposure
Here is the part that catches good teams. The two practices that move an LLM feature toward production-grade — observability and evaluation — are also where PHI leaks. Traces capture prompts and outputs; those now contain PHI, so your tracing backend is a business associate and the traces are regulated records. Eval datasets built from real interactions are PHI sitting in a second system. The instinct to “log everything” and “build a golden dataset from prod” is correct for reliability and radioactive for compliance — unless every link in that chain has a BAA and the data is access-controlled or de-identified.
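One way to keep observability without shipping raw prompts to a tracing vendor is to export a scrubbed copy of each trace. The sketch below is a minimal illustration under assumed field names (`trace_id`, `patient_id`, `prompt`, and so on); it keeps sizes, latency, and a keyed pseudonym for correlation. A keyed hash of a patient ID is not a HIPAA de-identification method by itself, so whether the scrubbed trace still counts as PHI is a question for counsel.

```python
import hashlib
import hmac


def pseudonym(value: str, key: bytes) -> str:
    """Keyed hash so traces can be correlated without carrying the raw ID."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def scrub_trace(trace: dict, key: bytes) -> dict:
    """Build the copy of a trace that is allowed to leave your boundary.

    Prompt and output text are dropped; only their lengths are exported.
    """
    return {
        "trace_id": trace["trace_id"],
        "user_role": trace["user_role"],
        "patient_ref": pseudonym(trace["patient_id"], key),
        "prompt_chars": len(trace["prompt"]),
        "output_chars": len(trace["output"]),
        "latency_ms": trace["latency_ms"],
    }


# Illustrative trace; all values are made up. In practice the key would come
# from a secret store, not a literal in source code.
example = {
    "trace_id": "t-001",
    "user_role": "nurse",
    "patient_id": "MRN-12345",
    "prompt": "Summarize medications for Jane Example.",
    "output": "Jane Example takes metformin.",
    "latency_ms": 840,
}
exported = scrub_trace(example, key=b"demo-key-load-from-a-secret-store")
```

The full prompt and response can still be retained for audit purposes, but in a store you control, encrypted and access-controlled like any other PHI.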
This is why HIPAA and the Production-Readiness Bar overlap so tightly: Guardrails, Observability, and Ownership are exactly the dimensions a regulated launch lives or dies on. HIPAA just turns “good practice” into “legal requirement” and removes the option to defer them.
De-identification is a design lever, not an afterthought
The most effective move is often to keep PHI out of the model’s path entirely. If you can de-identify before the prompt — Safe Harbor or Expert Determination — much of the burden above falls away for that flow, because de-identified data isn’t PHI. That is an architecture decision, and it is made early: what is the minimum identifiable data this feature truly needs, and can the rest be stripped, tokenized, or re-attached on your side after the model returns?
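The “strip, tokenize, re-attach” idea can be sketched as below. This is a minimal illustration, not a de-identification method: it only replaces identifier values you already know from the structured record, and it makes no claim to satisfy Safe Harbor or Expert Determination on free text. The function names and the `[[LABEL]]` token format are assumptions for the example.

```python
def tokenize(text: str, identifiers: dict[str, str]) -> tuple[str, dict[str, str]]:
    """Replace known identifier values with placeholder tokens.

    `identifiers` maps a label (e.g. "NAME") to the literal value to strip.
    Returns the redacted text and a token -> original value mapping.
    The mapping is itself PHI and must stay on your side.
    """
    mapping: dict[str, str] = {}
    redacted = text
    for label, value in identifiers.items():
        if not value:
            continue  # an empty value would match everywhere; skip it
        token = f"[[{label}]]"
        redacted = redacted.replace(value, token)
        mapping[token] = value
    return redacted, mapping


def reattach(text: str, mapping: dict[str, str]) -> str:
    """Restore original values in text returned by the model."""
    for token, value in mapping.items():
        text = text.replace(token, value)
    return text


# Illustrative note; all values are made up.
note = "Jane Example (MRN 12345) reports improved sleep."
redacted, mapping = tokenize(note, {"NAME": "Jane Example", "MRN": "12345"})

# Stand-in for the model's response to the redacted prompt.
model_reply = "Summary for [[NAME]]: sleep improved."
final = reattach(model_reply, mapping)
```

Only `redacted` would go to the model; `mapping` never leaves your system, which is what lets the re-attachment happen after the model returns.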
Why the prototype passed and the review won’t
A healthcare LLM demo fails its compliance review for the same reason it stalls before production: the demo deferred the unglamorous parts, and here those parts are also the law. The fix isn’t a better model — it’s BAAs across the vendor chain, minimum-necessary prompts, encrypted and access-controlled logs, output guardrails, and a named owner. That is the work.
If you’re staring at a stalled healthcare AI feature and a security review you’re not sure you’ll pass, that is what the production-readiness audit is for: a fixed-scope assessment against the bar, including the HIPAA-specific gaps above, with a prioritized risk register and a scoped path to closing them.
Questions this raises
Straight answers.
- Is ChatGPT — or any LLM API — HIPAA compliant?
- A model API isn't "HIPAA compliant" by itself; compliance is a property of your whole system and your contracts. What matters is whether the provider will sign a Business Associate Agreement (BAA) and whether you've built the integration to meet the Security Rule. Some providers offer a BAA on specific plans — without one, sending PHI to them is a violation no matter how good the model is.
- Can we send PHI to an LLM API?
- Only if there's a signed BAA with that provider and every downstream subprocessor that sees the data, you send the minimum necessary, it's encrypted in transit, and the prompts and responses are logged and access-controlled like any other PHI. If you can't satisfy those, de-identify the data before it reaches the model or keep it out of the prompt.
- Does HIPAA regulate the model's accuracy or output quality?
- Not directly. HIPAA governs the confidentiality, integrity, and availability of PHI, not whether an answer is correct. But it bites the moment a wrong or over-broad output discloses the wrong patient's information — so output controls and human review sit on the boundary between a quality problem and a disclosure problem.
- What usually blocks an LLM feature in a HIPAA review?
- Rarely the model. Usually it's no BAA with the model provider, PHI sitting in prompt and response logs or an observability vendor with no BAA, no minimum-necessary discipline in how prompts are built, and no named owner for the feature's PHI risk. Those are the same gaps a production-readiness review surfaces — HIPAA just makes them legally non-optional.
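The conditions in these answers can be expressed as a preflight check that runs before a PHI-bearing flow is enabled. The sketch below is a hypothetical illustration: the `Vendor` and `Flow` types and their fields are assumptions, and a passing check is a record of what was verified, not a compliance determination.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Vendor:
    name: str
    baa_signed: bool


@dataclass(frozen=True)
class Flow:
    vendors: tuple[Vendor, ...]  # everyone who sees the payload, subprocessors included
    contains_phi: bool
    minimum_necessary_reviewed: bool
    tls: bool
    logs_access_controlled: bool
    owner: Optional[str]


def blockers(flow: Flow) -> list[str]:
    """List the reasons this flow should not carry PHI yet."""
    if not flow.contains_phi:
        # De-identified flows fall outside these checks in this sketch.
        return []
    found = [f"no BAA on file for {v.name}" for v in flow.vendors if not v.baa_signed]
    if not flow.minimum_necessary_reviewed:
        found.append("prompt contents not reviewed for minimum necessary")
    if not flow.tls:
        found.append("payload not encrypted in transit")
    if not flow.logs_access_controlled:
        found.append("prompt/response logs not access-controlled")
    if not flow.owner:
        found.append("no named owner for this flow's PHI risk")
    return found


# Illustrative flow: the model provider has a BAA, the tracing vendor does not.
flow = Flow(
    vendors=(Vendor("model-api", True), Vendor("tracing-saas", False)),
    contains_phi=True,
    minimum_necessary_reviewed=True,
    tls=True,
    logs_access_controlled=True,
    owner=None,
)
problems = blockers(flow)
```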
Production-Readiness Audit
This is the work, not just the writeup.
If this is your situation, the production-readiness audit is where it gets fixed — by the person who wrote this.