What We Keep Finding: Pre-LLM Infrastructure Failures in Production AI Chatbots

The failures we keep finding are not caused by the model.

They happen earlier, before the LLM call, before application logic runs, before any guardrail has a chance to act. We evaluate production AI chatbots across healthcare, education, HR, legal, and coaching workflows. The same six failure classes appear repeatedly, regardless of which LLM the platform uses or how sophisticated the system prompt is.

CEM Evaluation Series

Evaluated Deployments

The following failure classes are drawn from forensic evaluations of production AI deployments across healthcare, education, and enterprise platform contexts. All evaluations used the SASI Cooperative Extraction Method combined with Chrome DevTools network inspection. Responsible disclosure was sent to each platform prior to publication.

Healthcare AI Platform · April 15, 2026

Healthcare Chatbot Default Template

A zero-configuration healthcare chatbot deployment. Seven findings including unredacted PII at the transport layer, a crisis referral buried under six paragraphs of generative advice, and a hallucinated handoff function confirmed by an empty function log array.

4 Critical · 2 High

Read the evaluation →

EdTech Platform · 150+ Universities · April 10, 2026

Student Support AI Platform

An AI student support platform deployed at 150+ universities under FERPA. Six findings confirmed via a single unauthenticated API endpoint: student PII stored verbatim, complete AI reasoning chain exposed, full institutional system prompt returned verbatim, and emotional state classifications stored as an intentional product feature.

4 Critical · 2 High

Read the evaluation →

Fortune 500 Enterprise Platform · April 21, 2026

Fortune 500 No-Code AI Chatbot Platform

The no-code AI chatbot platform of one of the world's largest enterprise technology vendors, tested in a mental health and healthcare context. Nine findings including a series first: conversation content transmitted to a third-party analytics processor on every message without user disclosure or a Business Associate Agreement.

5 Critical · 2 High

Read the evaluation →

Fortune 500 Medical AI Platform · May 16, 2026

Fortune 500 Enterprise Medical AI Platform

An authenticated clinical AI deployment from a Fortune 500 company that operates one of the largest integrated medical platforms in the United States. The headline finding: the model responded correctly in four of six scenarios — and that is still insufficient. Every correct response exists entirely inside the model with no enforcement layer, no audit trail, and no compliance artifact that survives a silent model update.

2 Critical · 4 High

Read the evaluation →

Six failure classes. Observed across production deployments. None reliably fixed by system prompt configuration.

Unredacted PII transmitted before bot logic applies

Sensitive user input — names, dates of birth, Social Security Numbers, insurance identifiers, health details — can reach the network payload before any application-layer redaction runs. In evaluated deployments, this data was visible in the raw request body transmitted to third-party servers. No system prompt, dashboard setting, or bot configuration intercepts data at the transport layer. This is a platform infrastructure limitation, not a configuration gap.

Regulatory exposure: HIPAA Security Rule (45 CFR §164.312) · GDPR Art. 32 · State privacy laws (CCPA, CPA)
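To make "redaction before the transport layer" concrete, here is a minimal sketch in which redaction runs before the request body is assembled, so raw identifiers never enter the payload. The patterns, the `redact` helper, and the `build_request_body` function are hypothetical illustrations, not any evaluated platform's code, and three regexes are nowhere near sufficient coverage for real PII.

```python
import re

# Illustrative patterns only. Real coverage (names, insurance identifiers,
# free-text health details) needs far more than three regexes.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "DOB": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def redact(text: str) -> tuple[str, dict[str, int]]:
    """Replace each match with a typed placeholder and count what was removed."""
    counts: dict[str, int] = {}
    for label, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"[{label}]", text)
        if n:
            counts[label] = n
    return text, counts


def build_request_body(user_message: str) -> dict:
    """Hypothetical payload builder: redaction runs before anything is serialized."""
    safe_text, counts = redact(user_message)
    return {"message": safe_text, "redactions": counts}
```

The design point is ordering: the only string that ever reaches the serializer is the redacted one, so there is nothing for a downstream setting to forget to scrub.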

Crisis signals processed as ordinary conversation

When a user sends a message containing a crisis or suicide-adjacent signal, the platform routes it as standard conversational input unless a deterministic intercept exists before the model call. In evaluated deployments, crisis signals produced multi-paragraph generative empathy responses with a single crisis resource buried at the end — satisfying no applicable standard. The absence of pre-LLM safety fields in the network response confirms no deterministic routing was active.

Regulatory exposure: California SB 243 (effective Jan. 1, 2026) · New York GBL Art. 47 (effective Nov. 5, 2025) · EU AI Act Art. 5(1)(b)
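A deterministic intercept of the kind described above can be sketched as a check that runs before the model is ever called. Everything here is an assumption made for illustration: the phrase list is not a clinically validated lexicon, the response text is a placeholder, and `Routed` and `route` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Illustrative phrases only; not a clinically validated lexicon.
CRISIS_PATTERN = re.compile(
    r"\b(kill myself|end my life|suicid\w*|want to die|hurt myself)\b",
    re.IGNORECASE,
)

# Placeholder text; a real deployment would use the resources its jurisdiction requires.
CRISIS_RESPONSE = (
    "If you are in immediate danger, call your local emergency number. "
    "In the US, you can call or text 988 to reach the Suicide & Crisis Lifeline."
)


@dataclass
class Routed:
    reply: str
    crisis_detected: bool  # the kind of pre-LLM safety field an inspector could look for
    model_called: bool


def route(message: str, call_model: Callable[[str], str]) -> Routed:
    """Return a fixed crisis response without calling the model when a signal matches."""
    if CRISIS_PATTERN.search(message):
        return Routed(CRISIS_RESPONSE, crisis_detected=True, model_called=False)
    return Routed(call_model(message), crisis_detected=False, model_called=True)
```

Because the crisis branch returns before `call_model` is invoked, the resource cannot be buried under generated text, and the `crisis_detected` flag gives network inspection something observable.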

Hallucinated safety actions — function calls described but never executed

The chatbot verbally describes executing a function — routing to a human agent, logging a data request, triggering an escalation — while forensic inspection of the event payload confirms no function ran. In one evaluated deployment, the bot stated it was calling a handoff function while the infrastructure returned an empty function log array. The user remained in the AI session believing a transfer had occurred. No prompt can prevent an LLM from generating plausible descriptions of actions it is not actually taking.

Regulatory exposure: Colorado SB 26-189 (effective Jan. 1, 2027) · California SB 243
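One way to catch this class of failure is to compare what the reply claims against what the function log shows. The sketch below assumes a hypothetical log shape (a list of entries with a `name` key) and a hypothetical phrase-to-function mapping; neither comes from an evaluated platform.

```python
import re

# Hypothetical mapping from phrases a bot might say to the function that must have run.
CLAIM_PATTERNS = {
    "handoff_to_human": re.compile(
        r"\b(transferr?ing you|connecting you to (a|an) (human|agent))", re.IGNORECASE
    ),
    "log_data_request": re.compile(
        r"\b(logged|recorded) your (data )?request\b", re.IGNORECASE
    ),
}


def unbacked_claims(reply_text: str, function_log: list[dict]) -> list[str]:
    """Return functions the reply claims to have run that do not appear in the log."""
    executed = {entry.get("name") for entry in function_log}
    return [
        fn
        for fn, pattern in CLAIM_PATTERNS.items()
        if pattern.search(reply_text) and fn not in executed
    ]
```

A non-empty result means the reply should be blocked or replaced before the user sees it, since the described action did not happen.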

Data rights requests handled as verbal reassurance

When a user submits an explicit data rights request (deletion, opt-out, or a CCPA/CPA statutory request), the chatbot responds conversationally, assuring the user that their data is not stored or will be deleted. Simultaneous forensic inspection confirms that session identifiers, contact IDs, and real user identifiers remain unchanged and active in the payload immediately following the request. No purge event occurs. No context reset occurs. The verbal assurance is false and exposes the deployer to deceptive-practices claims.

Regulatory exposure: Colorado SB 26-189 (effective Jan. 1, 2027) · Colorado Privacy Act · California CPRA/CCPA · GDPR Art. 17
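By contrast, a real data rights action changes state and leaves a record. The following is a minimal sketch, assuming an in-memory session store and audit log; the pattern, the `handle_data_rights` function, and the event fields are all hypothetical.

```python
import re
from datetime import datetime, timezone

# Illustrative pattern; real request detection needs broader, tested coverage.
DATA_RIGHTS_PATTERN = re.compile(
    r"\b(delete|erase|remove)\b.*\b(my|all)\b.*\b(data|information|records)\b"
    r"|\bopt(ing)? out\b",
    re.IGNORECASE,
)


def handle_data_rights(message: str, session_id: str, sessions: dict, audit_log: list) -> bool:
    """Purge the session and append a timestamped audit event for a data rights request."""
    if not DATA_RIGHTS_PATTERN.search(message):
        return False
    purged = session_id in sessions
    sessions.pop(session_id, None)  # the purge event: session state is actually removed
    audit_log.append({
        "event": "data_rights_request",
        "session_id": session_id,
        "purged": purged,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
    return True
```

Only after this returns `True` should the bot be allowed to tell the user that anything was deleted.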

Internal prompts and reasoning exposed via weak endpoints

In evaluated deployments, the complete system prompt — including operational instructions, restricted URLs, internal rules, and knowledge base source identifiers — was returned verbatim via an unauthenticated API endpoint. In the same response, the AI's internal chain-of-thought reasoning, including threat assessments of user messages and model self-corrections, was exposed in a response field accessible without credentials. This is not a theoretical vulnerability. It was confirmed via standard network inspection on publicly accessible deployments.

Regulatory exposure: Colorado SB 26-189 (effective Jan. 1, 2027)
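The exposure described here comes down to two missing checks: authentication, and an allowlist on what a response may contain. A minimal sketch follows, with hypothetical field names; the allowlist is the point, not the specific keys.

```python
# Hypothetical field names for illustration.
PUBLIC_FIELDS = {"reply", "conversation_id"}


def public_view(internal_response: dict, authenticated: bool) -> dict:
    """Refuse unauthenticated callers and return only allowlisted fields."""
    if not authenticated:
        raise PermissionError("credentials required")
    # Anything not explicitly allowlisted (system prompt, reasoning trace,
    # knowledge base identifiers) is dropped, including fields added later.
    return {k: v for k, v in internal_response.items() if k in PUBLIC_FIELDS}
```

An allowlist fails closed: a new internal field stays private by default, whereas a blocklist leaks it until someone remembers to add it.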

Therapeutic and professional drift into unlicensed guidance

AI chatbots positioned as administrative, coaching, wellness, or support tools drift into emotional counseling, behavioral health guidance, legal advice, and insurance eligibility guidance when users present with relevant distress or questions. In evaluated healthcare deployments, bots offered structured coping protocols, emotional state inference, and breathing exercises — without licensed professional oversight, without AI disclosure in a clinical context, and without crisis referral. No system prompt reliably prevents this. The drift is an emergent property of the underlying model.

Regulatory exposure: Illinois WOPRA HB 1806 (effective Aug. 4, 2025) · Nevada AB 406 (effective July 1, 2025) · California AB 3030 · EU AI Act Art. 5(1)(b)

WHAT SASKI DOES

Pre-LLM infrastructure control for failure modes that prompts and model moderation do not reliably fix.

SASKI is middleware that runs locally before your LLM API call. On every user message it applies deterministic safety logic — PII redaction, crisis detection, adversarial blocking, policy enforcement, and data rights handling — and returns a governed payload for the model along with a cryptographic receipt proving the control ran.

It is not a dashboard. It is not a compliance checklist. It is not a model setting.

It is a control layer that operates at the point where these failures actually occur.
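The receipt format is not described here, so the following is only a sketch of the general idea: sign a record of which controls ran together with a hash of the governed payload, so that later tampering with either is detectable. HMAC-SHA256 with a shared key is an assumption chosen for brevity, and `make_receipt` and `verify_receipt` are hypothetical names, not SASKI's API.

```python
import hashlib
import hmac
import json


def _canonical(obj) -> bytes:
    """Serialize deterministically so the same content always hashes the same way."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def make_receipt(governed_payload: dict, controls_run: list[str], key: bytes) -> dict:
    """Build a signed record of the payload hash and the controls that ran."""
    record = {
        "payload_sha256": hashlib.sha256(_canonical(governed_payload)).hexdigest(),
        "controls_run": controls_run,
    }
    # The signature covers the record as it stands before the signature is added.
    record["signature"] = hmac.new(key, _canonical(record), hashlib.sha256).hexdigest()
    return record


def verify_receipt(receipt: dict, governed_payload: dict, key: bytes) -> bool:
    """Check both the signature and that the payload matches the recorded hash."""
    unsigned = {k: v for k, v in receipt.items() if k != "signature"}
    expected = hmac.new(key, _canonical(unsigned), hashlib.sha256).hexdigest()
    payload_hash = hashlib.sha256(_canonical(governed_payload)).hexdigest()
    return (
        hmac.compare_digest(expected, receipt.get("signature", ""))
        and unsigned.get("payload_sha256") == payload_hash
    )
```

A receipt meant for third parties such as regulators would more plausibly use an asymmetric signature, so that verifiers do not need the signing key.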

Four things SASKI does that system prompts cannot:

Redacts PII before it reaches the network — not after the model responds.
Applies deterministic crisis routing, independent of the model's reasoning or empathy framing.
Executes real data rights actions — session purge, audit record, timestamped compliance event.
Generates cryptographic receipts proving what ran, what was redacted, and what reached the model.

Modes: SASKI supports 12 operational modes, including healthcare, mental health, child, HR/recruiting, education, wellness, and general assistant, each with mode-specific PII levels, crisis thresholds, and compliance floors.

Shadow mode: SASKI can run alongside your existing AI without changing production behavior. You see exactly what it would have intercepted on your real user conversations before any commitment.
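The shadow pattern in general can be sketched as follows: the control sees every message and records its findings, but the production call proceeds unchanged. The `shadow_call` wrapper and its arguments are hypothetical and say nothing about how SASKI implements the mode.

```python
from typing import Callable


def shadow_call(
    message: str,
    control: Callable[[str], dict],
    call_model: Callable[[str], str],
    shadow_log: list,
) -> str:
    """Record what the control would have done, then run production unchanged."""
    try:
        shadow_log.append({"would_intercept": control(message)})
    except Exception as exc:
        # A failing control must never break production traffic in shadow mode.
        shadow_log.append({"control_error": repr(exc)})
    return call_model(message)  # the original message, not a governed one
```

Note that the log holds the control's findings rather than the raw message, so the shadow log does not become a second copy of the sensitive input.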

WHAT YOUR TEAM RECEIVES AFTER 7 DAYS OF SHADOW MODE

PII and PHI Detection Summary: Count and examples of personally identifiable and protected health information detected before the LLM call.
Compliance Exposure Examples: Specific conversation flows where COPPA, HIPAA, or state AI law obligations were triggered in live traffic.
Token Savings Calculation: Exact monthly and annual token overhead reduction SASKI would deliver at your actual inference volume.
SaskiEnvelope Evidence Sample: Sample cryptographic receipts showing the audit trail built for regulators, courts, and underwriters.
Crisis and Escalation Signal Count: How many turns triggered crisis detection, what tier they landed in, and how SASKI would have responded.
Unsafe Flow Documentation: Examples of hallucinated safety actions, boundary failures, or therapeutic drift observed in actual user conversations.
Latency Impact Report: Measured latency overhead on your stack. SDK target: under 50ms. API target: under 200ms.
Recommended Path: Which SASKI mode, tier configuration, and jurisdiction settings are right for your deployment context.
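As a rough illustration of how a latency overhead figure like the one in the Latency Impact Report could be measured, the sketch below times a pre-LLM control over a sample of messages and reports the median and maximum in milliseconds. The `measure_overhead_ms` helper is hypothetical, and the numbers it produces depend entirely on the machine and the control being timed.

```python
import statistics
import time
from typing import Callable


def measure_overhead_ms(control: Callable[[str], object], messages: list[str]) -> dict:
    """Time the control on each message and summarize the added latency."""
    samples = []
    for message in messages:
        start = time.perf_counter()
        control(message)
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
        "n": len(samples),
    }
```

Comparing `median_ms` and `max_ms` against a budget such as the 50ms SDK target shows whether typical or only worst-case turns exceed it.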