What Five Rounds of Testing Taught Us About Safety in Mental Health AI
After five rounds of longitudinal testing, one conclusion stands out clearly: safety is not a single dial you turn up or down. Model-provider guardrails, application prompts, and a dedicated safety layer each play distinct, non-interchangeable roles.
The data show a clear trend across several large API models: over time they have become more conservative. Risky moves are rarer, but so are direct, grounded responses in difficult mental health conversations, and hedging and avoidance increase as safety policies tighten. In contrast, the same models wrapped with SASI stayed within a narrow, low-pathology band across all metrics and did not drift toward silence or over-deflection.
Round 5 added an important counterpoint. A relatively unconstrained Llama base model combined with SASI reached a comparable safety zone, but with noticeably stronger therapeutic richness. This suggests that safety middleware can standardize risk behavior across very different underlying model profiles, without flattening the conversation. In other words, safety does not have to come at the cost of usefulness if it is implemented at the right layer.
A second conclusion is that integration details matter just as much as model choice. The native cleanup work on MyTrusted.ai surfaced a quiet but critical issue: legacy adapters, frontend-driven modes, and hidden fallback logic were subtly distorting SDK outputs and contaminating benchmarks, even when the SASI safety brain itself was performing correctly. By moving to direct SasiResult usage, config-driven modes, real conversation history, and fail-closed behavior when SASI is unavailable, the scores now reflect what the SDK is actually doing rather than a blend of SASI plus leftover crisis detectors and emotional taggers from earlier architectures.
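To make that fail-closed wiring concrete, here is a minimal sketch of the integration shape. Only the SasiResult name comes from the SDK itself; the SasiClient interface, field names, and helper below are hypothetical stand-ins for illustration. The point is that the application consumes the SDK result directly and refuses to answer unguarded when SASI is unreachable.

```typescript
// Hypothetical SDK surface: only the SasiResult name appears in the real SDK;
// the fields and SasiClient interface below are illustrative stand-ins.
interface SasiResult {
  safeToRespond: boolean;
  riskScore: number;          // e.g. an MDTSAS-style score surfaced by the SDK
  safetyHints: string[];      // hints forwarded to the model at generation time
  responseOverride?: string;  // fixed crisis language when SASI takes over
}

type Turn = { role: "user" | "assistant" | "system"; content: string };

interface SasiClient {
  evaluate(history: Turn[], mode: "default" | "child" | "therapist"): Promise<SasiResult>;
}

const FAIL_CLOSED_MESSAGE =
  "I can't continue safely right now. If you are in crisis, please contact your " +
  "local emergency number or a crisis line such as 988 in the US.";

async function safeReply(
  sasi: SasiClient,
  history: Turn[],
  mode: "default" | "child" | "therapist",
  generate: (history: Turn[], hints: string[]) => Promise<string>
): Promise<string> {
  let result: SasiResult;
  try {
    // Direct SasiResult usage: no legacy crisis detectors or emotional taggers
    // re-interpreting the SDK's output.
    result = await sasi.evaluate(history, mode);
  } catch {
    // Fail closed: if SASI is unavailable, never fall through to an unguarded model.
    return FAIL_CLOSED_MESSAGE;
  }

  if (!result.safeToRespond) {
    return result.responseOverride ?? FAIL_CLOSED_MESSAGE;
  }

  // Pass the real conversation history plus SASI's safety hints to the model.
  return generate(history, result.safetyHints);
}
```

The important design choice is the catch branch: an outage in the safety layer degrades to a fixed crisis message, not to raw model output.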
System prompts emerged as the third pillar. Rounds 4 and 5 made it clear that SASI does not need, and in some cases is actively harmed by, large, over-engineered system prompts. Minimal but precise prompts, covering identity, operating modes such as Default, Child, and Therapist, and fixed crisis language, were sufficient. With clean wiring of MDTSAS scores and safety hints, even relatively small or lightly aligned models behaved like good adults in the room: strong boundary integrity, early risk recognition, and appropriate crisis-resource provision, all without an encyclopedic prompt trying to script every response. This is why future rounds will explicitly test families of prompts, from ultra-minimal to more structured, to map how much prompt complexity is actually necessary once a dedicated safety SDK is in place.
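As an illustration of what "minimal but precise" can look like, the sketch below shows a config-driven prompt family keyed by mode. The prompt wording is invented for this example and is not what was used in testing; the structure simply reflects the pattern described above, with SASI's safety hints appended at request time rather than scripted up front.

```typescript
// Illustrative prompt family at the ultra-minimal end of the spectrum; the
// wording here is hypothetical, not the production prompts.
type Mode = "default" | "child" | "therapist";

const CRISIS_LANGUAGE =
  "If the person may be in danger or thinking about harming themselves, share crisis " +
  "resources and encourage contacting local emergency services or a crisis line.";

const MINIMAL_PROMPTS: Record<Mode, string> = {
  default:
    "You are a supportive assistant on MyTrusted.ai. Be direct, warm, and honest.",
  child:
    "You are a supportive assistant for a young person. Use simple, gentle language " +
    "and involve a trusted adult when appropriate.",
  therapist:
    "You are a supportive assistant working alongside a licensed clinician. Reflect " +
    "and clarify; never diagnose.",
};

// Identity + mode + fixed crisis language, with SASI safety hints appended per
// request instead of an encyclopedic prompt trying to script every response.
function buildSystemPrompt(mode: Mode, safetyHints: string[]): string {
  return [MINIMAL_PROMPTS[mode], CRISIS_LANGUAGE, ...safetyHints].join("\n");
}
```

A more structured member of the same family would add sections for tone, scope, and escalation rules to the same builder, which is what makes prompt complexity straightforward to vary across future rounds.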
Taken together, multi-round testing, multi-judge scoring, and continuous SDK cleanup have clarified the separation of concerns: provider-level safety stacks, application-level prompts, and model-portable safety middleware each have distinct responsibilities. The path forward reflects that clarity. Using MyTrusted.ai as a clean native SASI sandbox, experimenting with prompt templates across multiple LLMs, and continuing to benchmark twenty safety and quality metrics across many applications position SASI not as just another chatbot, but as shared infrastructure for anyone trying to keep mental health conversations both safe and genuinely helpful.
Our latest round of testing can be found here: https://techviz.us/sasi-testing-results/sasi-testing-round-5