Springer LNCS · May 2026

Detecting Stealth Sycophancy in Mental-Health Dialogue

Dynamic Emotional Signature Graphs — a model-agnostic evaluator that decouples clinical states and scores therapeutic progress with asymmetric geometry.

The Challenge

What is Stealth Sycophancy?

The blind spot in mental-health AI evaluation — surface empathy masking clinically directional deterioration

Surface Empathy

The AI sounds warm, supportive, and understanding. Users feel heard, but their clinical direction has not improved: the response validates without challenging.

Hidden Harm

Responses quietly reinforce catastrophic beliefs, hopeless predictions, and distorted self-labels. Polite on the surface, deteriorating underneath.

Directional Trap

Language is warm, but clinical trajectory is worsening. Traditional evaluators only see the surface — they can't detect the direction of change.

DESG's core insight: we must track clinical trajectory direction, not single-turn surface quality.

Core Insight

Clinical State Geometry

LLMs as sensors, not judges — extract clinical states, then score with geometry.

1548-D Clinical State Vector

Composite representation from dialogue: semantic (1536-D) + affective valence-arousal (2-D) + cognitive distortion distribution (10-D).

Directed Graph Captures Trajectory

State sequences become directed emotional signature graphs — encoding temporal evolution, not independent per-turn scoring.

Asymmetric Directional Distance

Distinguishes "improving" from "deteriorating" — exponential penalty for worsening, bounded reward for recovery.

Architecture

Four-Stage Pipeline

From raw dialogue to clinical safety score — end-to-end offline evaluation without LLM judges.

State Decoupling

Dialogue → 1548-D clinical state vector
h_sem ∥ h_emo ∥ h_cog

Asymmetric CDD

Clinical directional distance metric
Deterioration penalty, recovery reward

Graph Construction

DESG graph + Hungarian GED
Nodes = states, edges = KL divergence

Trajectory Scoring

Momentum reward + distortion penalty wall
Productive / Neutral / Harmful

h_cog ∈ R¹⁰

10 CBT Cognitive Distortions

Each dimension corresponds to a clinically-defined cognitive distortion pattern, extracted as a probability simplex.

Catastrophizing

Amplifying small issues into catastrophic consequences

Mind Reading

Assuming others' thoughts without verification

Fortune Telling

Predicting negative outcomes as certain

Should Statements

Imposing unreasonable "must/should" demands

Labeling

Defining self/others with extreme labels

Mental Filter

Focusing only on negative details, ignoring positives

All-or-Nothing

Black-and-white, binary thinking

Overgeneralization

Inferring universal rules from single events

Personalization

Attributing external events to oneself

Emotional Reasoning

Taking feelings as evidence of facts

Structured extraction via LLM · Probability simplex output · Embedded as cognitive track in clinical state space
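As a minimal sketch of what "probability simplex output" could mean in practice, the snippet below normalizes raw per-distortion scores into a 10-D vector that sums to 1. The function name `to_simplex`, the dictionary input format, and the uniform fallback for all-zero scores are illustrative assumptions, not the authors' extraction code.

```python
# Order of the 10 CBT distortions listed above (identifier spelling is hypothetical).
DISTORTIONS = [
    "catastrophizing", "mind_reading", "fortune_telling", "should_statements",
    "labeling", "mental_filter", "all_or_nothing", "overgeneralization",
    "personalization", "emotional_reasoning",
]


def to_simplex(raw_scores):
    """Clip negative scores to zero and rescale so the 10 entries sum to 1."""
    clipped = [max(0.0, float(raw_scores.get(name, 0.0))) for name in DISTORTIONS]
    total = sum(clipped)
    if total == 0.0:
        # Assumption: with no evidence for any distortion, fall back to uniform.
        return [1.0 / len(DISTORTIONS)] * len(DISTORTIONS)
    return [value / total for value in clipped]


# Hypothetical raw scores, e.g. parsed from a structured LLM response.
h_cog = to_simplex({"catastrophizing": 3.0, "labeling": 1.0})
```

Keeping the output on the simplex is what makes a divergence such as KL well defined when two states are compared later in the pipeline.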

Stage 1

1548-D State Decoupling

Three independent clinical tracks, concatenated into a single state vector.

x_t = [h_sem ∥ h_emo ∥ h_cog] ∈ R¹⁵⁴⁸

1536-D

Semantic Track

MiniLM-L6-v2 embeddings, zero-padded to 1536 dimensions

2-D

Affective Track

Valence + Arousal from circumplex model

10-D

Cognitive Track

10-class CBT distortion probability simplex

Semantic dimensions occupy 99.2% of the vector (1536 of 1548), but the ablation results indicate that the 12 clinical dimensions are the core discriminative substrate.
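The concatenation x_t = [h_sem ∥ h_emo ∥ h_cog] can be sketched in a few lines. This is an illustration under assumptions: `build_state` is a hypothetical helper, and the 384-wide input in the usage line assumes the public all-MiniLM-L6-v2 checkpoint, whose width this page does not state.

```python
SEM_DIM, EMO_DIM, COG_DIM = 1536, 2, 10  # 1536 + 2 + 10 = 1548


def build_state(sem_embedding, valence, arousal, h_cog):
    """Concatenate the three tracks into one 1548-D clinical state vector."""
    if len(sem_embedding) > SEM_DIM:
        raise ValueError("semantic embedding is wider than the semantic track")
    if len(h_cog) != COG_DIM:
        raise ValueError("cognitive track must have 10 entries")
    # Zero-pad the sentence embedding up to the fixed semantic width.
    h_sem = list(sem_embedding) + [0.0] * (SEM_DIM - len(sem_embedding))
    h_emo = [valence, arousal]
    return h_sem + h_emo + list(h_cog)


# Placeholder values only; a real embedding would come from the encoder.
x_t = build_state([0.1] * 384, valence=-0.4, arousal=0.7, h_cog=[0.1] * 10)
```

With this layout the clinical features sit in the last 12 positions, which makes it easy to slice them out for a clinical-only ablation.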

Stage 2

Asymmetric Clinical Distance

Recovery is rewarded slowly; deterioration is penalized exponentially.

Baseline: Symmetric

d(A,B) = d(B,A)
All directions equal

Proposed: Asymmetric CDD

D(A,B) ≠ D(B,A)
Direction-aware scoring

Removing directionality drops F1 from 0.9353 to 0.6239 (a 33.3% relative decrease).
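The page does not give the CDD formula, so the following is only a sketch of the stated behavior: an exponential penalty when the state worsens and a bounded reward when it recovers. The scalar "severity" input, the function name `directional_cost`, and the parameters `alpha` and `beta` are all assumptions made for illustration.

```python
import math


def directional_cost(sev_from, sev_to, alpha=2.0, beta=1.0):
    """Asymmetric cost of moving between two severity levels.

    Worsening (severity goes up) is penalized exponentially; recovery
    (severity goes down) earns a reward that is bounded in (-1, 0).
    """
    delta = sev_to - sev_from
    if delta >= 0:
        return math.expm1(alpha * delta)   # exp(alpha * delta) - 1, grows fast
    return -math.tanh(beta * -delta)       # saturates, never below -1


worsening = directional_cost(0.2, 0.8)  # A -> B
recovery = directional_cost(0.8, 0.2)   # B -> A
```

Because the two branches use different functions, D(A,B) and D(B,A) differ in magnitude as well as sign, which is the property a symmetric baseline lacks.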

Stage 3

Directed Emotional Signature Graph

Dialogue windows become directed graphs encoding temporal clinical state evolution.

Nodes: Clinical States

Each dialogue turn x_t maps to a graph node carrying the 1548-D state vector.

Edge Weights: KL Divergence

Cognitive distribution divergence between adjacent nodes + temporal penalty γ·Δt.
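A minimal sketch of this edge weighting, assuming the divergence is taken as KL(p_t ∥ p_{t+1}) between consecutive cognitive distributions and that Δt is the gap between turn timestamps. The direction of the KL term, the smoothing constant `eps`, and the value of γ are assumptions; the toy distributions are 3-D for brevity, whereas the cognitive track is 10-D.

```python
import math


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) in nats, with additive smoothing so zero entries stay finite."""
    p_s = [pi + eps for pi in p]
    q_s = [qi + eps for qi in q]
    zp, zq = sum(p_s), sum(q_s)
    return sum((pi / zp) * math.log((pi / zp) / (qi / zq))
               for pi, qi in zip(p_s, q_s))


def build_edges(cog_states, timestamps, gamma=0.1):
    """Directed edges (t, t+1, weight) with weight = KL + gamma * delta_t."""
    edges = []
    for t in range(len(cog_states) - 1):
        dt = timestamps[t + 1] - timestamps[t]
        weight = kl_divergence(cog_states[t], cog_states[t + 1]) + gamma * dt
        edges.append((t, t + 1, weight))
    return edges


states = [[0.7, 0.2, 0.1], [0.7, 0.2, 0.1], [0.2, 0.3, 0.5]]
edges = build_edges(states, timestamps=[0, 1, 2])
```

In this toy run the first edge carries only the temporal term because the distribution did not change, while the second edge is heavier because the distortion profile shifted.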

Hungarian GED Matching

Graph edit distance is approximated by solving the node assignment optimally with the Hungarian algorithm, O(n³), for template matching.
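The assignment step behind a Hungarian-based GED approximation can be sketched as follows. This is a plain-Python implementation of the Hungarian algorithm (potentials form); the L1 node substitution cost and the helper name `node_matching_cost` are assumptions, and edge costs and node insertions or deletions, which a full GED approximation also needs, are omitted.

```python
def hungarian(cost):
    """Minimum-cost assignment for an n x m cost matrix with n <= m.

    Returns (assignment, total) where assignment[i] is the column for row i.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u, v = [0.0] * (n + 1), [0.0] * (m + 1)   # row and column potentials
    p, way = [0] * (m + 1), [0] * (m + 1)     # p[j]: row matched to column j
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [inf] * (m + 1), [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # walk the augmenting path back to the virtual column 0
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j]:
            assignment[p[j] - 1] = j - 1
    return assignment, sum(cost[i][assignment[i]] for i in range(n))


def node_matching_cost(nodes_a, nodes_b):
    """Match nodes of graph A to nodes of template B by L1 distance (assumed cost)."""
    cost = [[sum(abs(x - y) for x, y in zip(a, b)) for b in nodes_b]
            for a in nodes_a]
    return hungarian(cost)


assignment, total = node_matching_cost([[1.0, 0.0], [0.0, 1.0]],
                                       [[0.0, 1.0], [1.0, 0.0]])
```

The template with the lowest matching cost would then be the closest trajectory class for the window under this simplified cost model.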

Trajectory classes: Productive · Neutral · Harmful

Benchmark

3 × 1000 Cross-Domain Benchmark

Peer support, counseling dialogue, and crisis-oriented interaction — three distinct mental-health scenarios.

Peer-ed

EmpatheticDialogues · Peer Support

1,000 dialogue windows

Clinical-esconv

ESConv · Counseling Dialogue

1,000 dialogue windows

Crisis-cradle

CRADLE-Dialogue · Crisis Intervention

1,000 dialogue windows

Split per dataset: 600 train / 200 dev / 200 held-out test · Total: 3,000 windows
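The split arithmetic can be checked with a short sketch. How the windows were actually assigned to splits is not described here, so the seeded shuffle and the helper name `split_windows` are assumptions for illustration only.

```python
import random


def split_windows(window_ids, seed=0, sizes=(600, 200, 200)):
    """Shuffle deterministically and cut into train / dev / held-out test."""
    ids = list(window_ids)
    if len(ids) != sum(sizes):
        raise ValueError("expected exactly %d windows" % sum(sizes))
    random.Random(seed).shuffle(ids)
    n_train, n_dev, _ = sizes
    return ids[:n_train], ids[n_train:n_train + n_dev], ids[n_train + n_dev:]


datasets = ("Peer-ed", "Clinical-esconv", "Crisis-cradle")
splits = {name: split_windows(range(1000)) for name in datasets}
total_windows = sum(len(part) for parts in splits.values() for part in parts)
held_out = sum(len(parts[2]) for parts in splits.values())
```

Here `total_windows` is 3,000 and `held_out` is 600, matching the 3 × 200 held-out test windows used for the reported scores.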

Results

Mean Macro-F1

3 × 200 held-out test windows · Sorted ascending

Praetor-7B: 0.2559
Auto-J: 0.3307
Prometheus-2: 0.3535
DeepSeek-Judge: 0.5876
TRACT: 0.5972
BERTScore: 0.7390
DESG-GatedANN: 0.8370
DESG-Deep: 0.8462
ConcatANN: 0.9202
DESG-Ensemble: 0.9353

Chart legend: DESG · Internal Variants · Text Baselines · External Evaluators
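For readers unfamiliar with the metric, Mean Macro-F1 can be read as macro-averaged F1 over the three trajectory classes, computed per test set and then averaged across the three test sets. The sketch below follows that reading, which is an assumption about the exact protocol; the toy labels are invented for illustration.

```python
LABELS = ("Productive", "Neutral", "Harmful")


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mean_macro_f1(per_dataset):
    """Average macro-F1 over (y_true, y_pred) pairs, one pair per test set."""
    return sum(macro_f1(t, p) for t, p in per_dataset) / len(per_dataset)


toy_true = ["Productive", "Productive", "Neutral", "Harmful"]
toy_pred = ["Productive", "Neutral", "Neutral", "Harmful"]
toy_score = macro_f1(toy_true, toy_pred)
```

Macro averaging weights the Harmful class as heavily as the others even when it is rare, which suits a safety-oriented evaluator.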

Ablation

Feature Ablation

Full ConcatANN F1 = 0.9202 · ΔF1 relative to the full model for each ablated configuration

Semantic-only: −0.264
Mental Filter: −0.120
Valence: −0.103
Arousal: −0.101
Labeling: −0.085
Mind Reading: −0.083
Fortune Telling: −0.083
Clinical-only: −0.045

Semantic-only → F1 0.6559 (−28.7%) · Clinical-only → F1 0.8755 (−4.9%) · Full model → F1 0.9202
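The percentages quoted for the ablations are relative changes against the full model, while the chart values are absolute differences. A short check of that arithmetic, using only the F1 values reported on this page (the helper name `relative_change` is illustrative):

```python
def relative_change(baseline, ablated):
    """Relative change in percent, negative when the ablated score is lower."""
    return (ablated - baseline) / baseline * 100.0


FULL_CONCAT_ANN = 0.9202

semantic_only_pct = round(relative_change(FULL_CONCAT_ANN, 0.6559), 1)
clinical_only_pct = round(relative_change(FULL_CONCAT_ANN, 0.8755), 1)
semantic_only_abs = round(0.6559 - FULL_CONCAT_ANN, 3)
clinical_only_abs = round(0.8755 - FULL_CONCAT_ANN, 3)

# Directionality ablation on DESG-Ensemble, reported under Stage 2.
no_direction_pct = round(relative_change(0.9353, 0.6239), 1)
```

The results reproduce the reported figures: −28.7% and −4.9% for the two track ablations (−0.264 and −0.045 absolute), and −33.3% for removing directionality.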

Variants

DESG Model Family

From brute-force retrieval to graph matching — Mean Macro-F1 on 3×200 held-out test.

DESG-Ensemble

Spatial + temporal dual-stream late fusion

0.9353

ConcatANN

Brute-force spatial retrieval, kNN + 1548-D

0.9202

DESG-Deep

MTP-pretrained temporal Transformer

0.8462

DESG-GatedANN

Gated attention retrieval variant

0.8370

LCM-learned

Learned clinical manifold angular metric

0.7667
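To make "brute-force spatial retrieval, kNN" concrete, here is a minimal sketch of exhaustive nearest-neighbor classification over labeled state vectors. Euclidean distance, majority voting, and k = 3 are assumptions; the 2-D toy vectors stand in for 1548-D states, and `knn_predict` is a hypothetical name.

```python
import math
from collections import Counter


def knn_predict(query, bank, k=3):
    """Label a query by majority vote among its k nearest labeled vectors.

    bank is a list of (vector, label) pairs; every pair is scanned, so the
    cost is linear in the size of the bank (no approximate index).
    """
    ranked = sorted(bank, key=lambda item: math.dist(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


bank = [
    ([0.0, 0.0], "Productive"),
    ([0.0, 1.0], "Productive"),
    ([5.0, 5.0], "Harmful"),
    ([5.0, 6.0], "Harmful"),
    ([6.0, 5.0], "Harmful"),
]
near_origin = knn_predict([0.2, 0.1], bank)
near_cluster = knn_predict([5.0, 5.2], bank)
```

A dual-stream variant such as DESG-Ensemble would combine a spatial prediction of this kind with a temporal model's output; the fusion rule is not specified on this page.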

Conclusion & Future Directions

Core Contributions

DESG-Ensemble achieves 0.9353 Mean F1 with 100% coverage and 100% sycophancy specificity.

Clinical state geometry is the core discriminative substrate: removing directionality lowers F1 by 33.3% in relative terms (0.9353 to 0.6239).

Distortion reinforcement alignment (0.93 F1) is the most reliable clinical audit anchor.

Future Directions

Large-scale clinical expert annotation and prospective clinical validation.

Multilingual and cross-cultural stress testing.

Pre-deployment prospective human-in-the-loop audit system.

Limitations: offline evaluation benchmark, not a clinical trial · Benchmarks contain construction artifacts · EITE is a stress-test diagnostic only.
