Springer LNCS · May 2026

Detecting Stealth Sycophancy in Mental-Health Dialogue

Dynamic Emotional Signature Graphs — a model-agnostic evaluator that decouples clinical states and scores therapeutic progress with asymmetric geometry.


The Challenge

What is Stealth Sycophancy?

The blind spot in mental-health AI evaluation — surface empathy masking clinically directional deterioration

Surface Empathy

AI sounds warm, supportive, and understanding. Users feel heard — but clinical direction hasn't improved. The response validates without challenging.

Hidden Harm

Responses quietly reinforce catastrophic beliefs, hopeless predictions, and distorted self-labels. Polite on the surface, deteriorating underneath.

Directional Trap

Language is warm, but clinical trajectory is worsening. Traditional evaluators only see the surface — they can't detect the direction of change.

DESG's core insight: we must track clinical trajectory direction, not single-turn surface quality.

Core Insight

Clinical State Geometry

LLMs as sensors, not judges — extract clinical states, then score with geometry.

1

1548-D Clinical State Vector

Composite representation from dialogue: semantic (1536-D) + affective valence-arousal (2-D) + cognitive distortion distribution (10-D).

2

Directed Graph Captures Trajectory

State sequences become directed emotional signature graphs — encoding temporal evolution, not independent per-turn scoring.

3

Asymmetric Directional Distance

Distinguishes "improving" from "deteriorating" — exponential penalty for worsening, bounded reward for recovery.

Architecture

Four-Stage Pipeline

From raw dialogue to clinical safety score — end-to-end offline evaluation without LLM judges.

1

State Decoupling

Dialogue → 1548-D clinical state vector
h_sem ∥ h_emo ∥ h_cog

2

Asymmetric CDD

Clinical directional distance metric
Deterioration penalty, recovery reward

3

Graph Construction

DESG graph + Hungarian GED
Nodes = states, edges = KL divergence

4

Trajectory Scoring

Momentum reward + distortion penalty wall
Productive / Neutral / Harmful

[Figure: DESG Method Overview]

h_cog ∈ ℝ^10

10 CBT Cognitive Distortions

Each dimension corresponds to a clinically defined cognitive distortion pattern; the ten values are extracted as a point on the probability simplex (non-negative, summing to 1).

Catastrophizing

Amplifying small issues into catastrophic consequences

Mind Reading

Assuming others' thoughts without verification

Fortune Telling

Predicting negative outcomes as certain

Should Statements

Imposing unreasonable "must/should" demands

Labeling

Defining self/others with extreme labels

Mental Filter

Focusing only on negative details, ignoring positives

All-or-Nothing

Black-and-white extreme binary thinking

Overgeneralization

Inferring universal rules from single events

Personalization

Attributing external events to oneself

Emotional Reasoning

Taking feelings as evidence of facts

Structured extraction via LLM · Probability simplex output · Embedded as cognitive track in clinical state space
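
As an illustration of the probability-simplex output, the sketch below turns raw per-distortion scores into a 10-D vector h_cog that is non-negative and sums to 1. The softmax normalization, the label ordering, and the function name `to_simplex` are assumptions made for this example; the page does not say how the LLM extraction is normalized.

```python
import math

# The ten CBT distortion classes listed above, in an assumed fixed order.
DISTORTIONS = [
    "catastrophizing", "mind_reading", "fortune_telling", "should_statements",
    "labeling", "mental_filter", "all_or_nothing", "overgeneralization",
    "personalization", "emotional_reasoning",
]

def to_simplex(scores):
    """Map raw scores (dict: distortion -> float) onto the simplex via softmax.

    Softmax is an assumption; any normalization that yields non-negative
    values summing to 1 would satisfy the description on the page.
    """
    raw = [float(scores.get(name, 0.0)) for name in DISTORTIONS]
    peak = max(raw)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in raw]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for one dialogue turn.
h_cog = to_simplex({"catastrophizing": 2.0, "fortune_telling": 1.0})
```

Missing labels default to a raw score of 0, so every class keeps some probability mass.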

Stage 1

1548-D State Decoupling

Three independent clinical tracks, concatenated into a single state vector.

x_t = [h_sem ∥ h_emo ∥ h_cog] ∈ ℝ^1548

1536-D

Semantic Track

MiniLM-L6-v2 embeddings, zero-padded to 1536 dimensions

2-D

Affective Track

Valence + Arousal from circumplex model

10-D

Cognitive Track

10-class CBT distortion probability simplex

Semantic dimensions occupy 99.2% of the vector (1536 of 1548), but the ablation results indicate that the 12 clinical features are the core discriminative substrate.
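
The concatenation can be sketched as follows. This is a minimal example using plain Python lists; the function name `build_state` and the sample inputs are hypothetical, and the 384-D input length reflects the usual MiniLM-L6-v2 output size rather than anything stated on this page.

```python
SEM_DIM, EMO_DIM, COG_DIM = 1536, 2, 10

def build_state(h_sem_raw, valence, arousal, h_cog):
    """Concatenate the three tracks into one 1548-D state vector x_t."""
    if len(h_sem_raw) > SEM_DIM:
        raise ValueError("semantic embedding is longer than 1536 dimensions")
    if len(h_cog) != COG_DIM:
        raise ValueError("cognitive track must have 10 entries")
    # Zero-pad the sentence embedding up to 1536 dimensions.
    h_sem = list(h_sem_raw) + [0.0] * (SEM_DIM - len(h_sem_raw))
    h_emo = [float(valence), float(arousal)]
    return h_sem + h_emo + list(h_cog)

# Hypothetical inputs: a 384-D embedding, a valence/arousal pair,
# and a uniform distortion distribution.
x_t = build_state([0.1] * 384, valence=-0.4, arousal=0.6, h_cog=[0.1] * 10)

# Share of the vector taken up by the semantic track.
semantic_share = SEM_DIM / len(x_t)
```

The last line reproduces the 99.2% figure: 1536 of the 1548 dimensions are semantic.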

Stage 2

Asymmetric Clinical Distance

Recovery is rewarded slowly; deterioration is penalized exponentially.

Baseline: Symmetric
d(A,B) = d(B,A)
All directions equal

Proposed: Asymmetric CDD
D(A,B) ≠ D(B,A)
Direction-aware scoring
Removing directionality → F1 drops from 0.9353 to 0.6239 (−33.3%)
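
One way to picture the asymmetry is a signed score on a scalar severity scale. The sketch below is an assumption-laden illustration, not the CDD formula from the paper: the scalar severity input, the `exp`/`tanh` forms, and the parameters `alpha` and `beta` are all hypothetical choices that merely satisfy "exponential penalty, bounded reward".

```python
import math

def asymmetric_cdd(sev_a, sev_b, alpha=3.0, beta=1.0):
    """Directional score for moving from state A to state B.

    sev_a, sev_b: hypothetical scalar severities (higher = worse).
    Positive return values are penalties (deterioration);
    negative return values are rewards (recovery).
    """
    delta = sev_b - sev_a
    if delta > 0:
        # Worsening: penalty grows exponentially and is unbounded.
        return math.exp(alpha * delta) - 1.0
    # Improving (or unchanged): reward saturates, bounded in [-1, 0].
    return -math.tanh(beta * -delta)

worsening = asymmetric_cdd(0.2, 0.8)   # A -> B gets worse
recovering = asymmetric_cdd(0.8, 0.2)  # B -> A gets better
```

Swapping the arguments changes both the sign and the magnitude of the result, which is the D(A,B) ≠ D(B,A) property; a symmetric baseline would return the same value in both directions.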
Stage 3

Directed Emotional Signature Graph

Dialogue windows become directed graphs encoding temporal clinical state evolution.

N

Nodes: Clinical States

Each dialogue turn x_t maps to a graph node carrying the 1548-D state vector.

E

Edge Weights: KL Divergence

Cognitive distribution divergence between adjacent nodes + temporal penalty γ·Δt.

G

Hungarian GED Matching

Approximate graph edit distance via Hungarian algorithm O(n³) for optimal template matching.
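
The edge weighting described above can be sketched in a few lines. This example assumes the KL divergence is taken from the next state's cognitive distribution to the current one and that timestamps are given in arbitrary units; the direction of the divergence, the value of `gamma`, the smoothing constant `eps`, and the function names are all assumptions. Toy 3-class distributions stand in for the 10-D cognitive track.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def edge_weights(h_cogs, timestamps, gamma=0.01):
    """Weights for edges t -> t+1: KL divergence plus gamma * delta_t."""
    edges = []
    for t in range(len(h_cogs) - 1):
        dt = timestamps[t + 1] - timestamps[t]
        w = kl_divergence(h_cogs[t + 1], h_cogs[t]) + gamma * dt
        edges.append((t, t + 1, w))
    return edges

# Toy example: three turns, the last two with identical distributions.
states = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.5, 0.3, 0.2]]
edges = edge_weights(states, timestamps=[0, 1, 2])
```

When two adjacent states share the same distribution, the KL term vanishes and only the temporal penalty remains.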

[Figure: example directed graph over states x1 → x2 → x3 → x4 with edge weights KL = 0.12, 0.34, 0.71; legend: Productive, Neutral, Harmful]

Benchmark

3 × 1000 Cross-Domain Benchmark

Peer support, counseling dialogue, and crisis-oriented interaction — three distinct mental-health scenarios.

💬

Peer-ed

EmpatheticDialogues · Peer Support

1,000

dialogue windows

📋

Clinical-esconv

ESConv · Counseling Dialogue

1,000

dialogue windows

🚨

Crisis-cradle

CRADLE-Dialogue · Crisis Intervention

1,000

dialogue windows

Split per dataset: 600 train / 200 dev / 200 held-out test · Total: 3,000 windows

Results

Mean Macro-F1

3 × 200 held-out test windows · Sorted ascending

Model             Mean Macro-F1
Praetor-7B        0.2559
Auto-J            0.3307
Prometheus-2      0.3535
DeepSeek-Judge    0.5876
TRACT             0.5972
BERTScore         0.7390
DESG-GatedANN     0.8370
DESG-Deep         0.8462
ConcatANN         0.9202
DESG-Ensemble     0.9353

Legend: DESG · Internal Variants · Text Baselines · External Evaluators
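
For readers unfamiliar with the metric, the following sketch shows how a Mean Macro-F1 over several test sets can be computed. It assumes the three trajectory classes (Productive, Neutral, Harmful) are the labels and that a class with no true or predicted instances scores 0; the data are toy values, not benchmark results.

```python
LABELS = ["Productive", "Neutral", "Harmful"]

def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # Assumption: an empty class contributes an F1 of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def mean_macro_f1(per_dataset):
    """Average macro-F1 across datasets; maps name -> (y_true, y_pred)."""
    values = [macro_f1(t, p) for t, p in per_dataset.values()]
    return sum(values) / len(values)

# Toy example with one mislabeled window.
truth = ["Productive", "Productive", "Neutral", "Harmful"]
preds = ["Productive", "Neutral", "Neutral", "Harmful"]
toy_score = macro_f1(truth, preds)
toy_mean = mean_macro_f1({"a": (truth, preds), "b": (truth, truth)})
```

Because macro-F1 weights every class equally, an evaluator that misses the rarer Harmful class is penalized as heavily as one that misses the majority class.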
Ablation

Feature Ablation

Full ConcatANN F1 = 0.9202 · ΔF1 relative to the full model when each feature is removed; the "-only" rows keep a single track and drop the rest.

Feature            ΔF1
Semantic-only      −0.264
Mental Filter      −0.120
Valence            −0.103
Arousal            −0.101
Labeling           −0.085
Mind Reading       −0.083
Fortune Telling    −0.083
Clinical-only      −0.045

Semantic-only → F1 0.6559 (−28.7% relative)  ·  Clinical-only → F1 0.8755 (−4.9% relative)  ·  Full model → F1 0.9202
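
The absolute and relative figures above can be checked with a short calculation. The F1 values are taken from this page; the helper name `relative_change` is only for illustration.

```python
FULL_F1 = 0.9202  # full ConcatANN model, from the ablation summary

def relative_change(ablated_f1, full_f1=FULL_F1):
    """Relative F1 change with respect to the full model."""
    return (ablated_f1 - full_f1) / full_f1

# Absolute deltas (the values listed per feature).
semantic_only_abs = round(0.6559 - FULL_F1, 3)
clinical_only_abs = round(0.8755 - FULL_F1, 3)

# Relative deltas in percent (the values in the summary line).
semantic_only_rel = round(relative_change(0.6559) * 100, 1)
clinical_only_rel = round(relative_change(0.8755) * 100, 1)
```

The percentages are relative to the full-model F1, which is why −0.264 absolute corresponds to −28.7%.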

Variants

DESG Model Family

From brute-force retrieval to graph matching — Mean Macro-F1 on 3×200 held-out test.

DESG-Ensemble

Spatial + temporal dual-stream late fusion

0.9353

ConcatANN

Brute-force spatial retrieval, kNN + 1548-D

0.9202

DESG-Deep

MTP-pretrained temporal Transformer

0.8462

DESG-GatedANN

Gated attention retrieval variant

0.8370

LCM-learned

Learned clinical manifold angular metric

0.7667
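
To make "brute-force spatial retrieval, kNN" concrete, here is a minimal sketch of exhaustive nearest-neighbour classification over labeled state vectors. Cosine similarity, majority voting, `k = 3`, and the function names are assumptions for illustration; the page does not specify ConcatANN's similarity measure or voting rule. Toy 3-D vectors stand in for the 1548-D states.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_label(query, bank, k=3):
    """Majority label among the k bank vectors most similar to the query.

    bank: list of (vector, label) pairs; every pair is scored (brute force).
    """
    ranked = sorted(bank, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled bank of training windows.
bank = [
    ([1.0, 0.0, 0.0], "Productive"),
    ([0.9, 0.1, 0.0], "Productive"),
    ([0.0, 1.0, 0.0], "Harmful"),
    ([0.0, 0.9, 0.1], "Harmful"),
    ([0.0, 0.0, 1.0], "Neutral"),
]
prediction = knn_label([0.1, 1.0, 0.0], bank)
```

Exhaustive scoring costs one similarity computation per stored window, which is manageable at the benchmark's scale of a few thousand windows.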

Conclusion & Future Directions

Core Contributions

1

DESG-Ensemble achieves 0.9353 Mean F1 with 100% coverage and 100% sycophancy specificity.

2

Clinical state geometry is the core discriminative substrate: removing directionality causes a 33.3% relative drop in F1 (0.9353 → 0.6239).

3

Distortion reinforcement alignment (0.93 F1) is the most reliable clinical audit anchor.

Future Directions

1

Large-scale clinical expert annotation and prospective clinical validation.

2

Multilingual and cross-cultural stress testing.

3

Pre-deployment prospective human-in-the-loop audit system.

Limitations: offline evaluation benchmark, not a clinical trial · Benchmarks contain construction artifacts · EITE is a stress-test diagnostic only.

Research Team

Shenzhen MSU-BIT University, Shenzhen, China

TH · Co-first Author
Beining Xu · Co-first Author
Hanbo Zhang · Author
Yongming Lu · Corresponding Author