
tool-call drift detector

cognometric instrument #3 — BFCL v3 5-fold CV AUC 0.943, 23 text-only features (v6.1 retrain), black-box compatible
styxx 6.1.0 beats the hidden-state baseline: AUC 0.72 → 0.943. MIT licensed
USER PROMPT
AVAILABLE FUNCTIONS (JSON — array of tool schemas)
TOOL CALL MADE (JSON — {name, arguments})
drift risk · decision · instrument: drift-v1
top signals driving the decision
how to read the verdict
OK (drift_risk < 0.50) — the tool call matches the stated intent and the schema
DRIFT (drift_risk ≥ 0.50) — the tool call has a mismatch with the prompt or the schema
top signals show which of 23 features drove the verdict. spurious_arg_frac catches hallucinated extra fields. missing_required_frac catches dropped required args. arg_order_inversion (new in v6.1) catches value-swapped arg_swap cases. tool_in_prompt / overlap_jaccard catch mis-targeted tool selection. This is calibrated logistic regression — you can read the coefficients directly in calibrated_weights_drift_v1.py.
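Since the detector is a calibrated logistic regression, the verdict logic can be sketched in a few lines. The weights and intercept below are illustrative placeholders, not the shipped coefficients (those live in calibrated_weights_drift_v1.py), and only 3 of the 23 features are shown:

```python
import math

# Illustrative weights for 3 of the 23 features; the real coefficients
# are in calibrated_weights_drift_v1.py. Values here are placeholders.
WEIGHTS = {
    "spurious_arg_frac": 2.1,      # hallucinated extra fields
    "missing_required_frac": 1.8,  # dropped required args
    "arg_order_inversion": 1.4,    # value-swapped args (new in v6.1)
}
BIAS = -1.2  # illustrative intercept

def drift_risk(features: dict) -> float:
    """Calibrated logistic regression: sigmoid of a weighted feature sum."""
    z = BIAS + sum(WEIGHTS[k] * features.get(k, 0.0) for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

def verdict(features: dict) -> str:
    """Apply the 0.50 threshold described above."""
    return "DRIFT" if drift_risk(features) >= 0.50 else "OK"
```

Because the model is linear in the features, each product `weight * feature` is directly the signal's contribution to the log-odds, which is what the "top signals" panel ranks.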
the research — BFCL v3 5-fold CV + per-drift-type AUC
Trained on 3,700 labeled (prompt, functions, tool_call) triplets from Berkeley Function Calling Leaderboard v3 — 658 gold no-drift samples plus 3,042 drift positives via mutation (arg_swap, arg_drop, spurious_arg, tool_rename) and natural irrelevance splits. 5-fold stratified CV AUC 0.943 ± 0.009 (pooled 0.943) on the v6.1 retrain; v6.0 baseline was 0.916. Best null heuristic (schema_conformance) caps at 0.733 — the calibrated detector beats every null by +0.210.
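The four mutation families used to manufacture drift positives can be sketched as follows. This is an illustrative reconstruction of the idea, not the actual dataset-generation code; the helper name `mutate` and the `_v2` rename suffix are assumptions:

```python
import copy
import random

def mutate(call: dict, kind: str, rng: random.Random) -> dict:
    """Turn a gold {name, arguments} tool call into a labeled drift
    positive. Sketch only; the real BFCL mutation pipeline may differ."""
    m = copy.deepcopy(call)
    args = m["arguments"]
    keys = sorted(args)
    if kind == "arg_drop" and keys:
        del args[rng.choice(keys)]            # drop a required argument
    elif kind == "spurious_arg":
        args["extra_field"] = "unrequested"   # hallucinate an extra field
    elif kind == "arg_swap" and len(keys) >= 2:
        a, b = rng.sample(keys, 2)
        args[a], args[b] = args[b], args[a]   # swap two argument values
    elif kind == "tool_rename":
        m["name"] = m["name"] + "_v2"         # call a tool not in the schema
    return m
```

Note that arg_swap leaves the set of keys and values unchanged, which is exactly why it is the hardest family for surface features (see the per-type AUC table below).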

spurious_arg           AUC 0.997   clean capture
arg_drop               AUC 0.997   clean capture
irrelevance_called     AUC 0.980   null baseline 0.562
arg_swap               AUC 0.755   v6.1 partial fix from 0.664 via arg_order_inversion
tool_rename            AUC n/a     n=1 in BFCL (not evaluable)

Remaining failure mode, published openly: arg_swap at AUC 0.755. The v6.1 positional-inversion feature lifts it from 0.664 but not all the way: cases where both swapped values share a prompt position (numerical ambiguity, e.g. "divide 5 by 5") or where one value was synthesized (not present in the prompt) still escape. A full close is targeted for v3 via embedding-based semantic fit of each argument value to its slot.
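One plausible shape for the positional-inversion feature, which also makes both published escape routes visible, is sketched below. This is an illustrative reconstruction; the shipped v6.1 feature may be computed differently:

```python
def arg_order_inversion(prompt: str, arguments: dict) -> float:
    """1.0 if the argument values appear in the prompt in reverse of
    their argument order, suggesting two values were swapped.
    Sketch only; not the shipped v6.1 implementation."""
    positions = []
    for value in arguments.values():
        idx = prompt.find(str(value))
        if idx < 0:
            return 0.0  # synthesized value not in prompt: feature is blind
        positions.append(idx)
    if len(positions) < 2 or len(set(positions)) < len(positions):
        return 0.0      # shared positions ("divide 5 by 5") also escape
    return 1.0 if positions != sorted(positions) else 0.0
```

The two `return 0.0` guards are precisely the residual arg_swap failure cases named above, which is why a semantic per-slot fit, rather than surface position, is needed to close them.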

Where this sits vs prior work: the only published comparable baseline is Healy et al. 2026 ("Internal Representations as Indicators of Hallucinations in Agent Tool Selection"), which reports AUC 0.716–0.721 on Glaive (n=2,411) using last-layer hidden-state features fed to an MLP, an approach that requires access to model internals. styxx reaches 0.943 on BFCL v3 with 23 text-only features and works on any closed model (OpenAI, Anthropic, Gemini) with zero weight access. This is empirical validation of cognometry's law II (cross-substrate universality) on a third instrument: same calibrated-LR methodology, third independent failure-mode family.

Reproducer: scripts/drift_calibrated_v1.py · dataset: data/drift_v0/drift_dataset_v0.jsonl. Everything reruns deterministically.
powered by cognometry + styxx · github

embed this verdict in your site

paste this snippet anywhere — it renders a live detector widget. no install, no api key, works in any static html.
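For shape only, an embed of this kind typically looks like the markup below. The `src` URL is a placeholder, not the real widget address; copy the actual snippet from the box on this page.

```html
<!-- placeholder src: substitute the real address from the snippet box -->
<iframe
  src="https://example.com/drift-v1/embed"
  width="100%" height="480"
  style="border:0"
  title="tool-call drift detector"
  loading="lazy"></iframe>
```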
iframe snippet
live preview