
tool-call drift detector

cognometric instrument #3 — BFCL v3 5-fold CV AUC 0.943, 23 text-only features (v6.1 retrain), black-box compatible
styxx 6.1.0 beats the hidden-state baseline: AUC 0.72 → 0.943. MIT licensed
USER PROMPT
AVAILABLE FUNCTIONS (JSON — array of tool schemas)
TOOL CALL MADE (JSON — {name, arguments})
drift risk · decision · instrument: drift-v1
top signals driving the decision
how to read the verdict
OK (drift_risk < 0.50) — the tool call matches the stated intent and the schema
DRIFT (drift_risk ≥ 0.50) — the tool call has a mismatch with the prompt or the schema
top signals show which of 23 features drove the verdict. spurious_arg_frac catches hallucinated extra fields. missing_required_frac catches dropped required args. arg_order_inversion (new in v6.1) catches value-swapped arg_swap cases. tool_in_prompt / overlap_jaccard catch mis-targeted tool selection. This is calibrated logistic regression — you can read the coefficients directly in calibrated_weights_drift_v1.py.
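Since the detector is a calibrated logistic regression, the verdict logic can be sketched in a few lines. The weights and intercept below are illustrative placeholders, not the shipped coefficients (those live in calibrated_weights_drift_v1.py), and only 3 of the 23 features are shown:

```python
import math

# Illustrative weights for 3 of the 23 features; the real coefficients
# are in calibrated_weights_drift_v1.py. Values here are placeholders.
WEIGHTS = {
    "spurious_arg_frac": 2.1,      # hallucinated extra fields
    "missing_required_frac": 1.8,  # dropped required args
    "arg_order_inversion": 1.4,    # value-swapped args (new in v6.1)
}
BIAS = -1.2  # illustrative intercept

def drift_risk(features: dict) -> float:
    """Calibrated logistic regression: sigmoid of a weighted feature sum."""
    z = BIAS + sum(WEIGHTS[k] * features.get(k, 0.0) for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

def verdict(features: dict) -> str:
    """Apply the 0.50 threshold described above."""
    return "DRIFT" if drift_risk(features) >= 0.50 else "OK"
```

Because the model is linear in the features, each product `weight * feature` is directly the signal's contribution to the log-odds, which is what the "top signals" panel ranks.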
the research — BFCL v3 5-fold CV + per-drift-type AUC
Trained on 3,700 labeled (prompt, functions, tool_call) triplets from Berkeley Function Calling Leaderboard v3 — 658 gold no-drift samples plus 3,042 drift positives via mutation (arg_swap, arg_drop, spurious_arg, tool_rename) and natural irrelevance splits. 5-fold stratified CV AUC 0.943 ± 0.009 (pooled 0.943) on the v6.1 retrain; v6.0 baseline was 0.916. Best null heuristic (schema_conformance) caps at 0.733 — the calibrated detector beats every null by +0.210.
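The four mutation families used to manufacture drift positives can be sketched as follows. This is an illustrative reconstruction of the idea, not the actual dataset-generation code; the helper name `mutate` and the `_v2` rename suffix are assumptions:

```python
import copy
import random

def mutate(call: dict, kind: str, rng: random.Random) -> dict:
    """Turn a gold {name, arguments} tool call into a labeled drift
    positive. Sketch only; the real BFCL mutation pipeline may differ."""
    m = copy.deepcopy(call)
    args = m["arguments"]
    keys = sorted(args)
    if kind == "arg_drop" and keys:
        del args[rng.choice(keys)]            # drop a required argument
    elif kind == "spurious_arg":
        args["extra_field"] = "unrequested"   # hallucinate an extra field
    elif kind == "arg_swap" and len(keys) >= 2:
        a, b = rng.sample(keys, 2)
        args[a], args[b] = args[b], args[a]   # swap two argument values
    elif kind == "tool_rename":
        m["name"] = m["name"] + "_v2"         # call a tool not in the schema
    return m
```

Note that arg_swap leaves the set of keys and values unchanged, which is exactly why it is the hardest family for surface features (see the per-type AUC table below).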

spurious_arg           AUC 0.997   clean capture
arg_drop               AUC 0.997   clean capture
irrelevance_called     AUC 0.980   null baseline 0.562
arg_swap               AUC 0.755   v6.1 partial fix from 0.664 via arg_order_inversion
tool_rename            AUC n/a     n=1 in BFCL (not evaluable)

Remaining failure mode, published openly: arg_swap at AUC 0.755. The v6.1 positional-inversion feature lifts it from 0.664 but not all the way: cases where both swapped values share a prompt position (numerical ambiguity, e.g. "divide 5 by 5") or where one value was synthesized (not present in the prompt) still escape. A full close is targeted for v3 via embedding-based semantic fit of each argument value to its slot.
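One plausible shape for the positional-inversion feature, which also makes both published escape routes visible, is sketched below. This is an illustrative reconstruction; the shipped v6.1 feature may be computed differently:

```python
def arg_order_inversion(prompt: str, arguments: dict) -> float:
    """1.0 if the argument values appear in the prompt in reverse of
    their argument order, suggesting two values were swapped.
    Sketch only; not the shipped v6.1 implementation."""
    positions = []
    for value in arguments.values():
        idx = prompt.find(str(value))
        if idx < 0:
            return 0.0  # synthesized value not in prompt: feature is blind
        positions.append(idx)
    if len(positions) < 2 or len(set(positions)) < len(positions):
        return 0.0      # shared positions ("divide 5 by 5") also escape
    return 1.0 if positions != sorted(positions) else 0.0
```

The two `return 0.0` guards are precisely the residual arg_swap failure cases named above, which is why a semantic per-slot fit, rather than surface position, is needed to close them.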

Where this sits vs prior work: the only published comparable baseline is Healy et al. 2026 ("Internal Representations as Indicators of Hallucinations in Agent Tool Selection"), which reports AUC 0.716–0.721 on Glaive (n=2,411) using last-layer hidden-state features fed to an MLP, an approach that requires access to model internals. styxx reaches 0.943 on BFCL v3 with 23 text-only features and works on any closed model (OpenAI, Anthropic, Gemini) with zero weight access. This is empirical validation of cognometry's law II (cross-substrate universality) on a third instrument: same calibrated-LR methodology, third independent failure-mode family.

Reproducer: scripts/drift_calibrated_v1.py · dataset: data/drift_v0/drift_dataset_v0.jsonl. Everything reruns deterministically.
powered by cognometry + styxx · github

embed this verdict in your site

paste this snippet anywhere — it renders a live detector widget. no install, no api key, works in any static html.
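For shape only, an embed of this kind typically looks like the markup below. The `src` URL is a placeholder, not the real widget address; copy the actual snippet from the box on this page.

```html
<!-- placeholder src: substitute the real address from the snippet box -->
<iframe
  src="https://example.com/drift-v1/embed"
  width="100%" height="480"
  style="border:0"
  title="tool-call drift detector"
  loading="lazy"></iframe>
```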
iframe snippet
live preview