Routing accuracy at 559 tools

Three eval sets, two model sizes. Same router, same FAISS indices, same manifests. The numbers below are from the latest router build (multilingual-e5-base, RERANK_ALPHA=0.7).

Setup

Headline result

Once routing is done by an embedding-based router, model size stops mattering. qwen2.5:3b and qwen2.5:14b score within 0.13 pp of each other on every metric of the 1,500-query paraphrase set.

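| Model | Eval | N | CLI Acc | Top-3 | Pick | Params | E2E |
|---|---|---|---|---|---|---|---|
| qwen2.5:3b | paraphrase | 1,500 | 81.80% | 83.40% | 98.08% | 69.00% | 61.27% |
| qwen2.5:14b | paraphrase | 1,500 | 81.67% | 83.27% | 98.08% | 68.87% | 61.20% |
| Δ (3b − 14b) | | | +0.13 pp | +0.13 pp | ±0 | +0.13 pp | +0.07 pp |

Implication.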

When the router does the heavy lifting, scaling the LLM from 3B to 14B (4.7×) yields no accuracy gain; the 3B model is in fact 0.13 pp ahead on CLI Acc. Compute spent on a bigger model is wasted. Spend it on better intent_triggers instead.
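The Δ row can be reproduced directly from the two result rows. The following is a minimal Python sketch with the values copied from the paraphrase table; the dictionary layout and function name are illustrative and not part of the router's code.

```python
# Metric values copied from the paraphrase table (percent).
RESULTS = {
    "qwen2.5:3b":  {"CLI Acc": 81.80, "Top-3": 83.40, "Pick": 98.08, "Params": 69.00, "E2E": 61.27},
    "qwen2.5:14b": {"CLI Acc": 81.67, "Top-3": 83.27, "Pick": 98.08, "Params": 68.87, "E2E": 61.20},
}


def delta_pp(small: dict, large: dict) -> dict:
    """Difference (small minus large) in percentage points, rounded to 2 places."""
    return {metric: round(small[metric] - large[metric], 2) for metric in small}


deltas = delta_pp(RESULTS["qwen2.5:3b"], RESULTS["qwen2.5:14b"])
max_gap = max(abs(value) for value in deltas.values())
size_ratio = 14 / 3  # nominal parameter counts, in billions

for metric, value in deltas.items():
    print(f"{metric:8s} {value:+.2f} pp")
print(f"largest gap: {max_gap:.2f} pp for a {size_ratio:.1f}x larger model")
```

The largest gap on any metric is 0.13 pp, and it favours the smaller model.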

An even stronger version of the same finding shows up on the in_domain set, where ~98% of queries hit Path A (template fill, no LLM at all):

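| Model | Eval | N | CLI Acc | Top-3 | A-path hit |
|---|---|---|---|---|---|
| qwen2.5:3b | in_domain | 500 | 86.80% | 88.00% | 98.64% |
| qwen2.5:14b | in_domain | 500 | 86.80% | 88.00% | 98.64% |
| Δ (3b − 14b) | | | ±0 | ±0 | ±0 |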

Both models match on every metric to the reported precision. When 98.6% of queries skip the LLM entirely, model size has almost no room to matter.
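This argument can be stated as a bound. If both models share the same router and Path A never calls the LLM, their outputs can differ only on queries that reach Path B, so the accuracy gap is at most the Path B share. The sketch below takes the reported 98.64% A-path hit rate at face value; the function name is hypothetical.

```python
def max_accuracy_gap_pp(path_a_share: float) -> float:
    """Upper bound, in percentage points, on the accuracy gap between two LLMs
    when a fraction `path_a_share` of queries is answered without any LLM."""
    if not 0.0 <= path_a_share <= 1.0:
        raise ValueError("path_a_share must be between 0 and 1")
    return (1.0 - path_a_share) * 100.0


bound = max_accuracy_gap_pp(0.9864)
print(f"gap can be at most {bound:.2f} pp")
```

Under that assumption the bound is about 1.36 pp, so an observed gap of zero is consistent with the routing split but not forced by it.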

Adversarial set

The adversarial eval contains 204 intents engineered to mislead the router: surface-level keywords pointing to tool A while the semantic core is tool B (intent confusion, false friends, ambiguous scope, negation traps). All numbers below use the clibrary_top3 strategy.

| Stage | CLI Acc | Top-3 | FAISS miss |
|---|---|---|---|
| Original FAISS | 58.0% | 78.9% | 43 / 204 |
| + trigger patch R1 (10 CLIs) | 73.5% | 88.2% | 24 / 204 |
| + trigger patch R2 (23 CLIs) | 83.3% | 97.1% | 6 / 204 |
| + trigger patch R3 (4 CLIs) | 86.3% | 100% | 0 / 204 |
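The Top-3 column is consistent with the FAISS miss counts. The sketch below recomputes it, assuming that a "FAISS miss" is a query whose correct tool is absent from the top three candidates; that reading is an assumption that happens to match the table, and the names are illustrative.

```python
N_QUERIES = 204

# (stage label, FAISS misses) copied from the table above.
STAGES = [
    ("Original FAISS", 43),
    ("+ trigger patch R1", 24),
    ("+ trigger patch R2", 6),
    ("+ trigger patch R3", 0),
]


def top3_from_misses(misses: int, total: int) -> float:
    """Top-3 recall in percent, assuming every miss is a query whose
    correct tool is not among the top three candidates."""
    return 100.0 * (total - misses) / total


recalls = [round(top3_from_misses(misses, N_QUERIES), 1) for _, misses in STAGES]
for (label, misses), recall in zip(STAGES, recalls):
    print(f"{label:20s} misses={misses:3d}  Top-3={recall:.1f}%")
```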

Final result on the post-R3 index, both models:

| Model | N | CLI Acc | Top-3 | Pick | Params | E2E |
|---|---|---|---|---|---|---|
| qwen2.5:3b | 204 | 84.31% | 100% | 84.31% | 86.27% | 77.94% |
| qwen2.5:14b | 204 | 86.27% | 100% | 86.27% | 87.75% | 79.90% |

Three rounds of trigger patching raised Top-3 recall on the adversarial set to 100%.

Top-3 = 100% means the right tool is always among the three candidates, so every remaining error (CLI Acc 84–86%) is the LLM picking the wrong one of the three. The bigger model helps slightly here (+1.96 pp, or 4 of 204 queries); this is the only place we see model size matter.
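The reported numbers are consistent with CLI Acc being the product of Top-3 recall and Pick accuracy, which is why CLI Acc equals Pick once Top-3 reaches 100%. The sketch below checks this; the decomposition is an assumption inferred from the tables, not a definition taken from the router's code.

```python
N = 204  # adversarial queries


def cli_acc(top3: float, pick: float) -> float:
    """CLI Acc as Top-3 recall times Pick accuracy (both as fractions).
    Assumed decomposition; it matches the reported numbers."""
    return top3 * pick


acc_3b = cli_acc(1.0, 0.8431)
acc_14b = cli_acc(1.0, 0.8627)

correct_3b = round(acc_3b * N)    # queries routed to the right CLI
correct_14b = round(acc_14b * N)
gap_pp = round((acc_14b - acc_3b) * 100, 2)

print(correct_3b, correct_14b, gap_pp)

# The same product reproduces the paraphrase row for qwen2.5:3b
# (Top-3 83.40%, Pick 98.08%, CLI Acc 81.80%).
paraphrase_acc = round(cli_acc(0.8340, 0.9808) * 100, 1)
print(paraphrase_acc)
```

The 1.96 pp gap corresponds to 176 versus 172 correct picks, so it rests on four queries.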

Other eval sets (qwen2.5:3b)

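| Eval | Strategy | N | CLI Acc | Top-3 | E2E |
|---|---|---|---|---|---|
| in_domain | clibrary | 500 | 86.80% | 88.00% | 65.20% |
| paraphrase | clibrary | 1,500 | 81.80% | 83.40% | 61.27% |
| cross_domain | clibrary | 328 | 68.90% | 67.70% | |
| adversarial (post R3) | clibrary_top3 | 204 | 84.31% | 100% | 77.94% |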

Latency & cost

Router-only metrics (the LLM call on Path B is not included):

Reproducibility

The router code is on PyPI (pip install clibrary-hub) and GitHub (clibrary-hub/CLIbrary). Tools live at clibrary-hub/cli-tools. Eval sets are bundled with the router repo under benchmark/eval_sets/.