Benchmark

Routing accuracy at 559 tools

Three eval sets, two model sizes. Same router, same FAISS indices, same manifests. The numbers below are from the latest router build (multilingual-e5-base, RERANK_ALPHA=0.7).

Setup

Tool pool: 559 (CLIbrary 499 + MCP 60)
Embedding model: intfloat/multilingual-e5-base (768-d)
Strategy: clibrary — FAISS top-1 → params via small LLM (or template fill on Path A)
Backends: Ollama local (qwen2.5:3b, qwen2.5:14b)
Manifests: 504 in registry, intent_triggers patched 3 rounds
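As a rough illustration of the retrieval step in this setup, the sketch below ranks tools by cosine similarity, which is what an inner-product index returns when embeddings are L2-normalized. It is a toy stand-in, not the router's code: the tool names, the 3-d vectors, and the `top_k` helper are all hypothetical, and the real pipeline uses 768-d multilingual-e5-base embeddings in FAISS.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query_vec, tool_vecs, k=3):
    """Return the k best (tool_name, score) pairs, highest cosine similarity first."""
    q = normalize(query_vec)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(vec))), name)
        for name, vec in tool_vecs.items()
    ]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:k]]


# Hypothetical 3-d "embeddings" for three placeholder tools.
TOOLS = {
    "tool_a": [0.9, 0.1, 0.0],
    "tool_b": [0.1, 0.9, 0.1],
    "tool_c": [0.0, 0.2, 0.9],
}

ranked = top_k([0.8, 0.3, 0.0], TOOLS, k=3)
top1 = ranked[0][0]                    # what a top-1 strategy would route to
top3 = [name for name, _ in ranked]    # candidate list for a top-3 strategy
print(top1, top3)
```

In this picture, a top-1 strategy commits to the first entry, while a top-3 strategy hands the three candidates to the LLM to choose from.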

Headline result

Once routing is done by an embedding-based router, model size largely stops mattering. qwen2.5:3b and qwen2.5:14b achieve essentially identical CLI Acc on the 1,500-query paraphrase set.

Model	Eval	N	CLI Acc	Top-3	Pick	Params	E2E
qwen2.5:3b	paraphrase	1,500	81.80%	83.40%	98.08%	69.00%	61.27%
qwen2.5:14b	paraphrase	1,500	81.67%	83.27%	98.08%	68.87%	61.20%
Δ (3b − 14b)			+0.13pp	+0.13pp	±0	+0.13pp	+0.07pp

Implication.

When the router does the heavy lifting, scaling the LLM from 3B to 14B (4.7×) gives essentially zero additional accuracy. Compute spent on a bigger model is wasted — spend it on better intent_triggers instead.
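To put the size of that gap in concrete terms, the short sketch below recomputes the Δ row from the paraphrase table and converts the CLI Acc difference into a query count. The dictionary layout and helper names are illustrative assumptions. The percentages are copied from the table.

```python
N = 1500  # paraphrase eval size

# Percentages copied from the paraphrase table above.
RESULTS = {
    "qwen2.5:3b": {"CLI Acc": 81.80, "Top-3": 83.40, "Pick": 98.08,
                   "Params": 69.00, "E2E": 61.27},
    "qwen2.5:14b": {"CLI Acc": 81.67, "Top-3": 83.27, "Pick": 98.08,
                    "Params": 68.87, "E2E": 61.20},
}


def delta_pp(metric):
    """Difference (3b minus 14b) in percentage points, rounded to 2 places."""
    small = RESULTS["qwen2.5:3b"][metric]
    large = RESULTS["qwen2.5:14b"][metric]
    return round(small - large, 2)


deltas = {metric: delta_pp(metric) for metric in RESULTS["qwen2.5:3b"]}

# Express the CLI Acc gap as a number of queries out of N.
cli_acc_gap_queries = round(deltas["CLI Acc"] / 100 * N)

# Parameter-count ratio between the two models (14B vs 3B).
size_ratio = round(14 / 3, 1)

print(deltas, cli_acc_gap_queries, size_ratio)
```

Read this way, the 0.13 pp CLI Acc gap corresponds to about two queries out of 1,500.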

An even stronger version of the same finding shows up on the in_domain set, where ~98% of queries hit Path A (template fill, no LLM at all):

Model	Eval	N	CLI Acc	Top-3	A-path hit
qwen2.5:3b	in_domain	500	86.80%	88.00%	98.64%
qwen2.5:14b	in_domain	500	86.80%	88.00%	98.64%
Δ (3b − 14b)			±0	±0	±0

Both models are identical to four decimal places. When 98.6% of queries skip the LLM entirely, model size has almost no room to matter.

Adversarial set

The adversarial eval contains 204 intents engineered to mislead the router: surface-level keywords pointing to tool A while the semantic core is tool B (intent confusion, false friends, ambiguous scope, negation traps). All numbers below use the clibrary_top3 strategy.

Stage	CLI Acc	Top-3	FAISS miss
Original FAISS	58.0%	78.9%	43 / 204
+ trigger patch R1 (10 CLIs)	73.5%	88.2%	24 / 204
+ trigger patch R2 (23 CLIs)	83.3%	97.1%	6 / 204
+ trigger patch R3 (4 CLIs)	86.3%	100%	0 / 204
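The Top-3 column can be cross-checked against the FAISS miss counts. The sketch below assumes a "FAISS miss" is a query whose correct tool is absent from the top three retrieved candidates. The source does not spell this out, but the table's figures agree with that reading. The stage labels and the helper name are illustrative.

```python
N = 204  # adversarial eval size

# FAISS miss counts copied from the table above, keyed by patch stage.
FAISS_MISSES = {
    "Original FAISS": 43,
    "R1": 24,
    "R2": 6,
    "R3": 0,
}


def top3_recall(misses, n=N):
    """Top-3 recall in percent, assuming every miss is a top-3 miss."""
    return round(100 * (n - misses) / n, 1)


recalls = {stage: top3_recall(m) for stage, m in FAISS_MISSES.items()}
print(recalls)
```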

Final result on the post-R3 index, both models:

Model	N	CLI Acc	Top-3	Pick	Params	E2E
qwen2.5:3b	204	84.31%	100%	84.31%	86.27%	77.94%
qwen2.5:14b	204	86.27%	100%	86.27%	87.75%	79.90%

Three rounds of trigger patching turned the adversarial set into a perfect-recall problem.

Top-3 = 100% means the right tool is always in the top three; the remaining gap (CLI Acc 84–86%) is the LLM picking wrong from those three. The bigger model helps slightly here (+1.96 pp CLI Acc), the only place we see model size matter.
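One way to read the columns is as a two-stage decomposition: CLI Acc ≈ Top-3 × Pick, where Pick is the share of top-3 hits in which the final choice is correct. The source does not define Pick, so this is an assumption. The reported numbers on both the paraphrase and adversarial sets are consistent with it, as the sketch below checks.

```python
def cli_acc(top3_pct, pick_pct):
    """CLI accuracy in percent under the assumed Top-3 x Pick decomposition."""
    return round(top3_pct * pick_pct / 100, 2)


# (Top-3, Pick) pairs copied from the tables in this document.
paraphrase_3b = cli_acc(83.40, 98.08)
paraphrase_14b = cli_acc(83.27, 98.08)
adversarial_3b = cli_acc(100.0, 84.31)
adversarial_14b = cli_acc(100.0, 86.27)

print(paraphrase_3b, paraphrase_14b, adversarial_3b, adversarial_14b)
```

Under this reading, retrieval caps accuracy on the paraphrase set (Top-3 around 83%), while on the post-R3 adversarial set the cap is gone and the pick stage accounts for the whole gap.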

Other eval sets (qwen2.5:3b)

Eval	Strategy	N	CLI Acc	Top-3	E2E
in_domain	clibrary	500	86.80%	88.00%	65.20%
paraphrase	clibrary	1,500	81.80%	83.40%	61.27%
cross_domain	clibrary	328	68.90%	—	67.70%
adversarial (post R3)	clibrary_top3	204	84.31%	100%	77.94%

Latency & cost

Router-only metrics (the LLM call on Path B is not included):

Latency p50: 36 ms
Latency p95: 39 ms
Path A hit rate: ~80% of queries skip the LLM entirely
Token cost (Path A): 0 — template fill is purely retrieval
Token cost (Path B): ~150 — fixed regardless of tool count
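Combining the hit rate with the per-path token costs gives an expected token cost per query. The sketch below is a back-of-the-envelope calculation using the approximate figures listed above. The function name is illustrative, and the result is only as precise as the "~80%" and "~150" inputs.

```python
PATH_A_TOKENS = 0     # template fill, no LLM call
PATH_B_TOKENS = 150   # approximate per-query cost when the LLM is called


def expected_tokens(path_a_hit_rate):
    """Average tokens per query for a given Path A hit rate (0.0 to 1.0)."""
    return (path_a_hit_rate * PATH_A_TOKENS
            + (1 - path_a_hit_rate) * PATH_B_TOKENS)


overall = expected_tokens(0.80)      # ~80% overall Path A hit rate
in_domain = expected_tokens(0.9864)  # A-path hit reported on in_domain

print(round(overall, 2), round(in_domain, 2))
```

At an 80% hit rate this works out to roughly 30 tokens per query on average. At the in_domain hit rate it is roughly 2.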

Reproducibility

The router code is on PyPI (pip install clibrary-hub) and GitHub (clibrary-hub/CLIbrary). Tools live at clibrary-hub/cli-tools. Eval sets are bundled with the router repo under benchmark/eval_sets/.
