Benchmark
Routing accuracy at 559 tools
Three eval sets, two model sizes. Same router, same FAISS indices, same manifests. The numbers below are from the latest router build (multilingual-e5-base, RERANK_ALPHA=0.7).
Setup
- Tool pool: 559 (CLIbrary 499 + MCP 60)
- Embedding model: intfloat/multilingual-e5-base (768-d)
- Strategy: clibrary, i.e. FAISS top-1 → params via small LLM (or template fill on Path A)
- Backends: Ollama local (qwen2.5:3b, qwen2.5:14b)
- Manifests: 504 in registry, intent_triggers patched 3 rounds
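The retrieval step in the setup above can be pictured with a minimal sketch. This is not the router's code: it uses pure Python in place of FAISS, toy 3-d vectors in place of 768-d embeddings, and hypothetical tool names. It only illustrates what "top-1" and "top-3" mean when tools are ranked by cosine similarity to the query.

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 norm so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query_vec, tool_vecs, k=3):
    """Return the k tool names with the highest cosine similarity to the query."""
    q = normalize(query_vec)
    scored = []
    for name, vec in tool_vecs.items():
        t = normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, t)), name))
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]

# Hypothetical tools and toy embeddings, for illustration only.
TOOLS = {
    "git-log": [0.9, 0.1, 0.0],
    "git-diff": [0.7, 0.6, 0.1],
    "docker-ps": [0.0, 0.2, 0.9],
}

query = [0.8, 0.2, 0.0]
ranked = top_k(query, TOOLS, k=3)  # candidates a top-3 strategy would pass on
top1 = ranked[0]                   # the single tool a top-1 strategy would use
```

In this picture, a top-1 strategy commits to `top1` directly, while a top-3 strategy hands all of `ranked` to the LLM to choose from.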
Headline result
Once routing is done by an embedding-based router, model size stops mattering. qwen2.5:3b and qwen2.5:14b achieve essentially identical CLI Acc on the 1,500-query paraphrase set.
| Model | Eval | N | CLI Acc | Top-3 | Pick | Params | E2E |
|---|---|---|---|---|---|---|---|
| qwen2.5:3b | paraphrase | 1,500 | 81.80% | 83.40% | 98.08% | 69.00% | 61.27% |
| qwen2.5:14b | paraphrase | 1,500 | 81.67% | 83.27% | 98.08% | 68.87% | 61.20% |
| Δ (3b − 14b) | | | +0.13pp | +0.13pp | ±0 | +0.13pp | +0.07pp |
When the router does the heavy lifting, scaling the LLM from 3B to 14B (4.7×) moves no reported metric by more than 0.13 pp. On this set, compute spent on a bigger model buys nothing; spend it on better intent_triggers instead.
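To put the 0.13 pp gap in perspective, the reported percentages can be converted back to query counts and compared with the sampling noise expected at N = 1,500. The sketch below uses only figures from the table. The binomial standard error is a rough yardstick chosen for this illustration, not a method from the benchmark: both models see the same queries, so a paired test would be the stricter tool.

```python
import math

N = 1500  # paraphrase eval size

def correct_count(acc_pct, n=N):
    """Convert a reported accuracy percentage back to a whole number of queries."""
    return round(acc_pct / 100 * n)

def std_error_pp(acc_pct, n=N):
    """Binomial standard error of an accuracy estimate, in percentage points."""
    p = acc_pct / 100
    return 100 * math.sqrt(p * (1 - p) / n)

hits_3b = correct_count(81.80)    # CLI Acc, qwen2.5:3b
hits_14b = correct_count(81.67)   # CLI Acc, qwen2.5:14b
gap_queries = hits_3b - hits_14b  # difference expressed in queries
se = std_error_pp(81.80)          # noise scale for a single accuracy estimate
```

Under these assumptions the gap works out to 2 queries out of 1,500, against a standard error of about 1 pp, so the difference sits well inside the noise.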
An even stronger version of the same finding shows up on the in_domain set, where ~98% of queries hit Path A (template fill, no LLM at all):
| Model | Eval | N | CLI Acc | Top-3 | A-path hit |
|---|---|---|---|---|---|
| qwen2.5:3b | in_domain | 500 | 86.80% | 88.00% | 98.64% |
| qwen2.5:14b | in_domain | 500 | 86.80% | 88.00% | 98.64% |
| Δ (3b − 14b) | | | ±0 | ±0 | ±0 |
Both models report identical numbers on every metric. When 98.6% of queries skip the LLM entirely, model size has almost no room to matter.
Adversarial set
The adversarial eval contains 204 intents engineered to mislead the router:
surface-level keywords pointing to tool A while the semantic core is tool B
(intent confusion, false friends, ambiguous scope, negation traps).
All numbers below use the clibrary_top3 strategy.
| Stage | CLI Acc | Top-3 | FAISS miss |
|---|---|---|---|
| Original FAISS | 58.0% | 78.9% | 43 / 204 |
| + trigger patch R1 (10 CLIs) | 73.5% | 88.2% | 24 / 204 |
| + trigger patch R2 (23 CLIs) | 83.3% | 97.1% | 6 / 204 |
| + trigger patch R3 (4 CLIs) | 86.3% | 100% | 0 / 204 |
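The FAISS miss column appears to count queries whose correct tool is absent from the top three, so Top-3 should equal (204 − misses) / 204. That reading is inferred from the numbers rather than stated outright. A short check against the table:

```python
N = 204  # adversarial eval size

# (stage, FAISS misses, reported Top-3 %) copied from the table above
stages = [
    ("Original FAISS", 43, 78.9),
    ("+ trigger patch R1", 24, 88.2),
    ("+ trigger patch R2", 6, 97.1),
    ("+ trigger patch R3", 0, 100.0),
]

def top3_from_misses(misses, n=N):
    """Top-3 accuracy (%) implied by the number of retrieval misses."""
    return round(100 * (n - misses) / n, 1)

recomputed = {name: top3_from_misses(m) for name, m, _ in stages}
consistent = all(recomputed[name] == reported for name, _, reported in stages)
```

All four rows agree with the reported Top-3 values to one decimal place.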
Final result on the post-R3 index, both models:
| Model | N | CLI Acc | Top-3 | Pick | Params | E2E |
|---|---|---|---|---|---|---|
| qwen2.5:3b | 204 | 84.31% | 100% | 84.31% | 86.27% | 77.94% |
| qwen2.5:14b | 204 | 86.27% | 100% | 86.27% | 87.75% | 79.90% |
Top-3 = 100% means the right tool is always in the top three; the remaining gap (CLI Acc 84–86%) is the LLM picking the wrong tool from those three. The bigger model helps slightly here (+1.96 pp), the only place we see model size matter.
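The decomposition described here suggests that CLI Acc is the product of the Top-3 rate and the Pick rate, so Pick = CLI Acc / Top-3. The source does not state this definition, so treat it as an assumption. It does reproduce the Pick column in both the paraphrase and adversarial tables:

```python
# (CLI Acc %, Top-3 %, reported Pick %) copied from the tables
rows = {
    "paraphrase/3b": (81.80, 83.40, 98.08),
    "paraphrase/14b": (81.67, 83.27, 98.08),
    "adversarial/3b": (84.31, 100.0, 84.31),
    "adversarial/14b": (86.27, 100.0, 86.27),
}

def implied_pick(cli_acc, top3):
    """Pick rate (%) implied by CLI Acc = Top-3 x Pick (assumed relationship)."""
    return round(100 * cli_acc / top3, 2)

implied = {name: implied_pick(acc, top3) for name, (acc, top3, _) in rows.items()}
matches = all(abs(implied[name] - rows[name][2]) < 0.005 for name in rows)
```

Read this way, the paraphrase set loses most of its accuracy at retrieval (Top-3 around 83%), while the adversarial set loses all of it at the pick step.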
Other eval sets (qwen2.5:3b)
| Eval | Strategy | N | CLI Acc | Top-3 | E2E |
|---|---|---|---|---|---|
| in_domain | clibrary | 500 | 86.80% | 88.00% | 65.20% |
| paraphrase | clibrary | 1,500 | 81.80% | 83.40% | 61.27% |
| cross_domain | clibrary | 328 | 68.90% | — | 67.70% |
| adversarial (post R3) | clibrary_top3 | 204 | 84.31% | 100% | 77.94% |
Latency & cost
Router-only metrics (these do not include the LLM call on Path B):
- Latency p50: 36 ms
- Latency p95: 39 ms
- Path A hit rate: ~80% of queries skip the LLM entirely
- Token cost (Path A): 0 — template fill is purely retrieval
- Token cost (Path B): ~150 — fixed regardless of tool count
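Combining the Path A hit rate with the two token costs gives a rough average cost per query. The sketch below is back-of-the-envelope arithmetic on the figures listed above, treating "~80%" and "~150" as exact; the function name is hypothetical.

```python
def expected_tokens_per_query(path_a_rate, path_b_tokens=150, path_a_tokens=0):
    """Average token cost when a fraction path_a_rate of queries takes Path A."""
    return path_a_rate * path_a_tokens + (1 - path_a_rate) * path_b_tokens

avg_tokens = expected_tokens_per_query(0.80)  # roughly 30 tokens per query
```

Because the Path B cost is fixed regardless of tool count, this average depends only on the hit rate, not on the size of the tool pool.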
Reproducibility
The router code is on PyPI (pip install clibrary-hub) and GitHub
(clibrary-hub/CLIbrary).
Tools live at clibrary-hub/cli-tools.
Eval sets are bundled with the router repo under benchmark/eval_sets/.