# With Nerq vs Without Nerq
N=100 iterations per scenario. Pool of 50 agents (15 high-trust + 15 medium-trust + 10 low-trust + 10 dead/not-found). Each iteration selects 5 tools.
## Results (N=100 iterations)
| Metric | Without Nerq | With Nerq | Improvement |
|---|---|---|---|
| Failure rate (mean ± SD) | 35.6 ± 19.8% | 0.0 ± 0.0% | 100% reduction |
| Failure rate 95% CI | [31.7, 39.5]% | [0.0, 0.0]% | — |
| Trust score (mean ± SD) | 68.6 ± 9.5 | 92.2 ± 0.0 | +23.6 points |
| Trust score 95% CI | [66.8, 70.5] | [92.2, 92.2] | — |
| Wasted API calls (per iteration) | 1.8 | 45.0 | — |
| Total API time (per iteration) | 221 ms | 363 ms | — |
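The 95% CIs in the table follow from the reported means and SDs under a normal approximation (z = 1.96, N = 100 iterations). A quick check in Python (the function name is illustrative; other rows may differ in the last digit due to rounding of the reported SDs):

```python
import math

def ci95(mean, sd, n):
    """Normal-approximation 95% confidence interval for the mean,
    computed from summary statistics."""
    half = 1.96 * sd / math.sqrt(n)
    return (round(mean - half, 1), round(mean + half, 1))

# Failure rate without Nerq: 35.6 ± 19.8%, N = 100
print(ci95(35.6, 19.8, 100))  # (31.7, 39.5) — matches the table
```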
## Statistical Significance
| Test | Failure Rate | Trust Score |
|---|---|---|
| Test type | Welch's t-test | Welch's t-test |
| t-statistic | 17.9678 | -24.7497 |
| p-value | p < 0.000001 (significant) | p < 0.000001 (significant) |
| Significant at 95% | Yes | Yes |
## Methodology
The pool contains 50 real agents from the Nerq index: 15 with trust > 70, 15 with trust 40–69, 10 with trust < 40, and 10 names that don’t exist in the index. Each iteration randomly selects 5 tools. This is repeated 100 times per scenario.
- **Without Nerq:** randomly pick 5 tools and call `/v1/agent/kya/{name}` for each. Tools with trust < 40 or not found in the index count as failures.
- **With Nerq:** call `/v1/preflight` on all 50 candidates, filter to PROCEED recommendations, sort by trust descending, and pick the top 5.
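The two selection procedures can be sketched as follows, using a hypothetical in-memory pool that mirrors the 15/15/10/10 mix. The agent names, trust values, and the PROCEED criterion (known agent with trust ≥ 40) are assumptions for illustration, not the real index or API:

```python
import random

random.seed(0)

# Hypothetical pool: (name, trust score); None means not found in the index.
pool = (
    [("high%d" % i, random.uniform(70, 100)) for i in range(15)]
    + [("med%d" % i, random.uniform(40, 69)) for i in range(15)]
    + [("low%d" % i, random.uniform(0, 39)) for i in range(10)]
    + [("dead%d" % i, None) for i in range(10)]
)

def without_nerq(pool, k=5):
    """Random pick: trust < 40 or not found counts as a failure."""
    picks = random.sample(pool, k)
    failures = sum(1 for _, t in picks if t is None or t < 40)
    return failures / k

def with_nerq(pool, k=5):
    """Preflight-style pick: keep only PROCEED candidates (assumed here to be
    known agents with trust >= 40), sort by trust descending, take the top k."""
    proceed = sorted(
        (p for p in pool if p[1] is not None and p[1] >= 40),
        key=lambda p: p[1],
        reverse=True,
    )
    return sum(1 for _, t in proceed[:k] if t < 40) / k

rates = [without_nerq(pool) for _ in range(100)]
print(sum(rates) / len(rates))  # mean failure rate without Nerq, typically near 0.4
print(with_nerq(pool))          # 0.0: the filter excludes every failure case
```

The 20 low-trust/not-found entries out of 50 give an expected random-pick failure rate of 0.4, in the same range as the observed 35.6%.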
Statistical significance is assessed using Welch’s two-sample t-test (unequal variances) with a threshold of p < 0.05.
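With one group's SD at zero, Welch's t reduces to a simple ratio and can be recovered from the summary statistics alone. A sketch (small discrepancies against the table come from rounding of the reported means and SDs; the benchmark itself used the raw per-iteration data):

```python
import math

def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's two-sample t statistic (unequal variances) from summary stats."""
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    return (m1 - m2) / se

# Failure rate: 35.6 ± 19.8 (n=100) vs 0.0 ± 0.0 (n=100)
print(round(welch_t(35.6, 19.8, 100, 0.0, 0.0, 100), 2))  # 17.98, vs reported 17.9678
# Trust score: 68.6 ± 9.5 vs 92.2 ± 0.0 — lands near the reported -24.7497
print(round(welch_t(68.6, 9.5, 100, 92.2, 0.0, 100), 2))
```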
## Reproduce

```shell
python -m agentindex.nerq_benchmark_test
```
Data from the Nerq index.