With Nerq vs Without Nerq

N=100 iterations per scenario. Pool of 50 agents (15 high-trust + 15 medium-trust + 10 low-trust + 10 dead/not-found). Each iteration selects 5 tools.

Failure rate without Nerq: 35.6% (±19.8% SD)
Failure rate with Nerq: 0.0% (±0.0% SD)
Avg trust (with Nerq): 92.2 (±0.0 SD)

Results (N=100 iterations)

| Metric | Without Nerq | With Nerq | Improvement |
|---|---|---|---|
| Failure rate (mean ± SD) | 35.6 ± 19.8% | 0.0 ± 0.0% | 100% reduction |
| Failure rate 95% CI | [31.7, 39.5]% | [0.0, 0.0]% | |
| Avg trust (mean ± SD) | 68.6 ± 9.5 | 92.2 ± 0.0 | +23.6 |
| Trust score 95% CI | [66.8, 70.5] | [92.2, 92.2] | |
| Wasted API calls | 1.8 | 45.0 | |
| Total API time | 221 ms | 363 ms | |
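
The confidence intervals in the table can be checked from the reported mean, SD, and N alone. The sketch below assumes a normal-approximation interval (z = 1.96); the benchmark script may use a different critical value, and the function name `mean_ci95` is only illustrative.

```python
import math

def mean_ci95(mean, sd, n):
    """Normal-approximation 95% confidence interval for a sample mean."""
    half_width = 1.96 * sd / math.sqrt(n)  # z * standard error
    return (mean - half_width, mean + half_width)

# Failure rate without Nerq: mean 35.6, SD 19.8, N = 100 (values from the table)
low, high = mean_ci95(35.6, 19.8, 100)
print(f"[{low:.1f}, {high:.1f}]")  # [31.7, 39.5]
```

With an SD of 0.0 the half-width is zero, which is why the "With Nerq" intervals collapse to a single point.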

Statistical Significance

| Test | Failure Rate | Trust Score |
|---|---|---|
| Test type | Welch's t-test | Welch's t-test |
| t-statistic | 17.9678 | -24.7497 |
| p-value | p < 0.000001 (significant) | p < 0.000001 (significant) |
| Significant at p < 0.05 | Yes | Yes |

Methodology

The pool contains 50 real agents from the Nerq index: 15 with trust > 70, 15 with trust 40–69, 10 with trust < 40, and 10 names that don’t exist in the index. Each iteration randomly selects 5 tools. This is repeated 100 times per scenario.

Without Nerq: Randomly pick 5 tools, call /v1/agent/kya/{name} for each. Tools with trust < 40 or not found count as failures.
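
A minimal offline sketch of the baseline scenario, assuming a synthetic pool rather than live calls to /v1/agent/kya/{name}. The trust values are random placeholders that only mirror the described pool composition, `None` stands in for a not-found name, and all function names are hypothetical.

```python
import random

def build_pool(rng):
    """Synthetic 50-entry pool; trust values are placeholders, not Nerq data."""
    pool = [rng.uniform(70.1, 100.0) for _ in range(15)]  # high trust (> 70)
    pool += [rng.uniform(40.0, 69.0) for _ in range(15)]  # medium trust (40-69)
    pool += [rng.uniform(0.0, 39.9) for _ in range(10)]   # low trust (< 40)
    pool += [None] * 10                                   # not found in the index
    return pool

def is_failure(trust):
    """A tool fails if it was not found or its trust is below 40."""
    return trust is None or trust < 40

def baseline_iteration(pool, rng, k=5):
    """Pick k tools uniformly at random and return the fraction that fail."""
    picked = rng.sample(pool, k)
    return sum(is_failure(t) for t in picked) / k

rng = random.Random(0)  # fixed seed so the sketch is repeatable
pool = build_pool(rng)
rates = [baseline_iteration(pool, rng) for _ in range(100)]
print(f"mean failure rate: {100 * sum(rates) / len(rates):.1f}%")
```

Because 20 of the 50 pool entries count as failures, uniform sampling gives an expected failure rate of 40% per iteration; any single 100-iteration run scatters around that value.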

With Nerq: Call /v1/preflight on all 50 candidates. Filter to PROCEED recommendations. Sort by trust descending. Pick top 5.
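
The filter, sort, and pick steps can be sketched as follows. This operates on already-fetched results and does not call /v1/preflight; the `PreflightResult` type, its field names, and the non-PROCEED recommendation values are assumptions made for illustration, not the documented response schema.

```python
from dataclasses import dataclass

@dataclass
class PreflightResult:
    name: str
    trust: float
    recommendation: str  # "PROCEED" per the text; other values are placeholders

def select_top(results, k=5):
    """Keep PROCEED recommendations, sort by trust descending, take the top k."""
    proceed = [r for r in results if r.recommendation == "PROCEED"]
    proceed.sort(key=lambda r: r.trust, reverse=True)
    return proceed[:k]

# Hypothetical candidates standing in for the 50 preflight responses
candidates = [
    PreflightResult("agent-a", 91.0, "PROCEED"),
    PreflightResult("agent-b", 35.0, "DENY"),
    PreflightResult("agent-c", 88.0, "PROCEED"),
    PreflightResult("agent-d", 95.0, "PROCEED"),
    PreflightResult("agent-e", 72.0, "PROCEED"),
    PreflightResult("agent-f", 55.0, "CAUTION"),
    PreflightResult("agent-g", 79.0, "PROCEED"),
    PreflightResult("agent-h", 83.0, "PROCEED"),
]

selected = select_top(candidates)
print([r.name for r in selected])
# ['agent-d', 'agent-a', 'agent-c', 'agent-h', 'agent-g']
```

Checking all 50 candidates to keep 5 leaves 45 lookups whose results go unused, which matches the 45.0 wasted API calls reported for this scenario.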

Statistical significance is assessed using Welch’s two-sample t-test (unequal variances) with a threshold of p < 0.05.
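
For reference, Welch's t-statistic and its Welch-Satterthwaite degrees of freedom can be computed directly from summary statistics. The sketch below uses the rounded values from the results table, so it only approximates the reported t-statistic of 17.9678; the function names are illustrative.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t-statistic for two samples with unequal variances."""
    standard_error = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    return (mean1 - mean2) / standard_error

def welch_df(sd1, n1, sd2, n2):
    """Welch-Satterthwaite approximation to the degrees of freedom."""
    v1, v2 = sd1**2 / n1, sd2**2 / n2
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# Failure rate: without Nerq (35.6 ± 19.8) vs with Nerq (0.0 ± 0.0), N = 100 each
t = welch_t(35.6, 19.8, 100, 0.0, 0.0, 100)
df = welch_df(19.8, 100, 0.0, 100)
print(f"t = {t:.2f}, df = {df:.0f}")  # t = 17.98, df = 99
```

Because the "With Nerq" sample has zero variance, the degrees of freedom reduce to those of the baseline sample alone (N - 1 = 99).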

Reproduce

python -m agentindex.nerq_benchmark_test

Data from the Nerq index.