# With Nerq vs Without Nerq
N=100 iterations per scenario. Pool of 50 agents (15 high-trust + 15 medium-trust + 10 low-trust + 10 dead/not-found). Each iteration selects 5 tools.
## Results (N=100 iterations)
| Metric | Without Nerq | With Nerq | Improvement |
|---|---|---|---|
| Failure rate (mean ± SD) | 35.6 ± 19.8% | 0.0 ± 0.0% | 100% reduction |
| Failure rate 95% CI | [31.7, 39.5]% | [0.0, 0.0]% | — |
| Trust score (mean ± SD) | 68.6 ± 9.5 | 92.2 ± 0.0 | +23.6 points |
| Trust score 95% CI | [66.8, 70.5] | [92.2, 92.2] | — |
| Wasted API calls (per iteration) | 1.8 | 45.0 | — |
| Total API time (per iteration) | 221 ms | 363 ms | — |
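The 95% CIs in the table follow from the reported means and SDs under a normal approximation (z = 1.96, N = 100 iterations). A quick check in Python (the function name is illustrative; other rows may differ in the last digit due to rounding of the reported SDs):

```python
import math

def ci95(mean, sd, n):
    """Normal-approximation 95% confidence interval for the mean,
    computed from summary statistics."""
    half = 1.96 * sd / math.sqrt(n)
    return (round(mean - half, 1), round(mean + half, 1))

# Failure rate without Nerq: 35.6 ± 19.8%, N = 100
print(ci95(35.6, 19.8, 100))  # (31.7, 39.5) — matches the table
```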
## Statistical Significance
| Test | Failure Rate | Trust Score |
|---|---|---|
| Test type | Welch's t-test | Welch's t-test |
| t-statistic | 17.9678 | -24.7497 |
| p-value | p < 0.000001 (significant) | p < 0.000001 (significant) |
| Significant at 95% | Yes | Yes |
## Methodology
The pool contains 50 real agents from the Nerq index: 15 with trust > 70, 15 with trust 40–69, 10 with trust < 40, and 10 names that don’t exist in the index. Each iteration randomly selects 5 tools. This is repeated 100 times per scenario.
- **Without Nerq:** randomly pick 5 tools and call `/v1/agent/kya/{name}` for each. Tools with trust < 40 or not found in the index count as failures.
- **With Nerq:** call `/v1/preflight` on all 50 candidates, filter to PROCEED recommendations, sort by trust descending, and pick the top 5.
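The two selection procedures can be sketched as follows, using a hypothetical in-memory pool that mirrors the 15/15/10/10 mix. The agent names, trust values, and the PROCEED criterion (known agent with trust ≥ 40) are assumptions for illustration, not the real index or API:

```python
import random

random.seed(0)

# Hypothetical pool: (name, trust score); None means not found in the index.
pool = (
    [("high%d" % i, random.uniform(70, 100)) for i in range(15)]
    + [("med%d" % i, random.uniform(40, 69)) for i in range(15)]
    + [("low%d" % i, random.uniform(0, 39)) for i in range(10)]
    + [("dead%d" % i, None) for i in range(10)]
)

def without_nerq(pool, k=5):
    """Random pick: trust < 40 or not found counts as a failure."""
    picks = random.sample(pool, k)
    failures = sum(1 for _, t in picks if t is None or t < 40)
    return failures / k

def with_nerq(pool, k=5):
    """Preflight-style pick: keep only PROCEED candidates (assumed here to be
    known agents with trust >= 40), sort by trust descending, take the top k."""
    proceed = sorted(
        (p for p in pool if p[1] is not None and p[1] >= 40),
        key=lambda p: p[1],
        reverse=True,
    )
    return sum(1 for _, t in proceed[:k] if t < 40) / k

rates = [without_nerq(pool) for _ in range(100)]
print(sum(rates) / len(rates))  # mean failure rate without Nerq, typically near 0.4
print(with_nerq(pool))          # 0.0: the filter excludes every failure case
```

The 20 low-trust/not-found entries out of 50 give an expected random-pick failure rate of 0.4, in the same range as the observed 35.6%.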
Statistical significance is assessed using Welch’s two-sample t-test (unequal variances) with a threshold of p < 0.05.
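With one group's SD at zero, Welch's t reduces to a simple ratio and can be recovered from the summary statistics alone. A sketch (small discrepancies against the table come from rounding of the reported means and SDs; the benchmark itself used the raw per-iteration data):

```python
import math

def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's two-sample t statistic (unequal variances) from summary stats."""
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    return (m1 - m2) / se

# Failure rate: 35.6 ± 19.8 (n=100) vs 0.0 ± 0.0 (n=100)
print(round(welch_t(35.6, 19.8, 100, 0.0, 0.0, 100), 2))  # 17.98, vs reported 17.9678
# Trust score: 68.6 ± 9.5 vs 92.2 ± 0.0 — lands near the reported -24.7497
print(round(welch_t(68.6, 9.5, 100, 92.2, 0.0, 100), 2))
```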
## Reproduce

```shell
python -m agentindex.nerq_benchmark_test
```
Data from the Nerq index.