Methodology

How we score papers, what the numbers mean, and the audit trail every score carries.

Ground-truth disclosure: ground truth is graded by Opus-4.7 and spot-checked by the founder; it is not founder-graded.

Aggregate Brier score — last 30 days

Lower is better. The Brier score is the mean squared error between the model's predicted probability and the eventual binary outcome. Our public methodology gate is ≤ 0.30; the current value is 0.224 with a 95% Wilson CI of [0.176, 0.281].
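The two statistics above can be sketched in a few lines. This is an illustrative Python version, not the site's actual code (the real helpers live in apps/web/lib/methodology/brier.ts); note that the Wilson interval is textbook-defined for binomial proportions, so applying it to a Brier score is taken here as the page's stated convention:

```python
import math

def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    assert len(probs) == len(outcomes) and probs
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2))
    return centre - half, centre + half
```

A prediction of 0.5 scores 0.25 regardless of outcome, which is why an uninformed coin-flip forecaster sits at exactly 0.25, just above the 0.224 reported here.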

[Chart: aggregate Brier over the trailing 30 days, 2026-04-10 through today (2026-05-09); y-axis 0.00–0.35 with the gate marked at ≤ 0.30; shaded band is the 95% CI [0.176, 0.281].]

Calibration plot

Each dot is one of ten predicted-probability deciles. The dashed grey line is perfect calibration; the solid blue line is the OLS fit through the deciles (slope 0.82, intercept 0.09).
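The quoted slope and intercept come from an ordinary least-squares fit through the ten decile points. A self-contained sketch of that fit (illustrative only, not the site's code; a slope below 1 with a positive intercept, as reported, indicates mild overconfidence at the extremes):

```python
def ols_fit(xs, ys):
    """Ordinary least squares through (x, y) points; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx
```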

[Chart: calibration plot; predicted probability on the x-axis vs observed frequency on the y-axis, both 0.00–1.00.]

Reliability diagram

Bar height is the empirical observed-positive rate within each decile. Bar opacity scales with the number of predictions that landed in the bucket.

[Chart: reliability diagram; x-axis is the predicted-probability decile, y-axis is the observed-positive rate 0.00–1.00. Counts per decile: 0.05 (n=3), 0.15 (n=15), 0.25 (n=26), 0.35 (n=35), 0.45 (n=47), 0.55 (n=42), 0.65 (n=26), 0.75 (n=31), 0.85 (n=13), 0.95 (n=2).]
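The decile bucketing behind the diagram can be sketched as follows. This is an assumption about the binning scheme (equal-width deciles with 1.0 folded into the top bucket), not the production implementation:

```python
def decile_buckets(probs, outcomes):
    """Group predictions into ten equal-width probability deciles.

    Returns, per decile, (observed-positive rate, n): the quantities the
    reliability diagram plots as bar height and bar opacity. Empty buckets
    report a rate of None."""
    buckets = [[] for _ in range(10)]
    for p, y in zip(probs, outcomes):
        i = min(int(p * 10), 9)  # p == 1.0 falls into the top decile
        buckets[i].append(y)
    return [(sum(b) / len(b) if b else None, len(b)) for b in buckets]
```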

Per-sub-area Brier — 7 cs.AI sub-areas

Aggregate scores can hide a regression in a single sub-area, so we track each of the seven cs.AI sub-areas separately; a regression in one cannot be diluted by gains elsewhere. The bar marks the 95% Wilson interval; the dot marks the point estimate.

[Chart: per-sub-area Brier with 95% Wilson intervals; y-axis 0.00–0.50. Point estimates (n=200 each): cs.LG 0.204, cs.CV 0.216, cs.CL 0.210, cs.RO 0.209, cs.NE 0.191, cs.IR 0.186, cs.MA 0.197.]
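The per-sub-area tracking described above amounts to grouping scored predictions by sub-area before averaging. A minimal sketch, assuming records arrive as (sub_area, probability, outcome) tuples and reusing the page's 0.30 gate as the per-area threshold (that reuse is an assumption, not stated policy):

```python
from collections import defaultdict

def per_area_brier(records):
    """records: iterable of (sub_area, predicted_prob, outcome) tuples.
    Returns {sub_area: (brier, n)}, so no one area's regression is diluted."""
    grouped = defaultdict(list)
    for area, p, y in records:
        grouped[area].append((p - y) ** 2)
    return {a: (sum(errs) / len(errs), len(errs)) for a, errs in grouped.items()}

def regressed_areas(scores, gate=0.30):
    """Sub-areas whose point estimate breaches the methodology gate."""
    return sorted(a for a, (b, _) in scores.items() if b > gate)
```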

CitationVelocity weight

The Top-10 Daily Dashboard composes its rank from a base Signal-Fusion score plus a small per-paper velocity term:

signal_fusion_score = base_signal_fusion + velocity_weight * normalized_velocity_30d

Current velocity_weight = 0.30. The value is versioned in config/signal_fusion_weights.yaml, and any change is routed through ShadowPromoter (kind=weight_promotion); BrierGate auto-reverts on a calibration drop and starts the 7-day cooldown. Setting velocity_weight = 0 reproduces the prior P3 ranking byte-for-byte (the zero-regression invariant).
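A toy sketch of the composition and the zero-regression invariant. The paper IDs and scores are made up, and signal_fusion_score here is an illustration of the formula above, not the production implementation:

```python
def signal_fusion_score(base, velocity_30d, velocity_weight=0.30):
    # The rank term from the formula above: base plus a weighted velocity term.
    # velocity_30d is assumed already normalized to [0, 1].
    return base + velocity_weight * velocity_30d

# Hypothetical papers: (base_signal_fusion, normalized_velocity_30d)
papers = {"A": (0.82, 0.10), "B": (0.80, 0.40)}

def rank(weight):
    return sorted(papers, key=lambda k: -signal_fusion_score(*papers[k], weight))

# With weight 0 the ranking must match the base-only ranking exactly:
# this is the zero-regression invariant in miniature.
assert rank(0.0) == sorted(papers, key=lambda k: -papers[k][0])
```

At the live weight of 0.30, paper B's higher velocity can overtake A's higher base score, which is exactly the behaviour the invariant lets you switch off for comparison.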

How to verify

  • Read the JSON twin at /api/methodology.json — same numbers, agent-readable.
  • Pull the open-source Brier helpers from apps/web/lib/methodology/brier.ts and re-run the math against your own holdout.
  • Inspect the recompute cron at apps/api/cron/methodology_recompute.py (06:30 UTC daily, after the 06:00 scorecard sweep).
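The first step above can be scripted against a saved copy of the JSON twin. The field names (aggregate_brier, gate) are assumptions about the twin's schema, not its documented shape; substitute the real keys after inspecting /api/methodology.json:

```python
import json

def passes_gate(doc):
    """Re-check the public methodology gate from the JSON twin.
    Key names are hypothetical placeholders for the twin's real schema."""
    return doc["aggregate_brier"] <= doc["gate"]

# A made-up sample payload mirroring the numbers on this page.
sample = json.loads('{"aggregate_brier": 0.224, "gate": 0.30, "n": 240}')
```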