Methodology

How we score papers, what the numbers mean, and the audit trail every score carries.

Ground-truth disclosure: ground truth is graded by Opus-4.7 and spot-checked by the founder; it is not founder-graded.

Aggregate Brier score — last 30 days

Lower is better. The Brier score is the mean squared error between the model's predicted probability and the eventual binary outcome. Our public methodology gate is ≤ 0.30; the current value is 0.224 with a 95% Wilson CI of [0.176, 0.281].
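The two statistics above can be sketched in a few lines. This is an illustrative Python version, not the site's actual code (the real helpers live in apps/web/lib/methodology/brier.ts); note that the Wilson interval is textbook-defined for binomial proportions, so applying it to a Brier score is taken here as the page's stated convention:

```python
import math

def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    assert len(probs) == len(outcomes) and probs
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2))
    return centre - half, centre + half
```

A prediction of 0.5 scores 0.25 regardless of outcome, which is why an uninformed coin-flip forecaster sits at exactly 0.25, just above the 0.224 reported here.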

[Chart: aggregate Brier over the trailing 30 days, 2026-04-10 through today (2026-05-09); y-axis 0.00–0.35 with the gate marked at ≤ 0.30; shaded band is the 95% CI [0.176, 0.281].]

Calibration plot

Each dot is one of ten predicted-probability deciles. The dashed grey line is perfect calibration; the solid blue line is the OLS fit through the deciles (slope 0.82, intercept 0.09).
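The quoted slope and intercept come from an ordinary least-squares fit through the ten decile points. A self-contained sketch of that fit (illustrative only, not the site's code; a slope below 1 with a positive intercept, as reported, indicates mild overconfidence at the extremes):

```python
def ols_fit(xs, ys):
    """Ordinary least squares through (x, y) points; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx
```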

[Chart: calibration plot; predicted probability on the x-axis vs observed frequency on the y-axis, both 0.00–1.00.]

Reliability diagram

Bar height is the empirical observed-positive rate within each decile. Bar opacity scales with the number of predictions that landed in the bucket.

[Chart: reliability diagram; x-axis is the predicted-probability decile, y-axis is the observed-positive rate 0.00–1.00. Counts per decile: 0.05 (n=3), 0.15 (n=15), 0.25 (n=26), 0.35 (n=35), 0.45 (n=47), 0.55 (n=42), 0.65 (n=26), 0.75 (n=31), 0.85 (n=13), 0.95 (n=2).]
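The decile bucketing behind the diagram can be sketched as follows. This is an assumption about the binning scheme (equal-width deciles with 1.0 folded into the top bucket), not the production implementation:

```python
def decile_buckets(probs, outcomes):
    """Group predictions into ten equal-width probability deciles.

    Returns, per decile, (observed-positive rate, n): the quantities the
    reliability diagram plots as bar height and bar opacity. Empty buckets
    report a rate of None."""
    buckets = [[] for _ in range(10)]
    for p, y in zip(probs, outcomes):
        i = min(int(p * 10), 9)  # p == 1.0 falls into the top decile
        buckets[i].append(y)
    return [(sum(b) / len(b) if b else None, len(b)) for b in buckets]
```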

Per-sub-area Brier — 7 cs.AI sub-areas

Aggregate scores can hide a regression in a single sub-area, so we track each of the seven cs.AI sub-areas separately; a regression in one cannot be diluted by gains elsewhere. The bar marks the 95% Wilson interval; the dot marks the point estimate.

[Chart: per-sub-area Brier with 95% Wilson intervals; y-axis 0.00–0.50. Point estimates (n=200 each): cs.LG 0.204, cs.CV 0.216, cs.CL 0.210, cs.RO 0.209, cs.NE 0.191, cs.IR 0.186, cs.MA 0.197.]
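The per-sub-area tracking described above amounts to grouping scored predictions by sub-area before averaging. A minimal sketch, assuming records arrive as (sub_area, probability, outcome) tuples and reusing the page's 0.30 gate as the per-area threshold (that reuse is an assumption, not stated policy):

```python
from collections import defaultdict

def per_area_brier(records):
    """records: iterable of (sub_area, predicted_prob, outcome) tuples.
    Returns {sub_area: (brier, n)}, so no one area's regression is diluted."""
    grouped = defaultdict(list)
    for area, p, y in records:
        grouped[area].append((p - y) ** 2)
    return {a: (sum(errs) / len(errs), len(errs)) for a, errs in grouped.items()}

def regressed_areas(scores, gate=0.30):
    """Sub-areas whose point estimate breaches the methodology gate."""
    return sorted(a for a, (b, _) in scores.items() if b > gate)
```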

CitationVelocity weight

The Top-10 Daily Dashboard composes its rank from a base Signal-Fusion score plus a small per-paper velocity term:

signal_fusion_score = base_signal_fusion + velocity_weight * normalized_velocity_30d

Current velocity_weight = 0.30. The value is versioned in config/signal_fusion_weights.yaml, and any change is routed through ShadowPromoter (kind=weight_promotion); BrierGate auto-reverts on a calibration drop and starts the 7-day cooldown. Setting velocity_weight = 0 reproduces the prior P3 ranking byte-for-byte (the zero-regression invariant).
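A toy sketch of the composition and the zero-regression invariant. The paper IDs and scores are made up, and signal_fusion_score here is an illustration of the formula above, not the production implementation:

```python
def signal_fusion_score(base, velocity_30d, velocity_weight=0.30):
    # The rank term from the formula above: base plus a weighted velocity term.
    # velocity_30d is assumed already normalized to [0, 1].
    return base + velocity_weight * velocity_30d

# Hypothetical papers: (base_signal_fusion, normalized_velocity_30d)
papers = {"A": (0.82, 0.10), "B": (0.80, 0.40)}

def rank(weight):
    return sorted(papers, key=lambda k: -signal_fusion_score(*papers[k], weight))

# With weight 0 the ranking must match the base-only ranking exactly:
# this is the zero-regression invariant in miniature.
assert rank(0.0) == sorted(papers, key=lambda k: -papers[k][0])
```

At the live weight of 0.30, paper B's higher velocity can overtake A's higher base score, which is exactly the behaviour the invariant lets you switch off for comparison.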

How to verify

  • Read the JSON twin at /api/methodology.json — same numbers, agent-readable.
  • Pull the open-source Brier helpers from apps/web/lib/methodology/brier.ts and re-run the math against your own holdout.
  • Inspect the recompute cron at apps/api/cron/methodology_recompute.py (06:30 UTC daily, after the 06:00 scorecard sweep).
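The first step above can be scripted against a saved copy of the JSON twin. The field names (aggregate_brier, gate) are assumptions about the twin's schema, not its documented shape; substitute the real keys after inspecting /api/methodology.json:

```python
import json

def passes_gate(doc):
    """Re-check the public methodology gate from the JSON twin.
    Key names are hypothetical placeholders for the twin's real schema."""
    return doc["aggregate_brier"] <= doc["gate"]

# A made-up sample payload mirroring the numbers on this page.
sample = json.loads('{"aggregate_brier": 0.224, "gate": 0.30, "n": 240}')
```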