Methodology
How we score papers, what the numbers mean, and the audit trail every score carries.
Ground-truth disclosure. Ground truth is Opus-4.7-graded with founder spot-checks; it is not founder-graded.
Aggregate Brier score — last 30 days
Lower is better. Brier score is mean squared error between the model's predicted probability and the eventual binary outcome. Our public methodology gate is ≤ 0.30; current value is 0.224 with 95% Wilson CI [0.176, 0.281].
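The Brier score defined above is straightforward to recompute yourself. A minimal sketch (the function name and sample values are illustrative, not from our codebase):

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and binary outcomes.

    probs: predicted probabilities in [0, 1]
    outcomes: realized outcomes, each 0 or 1
    """
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

# A confident correct prediction contributes ~0; a confident miss contributes ~1.
score = brier_score([0.9, 0.2, 0.7], [1, 0, 1])
# (0.01 + 0.04 + 0.09) / 3 ≈ 0.0467
```

A score of 0.25 is what constant 0.5 predictions earn on any outcome mix, so a gate of ≤ 0.30 still requires meaningful discrimination once calibration is accounted for.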
Calibration plot
Each dot is one of ten predicted-probability deciles. The dashed grey line is perfect calibration; the solid blue line is the OLS fit through the deciles (slope 0.82, intercept 0.09).
Reliability diagram
Bar height is the empirical observed-positive rate within each decile. Bar opacity scales with the number of predictions that landed in the bucket.
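Both plots start from the same decile binning: predictions are grouped by predicted probability, then each bucket reports its mean predicted probability, its observed-positive rate, and its count. A sketch of that binning (helper name is illustrative):

```python
from collections import defaultdict

def decile_stats(probs, outcomes):
    """Bin predictions into ten predicted-probability deciles.

    Returns {decile_index: (mean_predicted, observed_positive_rate, count)}.
    Deciles are [0.0, 0.1), [0.1, 0.2), ..., with p = 1.0 folded into the top bucket.
    """
    buckets = defaultdict(list)
    for p, y in zip(probs, outcomes):
        buckets[min(int(p * 10), 9)].append((p, y))
    return {
        d: (
            sum(p for p, _ in items) / len(items),   # x of the calibration dot
            sum(y for _, y in items) / len(items),   # bar height / y of the dot
            len(items),                              # drives bar opacity
        )
        for d, items in sorted(buckets.items())
    }
```

Perfect calibration means each decile's observed-positive rate matches its mean predicted probability, which is the dashed diagonal on the calibration plot.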
Per-sub-area Brier — 7 cs.AI sub-areas
Aggregate scores can hide a regression in a single sub-area, so we track each of the seven cs.AI sub-areas separately; a drop in one can't be diluted by gains elsewhere. The bar marks the 95% Wilson interval; the dot marks the point estimate.
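The Wilson score interval used for these bars can be sketched as follows. This is the standard Wilson interval for a binomial proportion; the function name is illustrative, and how it is applied to a per-sub-area score is an assumption here:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion (z = 1.96).

    Unlike the naive normal interval, it stays inside [0, 1] and behaves
    sensibly for small n, which matters for thin sub-area buckets.
    """
    if n == 0:
        return (0.0, 1.0)
    phat = successes / n
    denom = 1 + z ** 2 / n
    centre = (phat + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(phat * (1 - phat) / n + z ** 2 / (4 * n ** 2)) / denom
    return (centre - half, centre + half)
```

For 5 successes out of 10 this gives roughly (0.24, 0.76), visibly wider than a large-n interval around the same point estimate.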
CitationVelocity weight
The Top-10 Daily Dashboard composes its rank from a base Signal-Fusion score plus a small per-paper velocity term:
signal_fusion_score = base_signal_fusion + velocity_weight * normalized_velocity_30d
Current velocity_weight = 0.30. The value is versioned in config/signal_fusion_weights.yaml, and any change is routed through ShadowPromoter (kind=weight_promotion); BrierGate auto-reverts on a calibration drop and starts the 7-day cooldown. Setting velocity_weight = 0 reproduces the prior P3 ranking byte-for-byte (the zero-regression invariant).
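The composition and the zero-regression invariant can be sketched directly from the formula above. The paper IDs and scores below are made up for illustration:

```python
def fused_score(base_signal_fusion, normalized_velocity_30d, velocity_weight=0.30):
    """Dashboard rank score: base Signal-Fusion score plus weighted velocity term."""
    return base_signal_fusion + velocity_weight * normalized_velocity_30d

# (paper_id, base_signal_fusion, normalized_velocity_30d) — hypothetical values
papers = [("A", 0.81, 0.9), ("B", 0.85, 0.1), ("C", 0.78, 0.5)]

def rank(papers, velocity_weight):
    return [pid for pid, base, vel in
            sorted(papers, key=lambda t: -fused_score(t[1], t[2], velocity_weight))]

# With velocity_weight = 0 the ordering is the pure base-score ordering,
# which is the zero-regression invariant the text describes.
assert rank(papers, 0.0) == ["B", "A", "C"]

# With the current weight, a fast-moving paper can overtake a higher base score.
assert rank(papers, 0.30) == ["A", "C", "B"]
```

In production the check is byte-for-byte against the prior P3 output rather than an ordering assertion, but the mechanism is the same.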
How to verify
- Read the JSON twin at /api/methodology.json — same numbers, agent-readable.
- Pull the open-source Brier helpers from apps/web/lib/methodology/brier.ts and re-run the math against your own holdout.
- Inspect the recompute cron at apps/api/cron/methodology_recompute.py (06:30 UTC daily, after the 06:00 scorecard sweep).