All Leaks Count, Some Count More: Interpretable Temporal Contamination Detection in LLM Backtesting explores how to detect and mitigate temporal contamination in historical backtesting of LLMs. Commercial viability score: 5/10 in Temporal Contamination Detection.
6mo ROI: 0.5-1.5x
3yr ROI: 5-12x
Computer vision products require more validation time. Hardware integrations may slow early revenue, but $100K+ deals at 3yr are common.
Authors:
Ryan Chen (Northwestern University)
Bradly C. Stadie (Northwestern University and Bridgewater AIA Labs)
High Potential: 1/4 signals
Quick Build: 3/4 signals
Series A Potential: 2/4 signals
Sources used for this analysis:
arXiv Paper: full-text PDF analysis of the research paper
GitHub Repository: code availability, stars, and contributor activity
Citation Network: Semantic Scholar citations and co-citation patterns
Community Predictions: crowd-sourced unicorn probability assessments
Analysis model: GPT-4o · Last scored: 4/2/2026
Without properly addressing temporal contamination, evaluations of LLMs on historical data may provide inaccurate or inflated performance results, undermining their reliability for future forecasting tasks.
A potential product could involve an API service for financial and legal sectors to ensure forecasting models are free from post-cutoff data bias, guaranteeing clearer attribution of prediction performance.
The solution could replace or enhance existing heuristic-based systems for de-biasing historical model evaluations in financial and legal predictions.
The market spans financial institutions like fund managers, legal analytics firms, or any entity needing reliable predictive modeling without the risk of hindsight bias. These institutions invest heavily in analytics to optimize decision-making.
Commercial application in ensuring the reliability and accuracy of LLM-driven financial forecasting tools by eliminating temporal contamination from predictions.
The paper introduces methods to detect and measure temporal knowledge leakage in large language models (LLMs) when backtesting predictions on historical data. Each prediction is decomposed into atomic, verifiable claims, and Shapley values are applied to quantify how much each claim contributes to information leakage.
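The claim-level attribution described above can be sketched with the standard exact Shapley formula. This is an illustrative implementation, not the paper's code: `value_fn` is a hypothetical stand-in for whatever leakage or backtest score the evaluator assigns to a subset of claims.

```python
from itertools import combinations
from math import factorial

def shapley_values(claims, value_fn):
    """Exact Shapley attribution of a score over atomic claims.

    claims: list of claim identifiers extracted from one prediction.
    value_fn: maps a frozenset of claims to a scalar score (e.g. a
    leakage or backtest metric computed with only those claims present).
    This is an assumed interface for illustration.
    """
    n = len(claims)
    phi = {}
    for c in claims:
        others = [x for x in claims if x != c]
        total = 0.0
        # Average the marginal contribution of claim c over all subsets
        # of the remaining claims, with the usual Shapley weighting.
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value_fn(s | {c}) - value_fn(s))
        phi[c] = total
    return phi
```

For an additive toy score such as `len(subset)`, each claim receives an attribution of exactly 1.0, which is a quick sanity check on the weighting.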
The method was validated on datasets from three domains, Supreme Court case predictions, NBA salary estimations, and stock return rankings, demonstrating a significant reduction in decision-critical leakage while retaining predictive performance.
The method is computationally intensive because it relies on external verification and Shapley value computation, which may limit scalability and real-time application.
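The exponential cost of exact Shapley computation is a standard bottleneck; a common mitigation (not necessarily the one used in the paper) is permutation-based Monte Carlo estimation, which trades exactness for far fewer calls to the expensive value function. As above, `value_fn` is a hypothetical scoring interface.

```python
import random

def shapley_monte_carlo(claims, value_fn, n_samples=200, seed=0):
    """Approximate Shapley values by averaging marginal contributions
    over randomly sampled claim orderings (permutation sampling).

    Uses O(n_samples * len(claims)) calls to value_fn instead of the
    O(2^len(claims)) subsets needed for the exact computation.
    """
    rng = random.Random(seed)
    phi = {c: 0.0 for c in claims}
    for _ in range(n_samples):
        order = claims[:]
        rng.shuffle(order)
        prefix = frozenset()
        prev = value_fn(prefix)
        # Add claims one at a time; each claim is credited with the
        # change in score it causes at its position in this ordering.
        for c in order:
            prefix = prefix | {c}
            cur = value_fn(prefix)
            phi[c] += cur - prev
            prev = cur
    return {c: v / n_samples for c, v in phi.items()}
```

Because each permutation reuses the running prefix score, the marginal for every claim costs a single extra evaluation, which matters when scoring a subset requires an external verification call.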