Test-time RL alignment exposes task familiarity artifacts in LLM benchmarks | ScienceToStartup