Quantifying construct validity in large language model evaluations | ScienceToStartup