Rethinking Role-Playing Evaluation: Anonymous Benchmarking and a Systematic Study of Personality Effects | ScienceToStartup