Beyond Output Correctness: Benchmarking and Evaluating Large Language Model Reasoning in Coding Tasks | ScienceToStartup