How can researchers develop more robust evaluation metrics for LLM performance on rare entities?