Softmax gradient policy for variance minimization and risk-averse multi armed bandits | ScienceToStartup