Softmax gradient policy for variance minimization and risk-averse multi armed bandits | Signal Canvas | ScienceToStartup