The Limits of Learning from Pictures and Text: Vision-Language Models and Embodied Scene Understanding | Signal Canvas | ScienceToStartup