Grounding Vision and Language to 3D Masks for Long-Horizon Box Rearrangement | ScienceToStartup