StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues | ScienceToStartup