ARXIV:2605.09719 · 3D VISION-LANGUAGE MODELS · SUBMITTED 12 MAY · 20:16 UTC · FRESHNESS FRESH

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

Distilling 3D Spatial Reasoning into a Lightweight Vision-Language Model with CoT

Alaa Asfour · Christopher Indris · Leihan Chen · Tejas Vyas · Guanghui Wang · arXiv

A knowledge distillation framework creates a lightweight vision-language model for efficient 3D spatial reasoning with reduced computational costs.

Ship in 2-4 weeks · Score 5.0 · Evidence unverified

Opportunity summary

Pain A knowledge distillation framework creates a lightweight vision-language model for efficient 3D spatial reasoning with reduced computational costs.

Evidence 0 refs | 0 sources | 0% coverage

Blocker Evidence unverified

PROBLEM

A knowledge distillation framework creates a lightweight vision-language model for efficient 3D spatial reasoning with reduced computational costs. We propose a knowledge distillation framework that transfers spatial reasoning from a 7B teacher to a…

METHOD

Full abstract

Large-scale 3D vision-language models (VLMs) like LLaVA-3D offer strong spatial reasoning but are difficult to deploy due to high computational costs. We propose a knowledge distillation framework that transfers spatial reasoning from a 7B teacher to a 2.29B student model. Our approach achieves 8.7x lower inference latency and a 3x reduction in model size while retaining 54-72% of the teacher's performance. The framework utilizes VGGT as the vision encoder and a multi-task distillation pipeline with uncertainty-aware loss weighting. To improve reasoning without chain-of-thought (CoT) data, we introduce "Hidden CoT": learnable latent tokens that serve as an internal scratchpad before answer generation. This is the first use of latent scratchpad reasoning in distilled 3D VLMs. The student model jointly performs spatial description, depth estimation, and object detection. Experiments on ScanNet and 3D-FRONT show strong spatial understanding, reaching 68-72% accuracy on proximity and contact tasks. Our framework enables efficient 3D scene QA on resource-constrained platforms.
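The abstract names uncertainty-aware loss weighting but does not give its formula. As a minimal sketch, assuming the common homoscedastic-uncertainty formulation in which each task i carries a learnable log-variance s_i and contributes 0.5 * exp(-s_i) * L_i + 0.5 * s_i, the three student tasks could be combined as shown here. The function name, the task keys, and the loss values are hypothetical and chosen only for illustration; the paper's actual weighting may differ.

```python
import math


def uncertainty_weighted_loss(task_losses, log_vars):
    """Combine per-task losses with log-variances s_i = log(sigma_i^2).

    Assumed formulation (not confirmed by the abstract):
        total = sum_i 0.5 * exp(-s_i) * L_i + 0.5 * s_i
    In training, each s_i would be a learnable parameter; here they are plain floats.
    """
    if task_losses.keys() != log_vars.keys():
        raise ValueError("task_losses and log_vars must cover the same tasks")
    total = 0.0
    for task, loss in task_losses.items():
        s = log_vars[task]
        # exp(-s) down-weights tasks with high uncertainty; 0.5 * s penalizes
        # inflating the uncertainty just to shrink the loss.
        total += 0.5 * math.exp(-s) * loss + 0.5 * s
    return total


# Hypothetical per-task loss values for the three tasks named in the abstract.
losses = {
    "spatial_description": 2.0,
    "depth_estimation": 0.5,
    "object_detection": 1.0,
}

# With all log-variances at 0, every task gets the same weight of 0.5.
equal_weighting = uncertainty_weighted_loss(losses, {name: 0.0 for name in losses})

# Raising the log-variance of the noisiest task lowers its contribution.
adjusted = uncertainty_weighted_loss(
    losses,
    {"spatial_description": 1.0, "depth_estimation": 0.0, "object_detection": 0.0},
)

print(f"equal weighting: {equal_weighting:.4f}")
print(f"adjusted:        {adjusted:.4f}")
```

The appeal of this style of weighting is that the balance between tasks is learned alongside the model instead of being tuned by hand, which matters when text, depth, and detection losses sit on very different scales.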

RESULT

ScienceToStartup currently rates this 5.0/10 on the public viability pass. Our approach achieves 8.7x lower inference latency and a 3x reduction in model size while retaining 54-72% of the teacher's performance. A public repository…
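The headline ratios can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted from the abstract (7B teacher, 2.29B student, 8.7x latency reduction, 54-72% retention); the 1000 ms teacher latency is a hypothetical input for illustration, not a measured value.

```python
# Figures quoted in the abstract.
TEACHER_PARAMS_B = 7.0      # teacher size, billions of parameters
STUDENT_PARAMS_B = 2.29     # student size, billions of parameters
LATENCY_SPEEDUP = 8.7       # reported inference latency reduction factor
RETENTION_RANGE = (0.54, 0.72)  # share of teacher performance retained

# Model size reduction implied by the parameter counts.
size_ratio = TEACHER_PARAMS_B / STUDENT_PARAMS_B

# Fraction of the teacher's latency the student needs.
latency_fraction = 1.0 / LATENCY_SPEEDUP


def student_latency_ms(teacher_latency_ms):
    """Project student latency from a given teacher latency."""
    return teacher_latency_ms / LATENCY_SPEEDUP


# Hypothetical teacher latency, used only to make the ratio concrete.
example_teacher_ms = 1000.0

print(f"size reduction:   {size_ratio:.2f}x")
print(f"latency fraction: {latency_fraction:.1%} of teacher")
print(f"student latency:  {student_latency_ms(example_teacher_ms):.1f} ms "
      f"for a {example_teacher_ms:.0f} ms teacher")
print(f"retention:        {RETENTION_RANGE[0]:.0%} to {RETENTION_RANGE[1]:.0%}")
```

The parameter counts give a ratio of about 3.06, consistent with the stated 3x reduction in model size.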

WHY NOW

3D Vision-Language Models moved forward this cycle; last verified May 2026. Public score 5.0/10. Implementation evidence is present through a linked repository.

Opportunity summary

Score 5.0

Pain A knowledge distillation framework creates a lightweight vision-language model for efficient 3D spatial reasoning with reduced computational costs.

Evidence 0 refs | 0 sources | 0% coverage

Blocker no shell-level blocker reported

Competitive landscape

A knowledge distillation framework creates a lightweight vision-language model for efficient 3D spatial reasoning with reduced computational costs.

Segment

3D Vision-Language Models

Adoption evidence

Public code linked for build inspection

Commercial read

5.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

arXiv https://arxiv.org/abs/2605.09719

Published 2026-05-10 19:38 UTC

Code https://github.com/alaaasfour/distilled-LLaVA3D-with-CoT
